Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic but flexible query language #308

Merged
merged 45 commits into from
Feb 28, 2024
Merged

Add basic but flexible query language #308

merged 45 commits into from
Feb 28, 2024

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Feb 21, 2024

An original target fake query which has various features was

((jim AND NOT haxby AND "important\" paper") OR ds_id:~"^000[3-9]..$" OR url:"example.com") AND metadata:non AND metadata[ex1,ex2]:"specific data" AND metadata[extractor2]:data

and this PR implements most of them -- check the modal window description of supported syntax:

image

  • quite flexible implementation so we could add/replace fields we are searching on
  • can search within specified metadata extractors
    • TODO (not here): search within JSON records is not part of this PR -- but likely would just allow to pass JSONPath for the most flexible functionality
    • TODO (not here): expose via API call. Expose current search via API #269
  • has help on syntax supporting boolean operations etc. Closes improve search: use AND across search terms #264
  • replacing original /search

…eries

Using  lark  we establish grammar for the query, so we could do something like

    ((jim AND NOT haxby AND "important\" paper") OR ds_id:~"^000[3-9]..$" OR url:"example.com") AND metadata:non AND metadata[ex1,ex2]:"specific data" AND metadata[extractor2]:data

so composition of booleans and queries within specific extractors etc
Also now committed with fixup pre-commit so isort kicked in etc
@yarikoptic yarikoptic added the enhancement New feature or request label Feb 21, 2024
Copy link

codecov bot commented Feb 22, 2024

Codecov Report

Attention: Patch coverage is 97.84173% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 98.67%. Comparing base (2d07cb8) to head (209d424).
Report is 59 commits behind head on master.

Files Patch % Lines
datalad_registry/search.py 96.59% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
- Coverage   98.73%   98.67%   -0.06%     
==========================================
  Files          50       52       +2     
  Lines        2368     2492     +124     
==========================================
+ Hits         2338     2459     +121     
- Misses         30       33       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jwodder jwodder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know lark, so I just proofread the grammar in the syntax description.

datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
datalad_registry/templates/overview.html Outdated Show resolved Hide resolved
setup.cfg Show resolved Hide resolved
Copy link
Collaborator

@candleindark candleindark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to refer to this "search" scheme as "query" in the web API, I think it would be more consistent if we replace "search" in the naming and references related to this scheme. We can rename search.py and test_search.py into query.py and test_search.py.

datalad_registry/search.py Outdated Show resolved Hide resolved
@yarikoptic yarikoptic marked this pull request as ready for review February 22, 2024 21:57
@yarikoptic
Copy link
Member Author

I saw that coverage reported that not_expr is not covered, started to add the tests and ran into the failing test cases:

        ("NOT handbook", [1, 2, 4]),
        ("datalad AND (NOT handbook)", [2]),

which I cannot figure out why our NOT not working as it should -- just returns empty lists :-/
it happens if we do not specify which db entry to operate on, so we fall into that or_ across all known fields in _get_str_search https://github.com/datalad/datalad-registry/pull/308/files#diff-11394d8200a55a0ad0089b9d62dabe6194d7a14e711147c2b8d63a795b5aa2dbR223

    def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
        value = self._get_str_value(arg)
        return or_(*[f(value) for f in known_fields.values()])

any ideas would be appreciated or otherwise might be worth disabling NOT for such cases or altogether for now.

@yarikoptic
Copy link
Member Author

interesting... toying around with dandi datasets, and NOT:
image

it points to two dandisets where dandi metadata is loaded, could potentially have that AccessRequirements but does not! And if I select it to be limited metadata: field, then I get also all the private ones for which I do not even have any metadata loaded:
image

@candleindark
Copy link
Collaborator

    def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
        value = self._get_str_value(arg)
        return or_(*[f(value) for f in known_fields.values()])

Based on the intent of what you want to search, it seems that the the lo

interesting... toying around with dandi datasets, and NOT: image

it points to two dandisets where dandi metadata is loaded, could potentially have that AccessRequirements but does not! And if I select it to be limited metadata: field, then I get also all the private ones for which I do not even have any metadata loaded: image

For some reason, I can't access your posted images.
Screenshot 2024-02-23 at 5 53 52 PM

@yarikoptic
Copy link
Member Author

a user can encounter a situation that a tooltip text can obstruct the view and function of buttons.

yeah, I ran into it often too. Added 1em margin for spaceing it out now.

@yarikoptic
Copy link
Member Author

I am leaning to actually generalize DB-specific filter into search here. Pushed that unification to "search" in the web UI/parameters (query)

If this is the end goal for this feature, can we route this feature to /api/v2/dataset-urls/search when we implement the API endpoint for this feature instead of /api/v2/dataset-urls/query which you suggested in our previous zoom session?

yes, we can name it /search as it is even present in #269.

@yarikoptic
Copy link
Member Author

@candleindark I "merged" your PR. all good for a merge/deploy? ;)

yarikoptic and others added 5 commits February 28, 2024 10:41
These names are more informative. Additionally,
doc strings for these tests are also updated.
These names are appropriate now because of the
context provided the containing class
known_fields_RepoUrl_1to1 = ["url", "ds_id", "head", "head_describe", "tags"]


def get_ilike_search(model, field: str, value: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_ilike_search(model, field: str, value: str):
def _get_ilike_expr(model, field: str, value: str):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

return getattr(model, field).ilike(_escape_for_ilike(value), escape=escape)


def get_metadata_ilike_search(value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_metadata_ilike_search(value):
def _get_metadata_ilike_expr(value):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

)


def get_branches_ilike_search(value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_branches_ilike_search(value):
def _get_branches_ilike_expr(value):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

super().__init__(*args, **kwargs)
# Additional initialization here if needed

def or_search(self, *args) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def or_search(self, *args) -> ColumnElement[bool]:
def or_exp(self, *args) -> ColumnElement[bool]:

This name is more inline with SQLAlchemy's convention.

# args will be a list of the arguments to the OR operation
return or_(*args)

def and_search(self, *args) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def and_search(self, *args) -> ColumnElement[bool]:
def and_exp(self, *args) -> ColumnElement[bool]:

This name is more inline with SQLAlchemy's convention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those must correspond to the names of lark grammar -- defined above at

https://github.com/datalad/datalad-registry/pull/308/files#diff-11394d8200a55a0ad0089b9d62dabe6194d7a14e711147c2b8d63a795b5aa2dbR82

I don't mind changing here and there but then go ahead change and send PR so we make sure that it works

# args will be a single-element list containing the NOT argument
return not_(arg)

def get_field_select_search(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_field_select_search(
def get_field_select_expr(

This name is more inline with SQLAlchemy's convention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and below -- I don't mind but make a complete change/refactoring and send as a PR or a commit making sure that all renamed/works.

raise ValueError(f"Unexpected number of args: {len(args)} in {args}")

if field_l.data == "field_select":
search = partial(self.get_field_select_search, *field_l.children)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
search = partial(self.get_field_select_search, *field_l.children)
get_query_exp = partial(self.get_field_select_search, *field_l.children)

This name is more inline with SQLAlchemy's convention.

else:
raise TypeError(arg.type)

def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
def _get_str_query_exp(self, arg: Token) -> ColumnElement[bool]:

This name is more consistent the SQLAlChemy convention.

Comment on lines +185 to +188
if not extractors:
raise GrammarValueError(
f"No extractors were specified in {metadata_extractors_l}"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but couldn't come up with a test case for these lines. The coverage of these lines is the only thing I want before merging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are not even reachable per se since lark grammar wouldn't allow for such a case. So it is more of an AssertionError here in particular. But I take it as a defensive coding to code under assumption we might get there. I would have just added the skip for coverage here.

candleindark and others added 3 commits February 28, 2024 12:36
The class is unneeded for all tests in the containing files
are for testing the search functionality
@candleindark
Copy link
Collaborator

As @yarikoptic suggested, let's merge this PR now. The remaining "issues" don't effect the functionalities. They are to be addressed in #320.

@candleindark candleindark merged commit 720d08a into master Feb 28, 2024
7 of 9 checks passed
@candleindark candleindark deleted the enh-search branch February 28, 2024 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

improve search: use AND across search terms
3 participants