Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic but flexible query language #308

Merged
merged 45 commits into from
Feb 28, 2024
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
8cac1ae
ENH(WiP): add "search" language to be able to create more specific qu…
yarikoptic Feb 20, 2024
b3079b6
WiP: started to hang on the parsed content the sqlalchemy queries
yarikoptic Feb 21, 2024
be5eb9a
ENH: allow for implicit AND, initial construct for specifying metadat…
yarikoptic Feb 21, 2024
4740582
Fix name of metalad_studyminimeta extactor name
yarikoptic Feb 21, 2024
1df3800
Make search on branches work, add basic tests of queries to perform a…
yarikoptic Feb 21, 2024
7ada250
progress isort forward in pre-commit
yarikoptic Feb 21, 2024
2751ba7
More testing, various minor tune ups, testing case insensitive
yarikoptic Feb 21, 2024
f015def
Replace hardcoded search with the flexible one
yarikoptic Feb 21, 2024
395b707
ENH: Add a modal window describing the syntax for new filter
yarikoptic Feb 21, 2024
e2671f9
ENH: add codespell to pre-commit since we otherwise let typos sneak in
yarikoptic Feb 21, 2024
62d884c
Add few basic type annotations and ignore 2 in search for now
yarikoptic Feb 21, 2024
b58bd9d
Add lark==1.1.9 to requirements.txt
yarikoptic Feb 22, 2024
4b40b5f
Apply suggestions from code review: primarily wording tuneups of the …
yarikoptic Feb 22, 2024
749169f
Add logging for search + add test cases for NOT -- two still fail :-/
yarikoptic Feb 23, 2024
44ddcd9
Reformat code
candleindark Feb 24, 2024
795e966
Handle `null` column values correctly in `ilike` queries
candleindark Feb 24, 2024
5d0a868
Handle `null` column values in `ilike` search funcs instead
candleindark Feb 24, 2024
2b5d4c4
One more test explicitly :?
yarikoptic Feb 25, 2024
d360f7c
BF: switch to Earley parser, provide "parse_query" interface, separat…
yarikoptic Feb 26, 2024
18eec6f
ENH: a better pragmatic example on composition of query statements
yarikoptic Feb 26, 2024
01336a8
support for copy search query examples to clipboard. Could be made mo…
yarikoptic Feb 26, 2024
5f1c1a3
Add a little more to testing of exceptions
yarikoptic Feb 26, 2024
0fe9142
Add # pragma: no cover for statements which are pretty much impossi…
yarikoptic Feb 26, 2024
81b61f0
Remove reference to a font which was an earlier attempt to add a clip…
yarikoptic Feb 26, 2024
926b47c
RF: unify to "Search query" and also have button for "Clear" after "S…
yarikoptic Feb 26, 2024
564b68a
Add 1 em margin on top of table to let tips not occlude buttons
yarikoptic Feb 27, 2024
8db70a3
Fix the test for renamed filter -> query
yarikoptic Feb 27, 2024
9f5741d
Run testing for all PRs
yarikoptic Feb 27, 2024
f2fa21c
ENH: add a tricky case for searching for a string within JSON record
yarikoptic Feb 28, 2024
27a656e
Rename test since it is no longer just smoke
yarikoptic Feb 28, 2024
31b5803
Remove space in example for restrictive metadata search
candleindark Feb 27, 2024
be2ffbf
Add explanation for the last query syntax example
candleindark Feb 27, 2024
2e79584
Replace list with iterator
candleindark Feb 27, 2024
19fe126
Update `overview` endpoint to use modern SQLAlchemy syntax
candleindark Feb 27, 2024
84c8699
Rename variable
candleindark Feb 27, 2024
bb50a71
Update `test_search_complex_smoke` to use modern syntax of SQLAlchemy
candleindark Feb 27, 2024
014aa85
Bring back space in the search "JSON string" example
yarikoptic Feb 28, 2024
c38fdba
Improve tests of overview -- test for ERROR to be given for a wrong q…
yarikoptic Feb 28, 2024
f2ee385
Rename tests for the overview page
candleindark Feb 28, 2024
dfaa327
Group tests for the search function into a class
candleindark Feb 28, 2024
8e85071
Rename the tests for the search function
candleindark Feb 28, 2024
130cb92
Provide doc strings for tests of search functionality
candleindark Feb 28, 2024
9b4fb25
Move tests for search out of a class
candleindark Feb 28, 2024
1bdd6a3
Specify purpose of the `test_search.py` module in comment
candleindark Feb 28, 2024
209d424
Merge pull request #319 from candleindark/enh-search-review
candleindark Feb 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ repos:
- id: black

- repo: https://github.com/PyCQA/isort
rev: 5.11.4
rev: 5.13.2
hooks:
- id: isort

Expand All @@ -28,5 +28,13 @@ repos:
- flake8-builtins
- flake8-unused-arguments

- repo: https://github.com/codespell-project/codespell
rev: v2.2.6
hooks:
- id: codespell
additional_dependencies:
- tomli


default_language_version:
python: python3.11
42 changes: 12 additions & 30 deletions datalad_registry/overview.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,10 @@
import logging

from flask import Blueprint, render_template, request
from sqlalchemy import Text, nullslast, or_
from sqlalchemy import nullslast

from datalad_registry.models import RepoUrl, URLMetadata, db
from datalad_registry.models import RepoUrl, db
from datalad_registry.search import parser

lgr = logging.getLogger(__name__)
bp = Blueprint("overview", __name__, url_prefix="/overview")
Expand Down Expand Up @@ -35,36 +36,16 @@

# Apply filter if provided
filter = request.args.get("filter", None, type=str)
filter_error = None
if filter:
lgr.debug("Filter URLs by '%s'", filter)

escape = "\\"
escaped_filter = (
filter.replace(escape, escape + escape)
.replace("%", escape + "%")
.replace("_", escape + "_")
)
pattern = f"%{escaped_filter}%"

r = r.filter(
or_(
RepoUrl.url.ilike(pattern, escape=escape),
RepoUrl.ds_id.ilike(pattern, escape=escape),
RepoUrl.head.ilike(pattern, escape=escape),
RepoUrl.head_describe.ilike(pattern, escape=escape),
RepoUrl.branches.cast(Text).ilike(pattern, escape=escape),
RepoUrl.tags.ilike(pattern, escape=escape),
RepoUrl.metadata_.any(
or_(
URLMetadata.extractor_name.ilike(pattern, escape=escape),
# search the entire JSON column as text
URLMetadata.extracted_metadata.cast(Text).ilike(
pattern, escape=escape
),
)
),
)
)
try:
filter_query = parser.parse(filter)
except Exception as e:
filter_error = str(e)

Check warning on line 45 in datalad_registry/overview.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/overview.py#L44-L45

Added lines #L44 - L45 were not covered by tests
# filter = None
else:
r = r.filter(filter_query)

# Sort
r = r.group_by(RepoUrl)
Expand All @@ -83,4 +64,5 @@
pagination=pagination,
sort_by=sort_by,
url_filter=filter,
url_filter_error=filter_error,
)
232 changes: 232 additions & 0 deletions datalad_registry/search.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,232 @@
"""Functionality for support of ad-hoc search "language".
"""

from functools import partial

from lark import Lark, Token, Transformer, Tree, v_args
from sqlalchemy import CollectionAggregate, ColumnElement, Text, and_, not_, or_

from .models import RepoUrl, URLMetadata

#
# Constructs for the case-insensitive search via ilike
#
escape = "\\"


def _escape_for_ilike(value: str) -> str:
escaped_value = (
value.replace(escape, escape + escape)
.replace("%", escape + "%")
.replace("_", escape + "_")
)
return f"%{escaped_value}%"


# ATM most of our search targets are 1:1 mapping to DB schema.
# List explicitly so we could reuse in grammar and error messages etc
yarikoptic marked this conversation as resolved.
Show resolved Hide resolved
# without duplication
known_fields_RepoUrl_1to1 = ["url", "ds_id", "head", "head_describe", "tags"]


def get_ilike_search(model, field: str, value: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_ilike_search(model, field: str, value: str):
def _get_ilike_expr(model, field: str, value: str):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

return getattr(model, field).ilike(_escape_for_ilike(value), escape=escape)


def get_metadata_ilike_search(value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_metadata_ilike_search(value):
def _get_metadata_ilike_expr(value):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

escaped_value = _escape_for_ilike(value)
return RepoUrl.metadata_.any(
or_(
URLMetadata.extractor_name.ilike(escaped_value, escape=escape),
URLMetadata.extracted_metadata.cast(Text).ilike(
escaped_value, escape=escape
),
)
)


def get_branches_ilike_search(value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_branches_ilike_search(value):
def _get_branches_ilike_expr(value):

This name is more consistent with SQLAlchemy's convention. I also feel that this function depends heavily on the context of the containing file and should be made private.

return RepoUrl.branches.cast(Text).ilike(_escape_for_ilike(value), escape=escape)


# mapping to schema of fields
known_fields = {
_: partial(get_ilike_search, RepoUrl, _) for _ in known_fields_RepoUrl_1to1
}
known_fields["metadata"] = get_metadata_ilike_search # type: ignore
known_fields["branches"] = get_branches_ilike_search # type: ignore

# TODO: add "metadata_extractor"?
known_fields_str = "|".join(f'"{_}"' for _ in known_fields)

#
# This is a lark grammar for the search query language.
# For interactive testing, you can use the Lark parser at
# https://www.lark-parser.org/ide/ .
#
grammar = rf"""
?start: search
?search: orand_exp
?orand_exp: not_exp
| orand_exp ("OR" not_exp)+ -> or_search
| orand_exp (("AND"|) not_exp)+ -> and_search
?not_exp: primary
| "NOT" primary -> not_expr

?primary: "(" search ")" // -> group
| secondary

// if no op, so just `:` it would be identical to `:?` (present within)
?secondary: (field_select | unknown_field_error) ":" op? (quoted_string | WORD) -> field_matched
| WORD -> search_word
| quoted_string -> search_string

// later to subselect metadata
// ?field_select: (field | metadata_field "[" metadata_extractors "]") value_path?
// ?value_path: ("[" (WORD|quoted_string) "]")+
?field_select: field
| metadata_field "[" metadata_extractors "]" -> field_select
?metadata_extractors: WORD (","WORD)*
!field: {known_fields_str}
!metadata_field: "metadata"
// To make it easier for user to be informed about an unknown field being \
// specified in the field_matching
!unknown_field_error: WORD

// Let's make it into a rule so we could have transformer for it
// and unescape escaped \s
?quoted_string: ESCAPED_STRING // /"((?:\.|[^\"])*)"/
WORD: /[-_.*?+\w]+/

// :operations to aim for
!op: "?" // string present somewhere within
|"=" // equal to the value as a string
|">" // greater to the value as a string
|"~" // regex search, should be able to use ^ and $ to anchor if desired
// TODO:
// - come up with transformers, e.g. (int) or `!int` to precede :? etc which
// should change how operations like =, >
// - operator "in" might come handy
// - also to make it case-sensitive since to make most useful - by default are\
// case insensitive!

%import common.ESCAPED_STRING
%import common.WS
%ignore WS
"""


def _dump_grammar():
# Could be used for debugging on the web -- the schema with those fields embedded
with open("/tmp/grammar.lark", "w") as f:
f.write(grammar)

Check warning on line 122 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L121-L122

Added lines #L121 - L122 were not covered by tests


# _dump_grammar()


@v_args(inline=True) # Affects the signatures of the methods
class SearchQueryTransformer(Transformer):
"""Convert the parsed search query into SQLAlchemy expressions."""

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Additional initialization here if needed

def or_search(self, *args) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def or_search(self, *args) -> ColumnElement[bool]:
def or_exp(self, *args) -> ColumnElement[bool]:

This name is more inline with SQLAlchemy's convention.

# args will be a list of the arguments to the OR operation
return or_(*args)

def and_search(self, *args) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def and_search(self, *args) -> ColumnElement[bool]:
def and_exp(self, *args) -> ColumnElement[bool]:

This name is more inline with SQLAlchemy's convention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those must correspond to the names of lark grammar -- defined above at

https://github.com/datalad/datalad-registry/pull/308/files#diff-11394d8200a55a0ad0089b9d62dabe6194d7a14e711147c2b8d63a795b5aa2dbR82

I don't mind changing here and there but then go ahead change and send PR so we make sure that it works

# args will be a list of the arguments to the AND operation
return and_(*args)

def not_expr(self, arg) -> ColumnElement[bool]:
# args will be a single-element list containing the NOT argument
return not_(arg)

Check warning on line 146 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L146

Added line #L146 was not covered by tests

def get_field_select_search(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_field_select_search(
def get_field_select_expr(

This name is more inline with SQLAlchemy's convention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and below -- I don't mind but make a complete change/refactoring and send as a PR or a commit making sure that all renamed/works.

self, metadata_field_l, metadata_extractors_l, value
) -> CollectionAggregate[bool]:
assert metadata_field_l.data.value == "metadata_field"
if isinstance(metadata_extractors_l, Token):
extractors = [metadata_extractors_l.value]
else: # was more than one
assert isinstance(metadata_extractors_l, Tree)
assert metadata_extractors_l.data.value == "metadata_extractors" # type: ignore
extractors = list(map(self._get_str_value, metadata_extractors_l.children)) # type: ignore
if not extractors:
raise ValueError(f"No extractors were specified in {metadata_extractors_l}")

Check warning on line 159 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L159

Added line #L159 was not covered by tests
# TODO: might want to provide a dedicated simpler one
# elif len(extractors) == 1:
# raise NotImplementedError(f"Single extractor {extractors}")
else:
# ??? it seems we do not have search target value here, so we are to
# return the function to search with but we can't since here we already
# need to know ilike vs exact match
return RepoUrl.metadata_.any(
and_(
or_(*(URLMetadata.extractor_name == _ for _ in extractors)),
# search the entire JSON column as text
URLMetadata.extracted_metadata.cast(Text).icontains(
value, autoescape=True
),
)
)

def field_matched(self, *args):
# Example for handling field-specific matches. You'll need to expand this
# logic based on your specific fields and operations.
if len(args) == 3:
field_l, op_l, value_l = args
assert op_l.data.value == "op"
op = op_l.children[0].value

Check warning on line 183 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L181-L183

Added lines #L181 - L183 were not covered by tests
elif len(args) == 2:
# ":" is identical to ":?
field_l, value_l = args
op = "?"
else:
raise ValueError(f"Unexpected number of args: {len(args)} in {args}")

Check warning on line 189 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L189

Added line #L189 was not covered by tests

if field_l.data == "field_select":
search = partial(self.get_field_select_search, *field_l.children)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
search = partial(self.get_field_select_search, *field_l.children)
get_query_exp = partial(self.get_field_select_search, *field_l.children)

This name is more inline with SQLAlchemy's convention.

elif field_l.data.value == "field":
assert len(field_l.children) == 1 # TODO: handle multiple??
field = field_l.children[0].value
search = known_fields[field]
else:
raise ValueError(f"Unknown field type: {field_l}")

Check warning on line 198 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L198

Added line #L198 was not covered by tests

if op == "?":
return search(self._get_str_value(value_l))
else:
raise NotImplementedError(f"Operation {op} is not implemented")

Check warning on line 203 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L203

Added line #L203 was not covered by tests

def _get_str_value(self, arg: Token) -> str:
assert isinstance(arg, Token)
value = arg.value
if arg.type == "WORD":
return value
elif arg.type == "ESCAPED_STRING":
return value[1:-1].replace(r"\"", '"')
else:
raise TypeError(arg.type)

Check warning on line 213 in datalad_registry/search.py

View check run for this annotation

Codecov / codecov/patch

datalad_registry/search.py#L213

Added line #L213 was not covered by tests

def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_str_search(self, arg: Token) -> ColumnElement[bool]:
def _get_str_query_exp(self, arg: Token) -> ColumnElement[bool]:

This name is more consistent the SQLAlChemy convention.

value = self._get_str_value(arg)
return or_(*[f(value) for f in known_fields.values()])

search_word = _get_str_search
search_string = _get_str_search

def unknown_field_error(self, arg):
# Handle unknown fields
raise ValueError(
f"Unknown field: '{arg}'. Known are: {', '.join(known_fields)}"
)


# Assuming `grammar` is your grammar definition string
parser = Lark(grammar, parser="lalr", transformer=SearchQueryTransformer())
# Example:
# r = session.query(RepoUrl).filter(parser.parse("url:example OR ds_id:handbook"))
Loading