Search MVP default search method with IDN data #153

Open
jamiefeiss opened this issue Sep 13, 2023 · 7 comments

@jamiefeiss
Collaborator

Testing the "default" regex search method takes over 30s against the IDN triplestore for the following query:

http://localhost:8000/search?term=open&method=default&limit=10&focus-to-filter[rdf:type]=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept&focus-to-filter[skos:inScheme]=https%3A%2F%2Flinked.data.gov.au%2Fdef%2Fdata-access-rights

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?weight ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri .
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?weight ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      FILTER (
        LCASE(?match) = "open" ||
        REGEX(?match, "^open", "i") ||
        REGEX(?match, "\bopen\b", "i") ||
        REGEX(?match, "open", "i")
      )
      BIND(
        IF(LCASE(?match) = "open", 10,
          IF(REGEX(?match, "^open", "i"), 7,
            IF(REGEX(?match, "\bopen\b", "i"), 5,
              IF(REGEX(?match, "open", "i"), 3, 0)
            )
          )
        ) AS ?weight
      )
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match), STR(?weight))))) AS ?hashID)
    }
    LIMIT 10
  }
  {
    ?search_result_uri ?p ?o1 .
    OPTIONAL {
      FILTER(ISBLANK(?o1))
      ?o1 ?p2 ?o2 .
      OPTIONAL {
        FILTER(ISBLANK(?o2))
        ?o2 ?p3 ?o3 .
      }
    }
  }
  UNION {
  }
}
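
A side note on the word-boundary branch, in case the query text above is exactly what reaches the triplestore: in a SPARQL string literal, `\b` is the escape for a backspace character, so a regex word boundary has to be written as `\\b`. Below is a minimal Python sketch of quoting a regex pattern as a SPARQL literal. `sparql_literal` is a hypothetical helper for illustration, not a Prez function.

```python
def sparql_literal(text: str) -> str:
    """Quote text as a double-quoted SPARQL string literal.

    Hypothetical helper: escapes backslashes first, then quotes and newlines,
    so a regex such as \\bopen\\b survives SPARQL string parsing intact.
    """
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


# The raw Python string holds the regex as the regex engine should see it.
print(sparql_literal(r"\bopen\b"))  # "\\bopen\\b"
```

With a helper like this, the third branch would be emitted as `REGEX(?match, "\\bopen\\b", "i")`.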
@jamiefeiss
Collaborator Author

jamiefeiss commented Sep 13, 2023

Testing with just the inner SELECT query, the main issue seems to be that it searches across all triples. The weighted regex is also significantly faster (0.035s vs 29.189s in Fuseki) if we structure it like the "skosWeighted" search method - https://github.com/RDFLib/prez/blob/main/prez/reference_data/search_methods/search_skos_weighted.ttl . See below:

SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight) ?hashID
WHERE {
  ?search_result_uri ?predicate ?match .
  ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
  ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .

  BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)

  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
}
GROUP BY ?search_result_uri ?predicate ?match ?hashID
ORDER BY DESC(?weight)
LIMIT 10

Since we'll probably only be searching across labels & descriptions, and returning objects that have endpoints in Prez, we could restrict the predicates that are matched and the base classes of the results to further optimise the query.
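
As a rough illustration of the predicate restriction suggested above, here is a Python sketch that generates the weighted SELECT with a `VALUES ?predicate` clause. Everything named here is an assumption for the sake of the example: `build_search_select` and `SEARCH_PREDICATES` are hypothetical, the predicate list is a guess at "labels & descriptions", and the type/scheme filters and `?hashID` are left out to keep it short.

```python
# Assumed label/description predicates; the real list would come from Prez config.
SEARCH_PREDICATES = [
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2004/02/skos/core#definition",
    "http://purl.org/dc/terms/description",
]


def build_search_select(term: str, predicates: list[str], limit: int = 10) -> str:
    """Build the weighted UNION query, matching only the given predicates."""
    # Sketch-level guard: only plain alphanumeric terms, so no escaping is needed.
    if not term.isalnum():
        raise ValueError("this sketch only accepts alphanumeric search terms")
    values = " ".join(f"<{iri}>" for iri in predicates)
    branches = [
        (50, f"^{term}$"),  # whole literal equals the term
        (20, f"^{term}"),   # literal starts with the term
        (10, term),         # term appears anywhere
    ]
    union = " UNION ".join(
        "{\n"
        "    ?search_result_uri ?predicate ?match .\n"
        f"    BIND ({weight} AS ?w)\n"
        f'    FILTER (REGEX(?match, "{pattern}", "i"))\n'
        "  }"
        for weight, pattern in branches
    )
    return (
        "SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight)\n"
        "WHERE {\n"
        f"  VALUES ?predicate {{ {values} }}\n"
        f"  {union}\n"
        "}\n"
        "GROUP BY ?search_result_uri ?predicate ?match\n"
        "ORDER BY DESC(?weight)\n"
        f"LIMIT {limit}"
    )


print(build_search_select("open", SEARCH_PREDICATES))
```

Binding `?predicate` up front means each UNION branch only sees literals on those predicates, rather than every triple of every matching subject.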

@recalcitrantsupplant
Collaborator

Looks like it's the query structure. Let's see if we can add the CONSTRUCT back in around your performant REGEX query above.

For context, here is the FTS query as well.

http://idn-fuseki-lb-155137521.ap-southeast-2.elb.amazonaws.com:3030/#/dataset/idn/query?query=PREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0APREFIX%20ex%3A%20%3Chttp%3A%2F%2Fwww.example.org%2Fresources%23%3E%0APREFIX%20text%3A%20%3Chttp%3A%2F%2Fjena.apache.org%2Ftext%23%3E%0APREFIX%20sdo%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20%3FMatchURI%20%28COALESCE%28%3Fprop_label%2C%20%3FMatchProp%29%20AS%20%3FMatchProperty%29%20%3FMatchTerm%20%3FSearchTerm%0A%7B%0A%20%20VALUES%20%3FSearchTerm%20%7B%22%2Aopen%2A%22%0A%20%20%7D%0A%20%20%28%3FMatchURI%20%3FWeight%20%3FMatchTerm%20%3Fgraph%20%3FMatchProp%29%20text%3Aquery%20%28%20ex%3ANameProps%20%3FSearchTerm%29%20.%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchURI%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fmatch_label%20.%0A%20%20%7D%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchProp%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fprop_label%20.%0A%20%20%7D%0A%7D

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex: <http://www.example.org/resources#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sdo: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?MatchURI (COALESCE(?prop_label, ?MatchProp) AS ?MatchProperty) ?MatchTerm ?SearchTerm
{
  VALUES ?SearchTerm { "*open*" }
  (?MatchURI ?Weight ?MatchTerm ?graph ?MatchProp) text:query ( ex:NameProps ?SearchTerm) .

  OPTIONAL {
    ?MatchURI skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?match_label .
  }

  OPTIONAL {
    ?MatchProp skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?prop_label .
  }
}

@recalcitrantsupplant
Collaborator

How does this look?

  • uses UNION structure as above from Jamie to improve performance
  • adds back in properties/blank nodes for objects
  • provides different matches rather than aggregating as per the original query - I'm not too fussed either way - @hjohns @jamiefeiss any opinions on adding weights vs providing multiple search results, one per weight?
PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?w ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri . 
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .     
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?w ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)
      {
        ?search_result_uri ?predicate ?match .
        BIND (50 AS ?w)
        FILTER (REGEX(?match, "^open$", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (20 AS ?w)
        FILTER (REGEX(?match, "^open", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (10 AS ?w)
        FILTER (REGEX(?match, "open", "i"))
      }
    }
    GROUP BY ?search_result_uri ?predicate ?match ?hashID ?w 
    LIMIT 10
  }
  ?search_result_uri ?p ?o1 .
  OPTIONAL {
    FILTER(ISBLANK(?o1))
    ?o1 ?p2 ?o2 .
    OPTIONAL {
      FILTER(ISBLANK(?o2))
      ?o2 ?p3 ?o3 .
    }
  }  
}

@jamiefeiss
Collaborator Author

jamiefeiss commented Sep 13, 2023

Looks good, nice and fast at about 0.035s.

Not aggregating just means you'll get duplicate results in the case where a result satisfies multiple matches.
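
To make the trade-off concrete, here is a small Python sketch of what the two options return for a label that is exactly "Open". The rows mirror the three UNION branches and their weights; the URI is made up and `aggregate` is a hypothetical helper, not Prez code.

```python
from collections import defaultdict

# One row per UNION branch that matched: (search_result_uri, predicate, match, w).
# A literal "Open" satisfies all three branches, so it appears three times.
rows = [
    ("https://example.org/concept/open", "skos:prefLabel", "Open", 50),  # ^open$
    ("https://example.org/concept/open", "skos:prefLabel", "Open", 20),  # ^open
    ("https://example.org/concept/open", "skos:prefLabel", "Open", 10),  # open
]


def aggregate(rows):
    """Sum weights per (uri, predicate, match), like SUM(?w) with GROUP BY."""
    totals = defaultdict(int)
    for uri, predicate, match, w in rows:
        totals[(uri, predicate, match)] += w
    # Highest total first, like ORDER BY DESC(?weight).
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(len(rows))        # 3 rows without aggregation
print(aggregate(rows))  # 1 row with weight 80
```

Two things follow from this, if I'm reading the query correctly: without aggregation each duplicate row uses up one of the `LIMIT 10` slots, and because `?hashID` does not include `?w`, the three rows share one hash, so the CONSTRUCT would attach three `prez:searchResultWeight` values to the same result node.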

What do you think of restricting the matched predicate to labels & descriptions? Description matching could be worth less too. Also what do you think of restricting the base class to classes Prez supports?

@recalcitrantsupplant
Collaborator

recalcitrantsupplant commented Sep 13, 2023

> What do you think of restricting the matched predicate to labels & descriptions?

This would be a closed profile with no properties defined. You'll then get labels/descriptions when the annotations are added. Profile changes are coming soon.

> Description matching could be worth less too.

Sounds good - any issue adding LCASE back in too for "exact" match?

      {
        ?search_result_uri ?predicate ?match .
        BIND (100 AS ?w)
        FILTER (LCASE(?match) = "open")
      } 
      UNION
...

> Also what do you think of restricting the base class to classes Prez supports?

Ideally, I think Prez could display whatever information is available about whatever object is found, perhaps on a generic page if there isn't a suitable endpoint.

@recalcitrantsupplant
Collaborator

David to:

  • allow a list of filter values in the API and treat these as a VALUES clause, e.g. for searching across multiple vocabularies
  • add predicates to the query - REGEX performance is bad even on smaller datasets when there isn't some filtering.
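
A minimal sketch of the first item, assuming the API has already parsed the filter into a predicate IRI and a list of value IRIs (how the list is passed in the URL is not settled here). `values_filter` and the `?filter_value` variable name are hypothetical, and the second vocabulary IRI is made up.

```python
def values_filter(predicate_iri: str, value_iris: list[str], var: str = "filter_value") -> str:
    """Render one filter as a VALUES clause plus the triple pattern that uses it."""
    if not value_iris:
        raise ValueError("at least one filter value is required")
    for iri in [predicate_iri, *value_iris]:
        # Crude guard for this sketch: refuse characters that would break out of <...>.
        if any(ch in iri for ch in '<>" \n'):
            raise ValueError(f"not a usable IRI: {iri!r}")
    values = " ".join(f"<{iri}>" for iri in value_iris)
    return (
        f"VALUES ?{var} {{ {values} }}\n"
        f"?search_result_uri <{predicate_iri}> ?{var} ."
    )


print(values_filter(
    "http://www.w3.org/2004/02/skos/core#inScheme",
    [
        "https://linked.data.gov.au/def/data-access-rights",
        "https://example.org/def/another-vocab",  # made-up second vocabulary
    ],
))
```

The generated pattern would replace the single hard-coded `skos:inScheme` triple in the inner SELECT, so a concept in any of the listed vocabularies can match.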

@recalcitrantsupplant
Collaborator

Resolved in #149
