Search MVP default search method with IDN data #153

jamiefeiss · 2023-09-13T04:59:36Z

Testing the "default" regex search method takes over 30s against the IDN triplestore for the following query:

http://localhost:8000/search?term=open&method=default&limit=10&focus-to-filter[rdf:type]=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept&focus-to-filter[skos:inScheme]=https%3A%2F%2Flinked.data.gov.au%2Fdef%2Fdata-access-rights

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?weight ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri .
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?weight ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      FILTER (
        LCASE(?match) = "open" ||
        REGEX(?match, "^open", "i") ||
        REGEX(?match, "\bopen\b", "i") ||
        REGEX(?match, "open", "i")
      )
      BIND(
        IF(LCASE(?match) = "open", 10,
          IF(REGEX(?match, "^open", "i"), 7,
            IF(REGEX(?match, "\bopen\b", "i"), 5,
              IF(REGEX(?match, "open", "i"), 3, 0)
            )
          )
        ) AS ?weight
      )
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match), STR(?weight))))) AS ?hashID)
    }
    LIMIT 10
  }
  {
    ?search_result_uri ?p ?o1 .
    OPTIONAL {
      FILTER(ISBLANK(?o1))
      ?o1 ?p2 ?o2 .
      OPTIONAL {
        FILTER(ISBLANK(?o2))
        ?o2 ?p3 ?o3 .
      }
    }
  }
  UNION {
  }
}


jamiefeiss · 2023-09-13T06:00:55Z

Testing with just the inner SELECT query now, the main issue seems to be that this query searches across all triples. Also, this weighted regex is significantly faster (0.035s vs 29.189s in Fuseki) if we implement something similar to the "skosWeighted" search method - https://github.com/RDFLib/prez/blob/main/prez/reference_data/search_methods/search_skos_weighted.ttl . See below:

SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight) ?hashID
WHERE {
	?search_result_uri ?predicate ?match .
	?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
	?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .

	BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)
  
  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
} GROUP BY ?search_result_uri ?predicate ?match ?hashID ORDER BY DESC(?weight) LIMIT 10

Since we'll probably only be searching across labels & descriptions, and returning objects that have endpoints in Prez, we could restrict the predicates that are matched and the base classes of the results to further optimise the query.

recalcitrantsupplant · 2023-09-13T06:07:51Z

Looks like it's the query structure. Let's see if we can add the CONSTRUCT back in to your performant REGEX query above.

For context, the full-text search (FTS) query is below.

http://idn-fuseki-lb-155137521.ap-southeast-2.elb.amazonaws.com:3030/#/dataset/idn/query?query=PREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0APREFIX%20ex%3A%20%3Chttp%3A%2F%2Fwww.example.org%2Fresources%23%3E%0APREFIX%20text%3A%20%3Chttp%3A%2F%2Fjena.apache.org%2Ftext%23%3E%0APREFIX%20sdo%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20%3FMatchURI%20%28COALESCE%28%3Fprop_label%2C%20%3FMatchProp%29%20AS%20%3FMatchProperty%29%20%3FMatchTerm%20%3FSearchTerm%0A%7B%0A%20%20VALUES%20%3FSearchTerm%20%7B%22%2Aopen%2A%22%0A%20%20%7D%0A%20%20%28%3FMatchURI%20%3FWeight%20%3FMatchTerm%20%3Fgraph%20%3FMatchProp%29%20text%3Aquery%20%28%20ex%3ANameProps%20%3FSearchTerm%29%20.%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchURI%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fmatch_label%20.%0A%20%20%7D%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchProp%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fprop_label%20.%0A%20%20%7D%0A%7D

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex: <http://www.example.org/resources#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sdo: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?MatchURI (COALESCE(?prop_label, ?MatchProp) AS ?MatchProperty) ?MatchTerm ?SearchTerm
{
  VALUES ?SearchTerm {"*open*"
  }
  (?MatchURI ?Weight ?MatchTerm ?graph ?MatchProp) text:query ( ex:NameProps ?SearchTerm) .

  OPTIONAL {
    ?MatchURI skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?match_label .
  }

  OPTIONAL {
    ?MatchProp skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?prop_label .
  }
}

recalcitrantsupplant · 2023-09-13T06:29:47Z

How does this look?

uses UNION structure as above from Jamie to improve performance
adds back in properties/blank nodes for objects
provides different matches rather than aggregating as per original query - I'm not too fussed either way - @hjohns @jamiefeiss any opinions on adding weights vs providing multiple search results, one per weight?

PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?w ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri . 
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .     
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?w ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)
      {
        ?search_result_uri ?predicate ?match .
        BIND (50 AS ?w)
        FILTER (REGEX(?match, "^open$", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (20 AS ?w)
        FILTER (REGEX(?match, "^open", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (10 AS ?w)
        FILTER (REGEX(?match, "open", "i"))
      }
    }
    GROUP BY ?search_result_uri ?predicate ?match ?hashID ?w 
    LIMIT 10
  }
  ?search_result_uri ?p ?o1 .
  OPTIONAL {
    FILTER(ISBLANK(?o1))
    ?o1 ?p2 ?o2 .
    OPTIONAL {
      FILTER(ISBLANK(?o2))
      ?o2 ?p3 ?o3 .
    }
  }  
}

jamiefeiss · 2023-09-13T06:49:09Z

Looks good, nice and fast at about 0.035s.

Not aggregating just means you'll get duplicate results in the case where a result satisfies multiple matches.
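
To make the trade-off concrete, here is a rough Python sketch that mimics the three UNION branches with Python's `re` module instead of SPARQL. The weights and patterns are taken from the query above; the function names and the sample labels are made up for illustration, and Python's regex dialect is only an approximation of what the triplestore uses.

```python
import re

# (weight, pattern) pairs mirroring the three UNION branches in the query above.
BRANCHES = [
    (50, r"^open$"),
    (20, r"^open"),
    (10, r"open"),
]


def branch_weights(match: str) -> list[int]:
    """One weight per UNION branch whose regex matches (case-insensitive).

    Without aggregation, each entry would be a separate search result row.
    """
    return [w for w, pattern in BRANCHES if re.search(pattern, match, re.IGNORECASE)]


def aggregated_weight(match: str) -> int:
    """Rough equivalent of SUM(?w) with GROUP BY: one row, weights added up."""
    return sum(branch_weights(match))


# Hypothetical sample labels, not taken from the IDN data.
for label in ["Open", "Open access", "Reopened"]:
    print(label, branch_weights(label), aggregated_weight(label))
```

A label of exactly "Open" hits all three branches, so without aggregation it comes back three times (weights 50, 20 and 10), and with aggregation it comes back once with a weight of 80.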

What do you think of restricting the matched predicate to labels & descriptions? Description matching could be worth less too. Also what do you think of restricting the base class to classes Prez supports?

recalcitrantsupplant · 2023-09-13T06:58:05Z

> What do you think of restricting the matched predicate to labels & descriptions?

This would be a closed profile with no properties defined. You'll then get labels/descriptions when the annotations are added. Profile changes are coming soon.

> Description matching could be worth less too.

Sounds good - any issue adding LCASE back in too for "exact" match?

      {
        ?search_result_uri ?predicate ?match .
        BIND (100 AS ?w)
        FILTER (LCASE(?match) = "open")
      } 
      UNION
...

> Also what do you think of restricting the base class to classes Prez supports?

Ideally I think Prez could display whatever information is available about whatever object is found, perhaps on a generic page if there isn't a suitable endpoint.

recalcitrantsupplant · 2023-09-13T07:38:03Z

David to:

allow a list of filter values in the API and treat these as a VALUES clause, e.g. for search across multiple vocabularies
add predicates to the query - REGEX performance is bad even on smaller datasets when there is no filtering.
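
As a rough idea of what the first item could look like, here is a minimal Python sketch that turns a list of filter IRIs into a VALUES clause. The helper name `values_clause`, the variable name `scheme` and the second vocabulary IRI are all hypothetical; this is not how Prez implements it, just one way it might be done.

```python
def values_clause(variable: str, iris: list[str]) -> str:
    """Render a SPARQL VALUES clause binding ?variable to each IRI in iris."""
    if not iris:
        raise ValueError("at least one IRI is required")
    for iri in iris:
        # Reject characters that SPARQL does not allow inside <...>,
        # so a filter value cannot break out of the IRI and alter the query.
        if any(ch in '<>"{}|^`\\' or ord(ch) <= 0x20 for ch in iri):
            raise ValueError(f"invalid IRI: {iri!r}")
    terms = " ".join(f"<{iri}>" for iri in iris)
    return f"VALUES ?{variable} {{ {terms} }}"


schemes = [
    "https://linked.data.gov.au/def/data-access-rights",
    "https://example.org/def/another-vocab",  # hypothetical second vocabulary
]
print(values_clause("scheme", schemes))
```

The generated clause would sit next to a pattern such as `?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> ?scheme .` in place of the single hard-coded scheme IRI. The same helper could bind `?predicate` to a list of label and description predicates for the second item.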

recalcitrantsupplant · 2023-09-26T13:07:57Z

Resolved in #149

recalcitrantsupplant mentioned this issue Sep 26, 2023

Search mvp #149

Merged
