Full Text Indexing Strategy #1432

GavinMendelGleason · 2022-09-02T11:33:05Z

GavinMendelGleason
Sep 2, 2022
Maintainer

For a long time we have recognised that TerminusDB needs a strategy for full-text indexing. Many applications require the ability to
search text using semi-structured or unstructured queries such that documents of relevance can be found quickly.

In past applications using TerminusDB we have married our internal database with Solr. This provided fast query response, but managing
the indexing, and incremental indexing strategy and cache coherence turned out to be very brittle and complex.

We have therefore attempted an experiment to see if full text indexing could be done directly in TerminusDB. I think it's fair to say that the experiment was very successful, and proved both easy and quite performant.

While this demonstrates that full text indexing is possible in TerminusDB, it doesn't really give us a strategy for management of full text indexing. This discussion document is designed to bring some of these issues to the fore.

Because there are many possible design choices, and we need to choose one of them for our MVP, I have just put forward my best guess
as to the strategy we should employ. I will modify this document with additional considerations and strategies which are proposed as people
point out other possibilities.

Where does the index reside?

The first pressing problem is where to actually put an index? How is it related to a given branch? Since branches can contain different
data, indexes must either reflect which branch they are covering, or we need to embed the branch information in the index itself. And how
is this index referred to by a given ref as being a completed index relevant to the ref?

Proposal

We store a separate data product whose commit head is referred to by the commit for a given ref. When looking up the ref, we can clearly
see that there is an index covering this commit. The commit object would be modified to look as follows:

{ "@id" : "Commit",
  "@type" : "Class",
  "@key" : { "@type" : "Lexical",
             "@fields" : [ "identifier" ] },
  "instance" : { "@type" : "Optional",
                 "@class" : "layer:Layer" },
  "schema" : "layer:Layer",
  "index" : { "@type" : "Optional",
              "@class" : "layer:Layer" },
  "author" : "xsd:string",
  "message" : "xsd:string",
  "identifier" : "xsd:string",
  "timestamp" : "xsd:decimal",
  "parent" : { "@type" : "Optional",
               "@class" : "Commit" } }

The index commit would point to the data for the indexing. The index schema would be implicit, but stored already in a predefined schema
similar to the one present for ref, repo, system etc.

How do we mark for indexing?

We need a way to describe what is to be indexed. Presumably this should be a schema annotation that allows the user to specify what
should and should not end up in the full text index.

Proposal

As we will need to be able to index multilanguage data we need a way to split on the language field during tokenisation. The specification
should therefore allow us to determine the approach to take at the schema level.

We should have a system wide default language specified in the schema context object, along with all definitions of viable languages for
the given schema. We should allow two types of language specifications. The index should be allowed to specify a

We add another annotation to the class which might look something like:

{ "@type" : "@context",
  "@default_language" : "en",
  "@documentation" : [{ "@language" : "en"}, {"@language" : "de-AT" }] }
{ "@type" : "Class",
  "@index" : { "@fields" ["name", "title"], "@language_field" : "language" },
  "language" : "xsd:language",
  "title" : "xsd:string",
  "name" : "xsd:string" }

Since no language is specified, the tokeniser will be passed the default language. Older schemata with no specified language will be
given en.

For rdf:langString we can simply pass the appropriate language for indexing.

For xsd:string we can obtain the language field from a parameter given by @language_field to the @index.

In addition it would be useful for the program performing tokenisation to get the class and field information. This may be necessary for
special treatment of specific fields such as addresses, programming language fields etc. which require alternative tokenisation approaches not dependent solely on language.

Who does tokenization for data and query?

Tokenisation here refers to the process of parsing, stemming, recognising classes (such as named entitites) and n-gramising data for
term construction. This is complicated, changes for different types of data, and changes radically depending on language. For this reason
TerminusDB can't really be responsible for dealing with tokenization.

Proposal

TerminusDB will recognise bags of words and calculate a tf_idf only. All the remaining work of tokenisation will be shunted to a
user program which is called on specific fields with parameters to improve tokenisation.

The programme, specified as TERMINUSDB_TOKENIZER will be called with a syntax as follows:

TERMINUSDB_TOKENIZER --class={class} --field={field} --language={language} --text={text}

This tokeniser can be implemented in any language, but we will supply an example English tokenizer written in python.

In addition we will have another variable TERMINUSDB_QUERY_TOKENIZER which may or not use the same tokenizer and tokenization
strategy. This will allow the user to direct query tokenization in full text searching in WOQL.

What is in a term?

Once the tokenizer has run, we need to do term construction. Terms however do not really make sense as shared over all field types, they
have different weightings and possibly even different meanings.

Proposal

We should create derived terms, which reflect the field/class being used to define the Term. A term will be constructed as
Class:Field:encoded(Token) with the encoding escaping the class and field separator character. In the case of superclass index
definitions we should use the super-class rather than the most specific class.

How do we index?

TerminusDB needs to be able to show what is indexed, and to reconstruct indices when stale.

Proposal

Branches should be explicitly marked as indexed such that we can decided which commits require an index and which do not. It is useful
to have branches which have the same schema (and therefore the same indexing annotations), but have no index in branches - for example
when programmatically updating the database.

When we have updated an indexed branch we technically need not only to index the given document and invert the index, but we will need to change the tf_idf for all relevant documents. Instead of this approach we should trigger a scaling operation that fires
probabilistically after some number of index changes based on the fraction of changed documents.

GavinMendelGleason · 2022-09-02T11:45:57Z

GavinMendelGleason
Sep 2, 2022
Maintainer Author

This Xapian Documentation is useful for getting an overview of tf-idf approaches.

0 replies

hpvd · 2023-06-20T08:43:52Z

hpvd
Jun 20, 2023

just a comment on fulltext index search:
I fully understand that making it possible to use/rely on solr would be a pretty hard task.

On the other hand, the solr ecosystem brings so many interesting/needed functionalities to the table, that it may be worth a second or even third look.
Just to name a few:

for the search of people (their work) in the db with foreign names (very often in science) using phonetic-matching: https://solr.apache.org/guide/solr/latest/indexing-guide/phonetic-matching.html#beider-morse-phonetic-matching-bmpm
Facetted search to narrow down the results (like known from ecommerce)
or simply the snowball stemmers for different languages https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#snowball-porter-stemmer-filter

deep link to atomic index updates:
https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html

2 replies

GavinMendelGleason Jun 20, 2023
Maintainer Author

We spent a fair bit of time contemplating this and experimenting, and then found that the results for natural language queries were better served with LLMs, including for phonetic-matching, stemming, etc. We are currently integrating our own vector database (https://github.com/terminusdb-labs/terminusdb-semantic-indexer)) and hope to have it live on TerminusCMS in the next few weeks.

We hope to integrate the search into WOQL and GraphQL in such a way that faceted search will be straightforward, and flexible without having to add an additional query language.

hpvd Jun 20, 2023

thanks for inside and link to your semantic indexer! Sounds really interesting !

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TerminusDB

Full Text Indexing Strategy #1432

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

TerminusDB

Full Text Indexing Strategy #1432

GavinMendelGleason Sep 2, 2022 Maintainer

Where does the index reside?

Proposal

How do we mark for indexing?

Proposal

Who does tokenization for data and query?

Proposal

What is in a term?

Proposal

How do we index?

Proposal

Replies: 2 comments · 2 replies

GavinMendelGleason Sep 2, 2022 Maintainer Author

hpvd Jun 20, 2023

GavinMendelGleason Jun 20, 2023 Maintainer Author

hpvd Jun 20, 2023

GavinMendelGleason
Sep 2, 2022
Maintainer

Replies: 2 comments 2 replies

GavinMendelGleason
Sep 2, 2022
Maintainer Author

hpvd
Jun 20, 2023

GavinMendelGleason Jun 20, 2023
Maintainer Author