Full Text Indexing Strategy #1432
GavinMendelGleason
started this conversation in
Ideas
-
This Xapian documentation is useful for getting an overview of tf-idf approaches.
-
Just a comment on full-text index search: on the other hand, the Solr ecosystem brings so many interesting and needed functionalities to the table that it may be worth a second or even third look.
Deep link to atomic index updates:
For a long time we have recognised that TerminusDB needs a strategy for full-text indexing. Many applications require the ability to search text using semi-structured or unstructured queries so that relevant documents can be found quickly.
In past applications using TerminusDB we have married our internal database with Solr. This provided fast query responses, but managing the indexing, the incremental indexing strategy, and cache coherence turned out to be very brittle and complex.
We have therefore attempted an experiment to see if full text indexing could be done directly in TerminusDB. I think it's fair to say that the experiment was very successful, and proved both easy and quite performant.
While this demonstrates that full text indexing is possible in TerminusDB, it doesn't really give us a strategy for management of full text indexing. This discussion document is designed to bring some of these issues to the fore.
Because there are many possible design choices, and we need to choose one of them for our MVP, I have put forward my best guess as to the strategy we should employ. I will update this document with additional considerations and strategies as people point out other possibilities.
Where does the index reside?
The first pressing problem is where to actually put an index. How is it related to a given branch? Since branches can contain different data, indexes must either record which branch they cover, or we need to embed the branch information in the index itself. And how does a given ref refer to a completed index relevant to that ref?
Proposal
We store a separate data product whose commit head is referred to by the commit for a given ref. When looking up the ref, we can clearly
see that there is an index covering this commit. The commit object would be modified to look as follows:
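As a purely illustrative guess at the shape of such a commit object (the `index` field, the `IndexCommit` type, and the identifiers here are hypothetical, not actual TerminusDB schema):

```json
{
  "@type": "ValidCommit",
  "@id": "ValidCommit/abc123",
  "instance": "...",
  "schema": "...",
  "index": {
    "@type": "IndexCommit",
    "head": "IndexCommit/def456"
  }
}
```

The key idea is only that the commit for a ref carries a pointer to the head commit of the separate index data product, so staleness is decidable by looking at the ref alone.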
The index commit would point to the data for the indexing. The index schema would be implicit, but stored already in a predefined schema
similar to the one present for ref, repo, system etc.
How do we mark for indexing?
We need a way to describe what is to be indexed. Presumably this should be a schema annotation that allows the user to specify what
should and should not end up in the full text index.
Proposal
As we will need to be able to index multi-language data, we need a way to split on the language field during tokenisation. The specification should therefore allow us to determine the approach to take at the schema level.
We should have a system-wide default language specified in the schema context object, along with definitions of all viable languages for the given schema. We should allow two types of language specification: a fixed language given directly in the annotation, or a field from which the language of a value can be read.
We add another annotation to the class which might look something like:
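A guess at the shape of such an annotation, using the `@index` and `@language_field` keywords from this proposal (their exact placement and syntax here are assumptions):

```json
{
  "@type": "Class",
  "@id": "Post",
  "language": "xsd:string",
  "body": {
    "@type": "xsd:string",
    "@index": { "@language_field": "language" }
  }
}
```

Here the `body` field would be indexed, with its tokenisation language read per-document from the `language` field.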
Since no language is specified, the tokeniser will be passed the default language. Older schemata with no specified language will be given `en`.
For `rdf:langString` we can simply pass the appropriate language for indexing. For `xsd:string` we can obtain the language field from a parameter given by `@language_field` to the `@index`.
In addition, it would be useful for the program performing tokenisation to get the class and field information. This may be necessary for special treatment of specific fields such as addresses, programming-language fields, etc., which require alternative tokenisation approaches not dependent solely on language.
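The language-selection rules above can be sketched as follows (the function name and parameters are illustrative, not part of the proposal):

```python
DEFAULT_LANGUAGE = "en"  # hypothetical system-wide default from the schema context

def tokenizer_language(value_type, lang_tag=None, language_field_value=None):
    """Pick the language to hand to the tokenizer.

    value_type:           the datatype of the field being indexed
    lang_tag:             the tag carried by an rdf:langString value, if any
    language_field_value: the value of the field named by @language_field
    """
    if value_type == "rdf:langString" and lang_tag:
        return lang_tag
    if value_type == "xsd:string" and language_field_value:
        return language_field_value
    # Older schemata with no specified language fall back to the default.
    return DEFAULT_LANGUAGE
```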
Who does tokenization for data and query?
Tokenisation here refers to the process of parsing, stemming, recognising classes (such as named entities), and n-gramising data for term construction. This is complicated, varies across different types of data, and changes radically depending on language. For this reason TerminusDB can't really be responsible for dealing with tokenization.
Proposal
TerminusDB will recognise bags of words and calculate a `tf_idf` only. All the remaining work of tokenisation will be shunted to a user program which is called on specific fields with parameters to improve tokenisation.
The programme, specified as `TERMINUSDB_TOKENIZER`, will be called with a fixed command-line syntax. This tokeniser can be implemented in any language, but we will supply an example English tokenizer written in Python.
In addition, we will have another variable, `TERMINUSDB_QUERY_TOKENIZER`, which may or may not use the same tokenizer and tokenization strategy. This will allow the user to direct query tokenization in full-text searching in WOQL.
What is in a term?
Once the tokenizer has run, we need to do term construction. Terms, however, do not really make sense shared across all field types: they have different weightings and possibly even different meanings.
Proposal
We should create derived terms which reflect the field/class being used to define the term. A term will be constructed as `Class:Field:encoded(Token)`, with the encoding escaping the class and field separator character. In the case of super-class index definitions we should use the super-class rather than the most specific class.
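Term construction can be sketched as follows (the choice of backslash as the escape character is an assumption; the proposal names none):

```python
SEPARATOR = ":"   # separates class, field, and token in a term
ESCAPE = "\\"     # hypothetical escape character

def encode_token(token):
    """Escape the separator (and the escape character itself) in a token."""
    return token.replace(ESCAPE, ESCAPE + ESCAPE).replace(SEPARATOR, ESCAPE + SEPARATOR)

def make_term(cls, field, token):
    """Construct a derived term of the form Class:Field:encoded(Token)."""
    return SEPARATOR.join([cls, field, encode_token(token)])
```

Escaping both directions means terms remain unambiguous even when a token itself contains the separator, and the original token can be recovered from a term if needed.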
How do we index?
TerminusDB needs to be able to show what is indexed, and to reconstruct indices when stale.
Proposal
Branches should be explicitly marked as indexed so that we can decide which commits require an index and which do not. It is useful to have branches which share the same schema (and therefore the same indexing annotations) but carry no index, for example when programmatically updating the database.
When we have updated an indexed branch, we technically need not only to index the given document and invert the index, but also to change the `tf_idf` for all relevant documents. Instead of this approach, we should trigger a scaling operation that fires probabilistically after some number of index changes, based on the fraction of changed documents.