Use JVector to index Vetors of floats - POC #814
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a POC about using jvector to build an index over vectors of float.
JVector is the most advanced library to build indexes over this data type and it will be used in Cassandra 5.0.
Please note that when using the index you won't be doing a full table scan, but on the other side the results with be an "approximation", that is fine for most of the use cases, especially Vector Search for Generative AI.
This is currently a POC.
Easy things to implement:
Hard things:
The main issue is that It seems that when the index is open for writing it is always fully stored in memory, and we can flush it to disk periodically.
I cannot find a good way to not flush the index to disk, the only way I can see with the current version of JVector is to flush the index during a check point.
I guess that in Cassandra there is no problem because they flush the index when the SSTable is flushed to disk and then it become immutable.
In HerdDB we have long lived table-wide indexes and the paging mechanism is handle in another way: we still have immutable pages when they are flushed to disk and we have pages for indexes and indexes are flushed next to the data pages.
We will have to be creative or work with JVector folks to have more support there.
Also in is awkward that we need to store the mapping between a "nodeId" with the PK of the record out side the JVector data set. Currently we can do it with the usual BLink as we do for the PK (the PK stored a mapping bytes -> long) but if we could store the PK into the JVector we will save some coordination (an very likely also disk accesses)
To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.