Use JVector to index Vectors of floats - POC #814

eolivelli · 2023-10-13T08:05:35Z

This is a POC that uses JVector to build an index over vectors of floats.

JVector is the most advanced library to build indexes over this data type and it will be used in Cassandra 5.0.

Please note that when using the index you won't be doing a full table scan, but on the other hand the results will be an "approximation", which is fine for most use cases, especially Vector Search for Generative AI.
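For comparison, this is roughly what the exact answer costs without an index: every row is scored against the query and the best k are kept. It is only a sketch of the full-scan baseline that the index approximates; the class name `BruteForceVectorScan`, the cosine metric and the in-memory `List<float[]>` are assumptions made for illustration and are neither HerdDB nor JVector code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical full-scan baseline: exact top-k by cosine similarity. */
final class BruteForceVectorScan {

    /** Cosine similarity of two vectors with the same dimension. */
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f; // a zero vector has no direction; treat it as unrelated
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    /** Returns the positions of the k most similar rows, best match first. */
    static List<Integer> topK(List<float[]> rows, float[] query, int k) {
        // Score every row: this is the O(rows * dimension) cost of a full scan.
        float[] scores = new float[rows.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = cosine(rows.get(i), query);
        }
        Comparator<Integer> byScore = Comparator.comparingDouble((Integer i) -> scores[i]);
        // Min-heap: the head is always the weakest of the current best k.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int i = 0; i < scores.length; i++) {
            heap.add(i);
            if (heap.size() > k) {
                heap.poll(); // drop the weakest candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }
}
```

An approximate index returns most, but not necessarily all, of the rows this scan would return, while visiting only a small part of the data.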

This is currently a POC.

Easy things to implement:

integrate with the DDL (we need to add more space in the index metadata for all the side parameters of the index)
integrate with the Planner (detect ORDER BY .... and decide to use the index)

Hard things:

find a way to not have the whole JVector index in memory
implement persistent data storage
implement checkpoint
implement a mapping from the "nodeId" (integer) to the primary key (byte array)
implement DELETE (not supported yet in JVector)

The main issue is that, when the index is open for writing, it seems to be always fully stored in memory, and we can flush it to disk periodically.

I cannot find a good way to avoid flushing the whole index to disk; the only way I can see with the current version of JVector is to flush the index during a checkpoint.
I guess that in Cassandra there is no problem because they flush the index when the SSTable is flushed to disk, and then it becomes immutable.
In HerdDB we have long-lived, table-wide indexes and the paging mechanism is handled in another way: pages are still immutable once they are flushed to disk, but indexes have their own pages, which are flushed next to the data pages.
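The "mutable in memory, flushed as a whole at checkpoint" model described above can be sketched as follows. This is only an illustration under stated assumptions: `CheckpointedVectorStore` and its byte layout are hypothetical, it stores raw vectors rather than a JVector graph, and it does not use any HerdDB or JVector API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical store: writes go to memory, a checkpoint emits an immutable snapshot. */
final class CheckpointedVectorStore {
    private final int dimension;
    private final List<float[]> vectors = new ArrayList<>();
    private boolean dirty; // true when memory holds changes not yet checkpointed

    CheckpointedVectorStore(int dimension) {
        this.dimension = dimension;
    }

    /** Adds a vector in memory and returns its position (the "nodeId"). */
    synchronized int add(float[] vector) {
        if (vector.length != dimension) {
            throw new IllegalArgumentException("expected dimension " + dimension);
        }
        vectors.add(vector.clone());
        dirty = true;
        return vectors.size() - 1;
    }

    synchronized float[] get(int nodeId) { return vectors.get(nodeId).clone(); }
    synchronized int size() { return vectors.size(); }
    synchronized boolean isDirty() { return dirty; }

    /** Serializes the whole in-memory state; the returned bytes never change afterwards. */
    synchronized byte[] checkpoint() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(dimension);
            out.writeInt(vectors.size());
            for (float[] vector : vectors) {
                for (float value : vector) {
                    out.writeFloat(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        dirty = false;
        return buffer.toByteArray();
    }

    /** Rebuilds the in-memory state from a snapshot, e.g. at boot. */
    static CheckpointedVectorStore restore(byte[] snapshot) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(snapshot))) {
            CheckpointedVectorStore store = new CheckpointedVectorStore(in.readInt());
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                float[] vector = new float[store.dimension];
                for (int d = 0; d < vector.length; d++) {
                    vector[d] = in.readFloat();
                }
                store.vectors.add(vector);
            }
            return store;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of this model is visible in `checkpoint()`: every checkpoint rewrites the complete state, which is the part that does not fit well with long-lived, table-wide indexes.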

We will have to be creative or work with JVector folks to have more support there.

Also, it is awkward that we need to store the mapping between a "nodeId" and the PK of the record outside the JVector data set. Currently we can do it with the usual BLink, as we do for the PK (the PK index stores a bytes -> long mapping), but if we could store the PK inside JVector we would save some coordination (and very likely also disk accesses).
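A minimal in-memory sketch of such a side mapping is shown below. The class `NodeIdMapping` and its methods are hypothetical names for illustration; a real implementation would presumably sit on the BLink and be persisted with the checkpoint, which this sketch does not attempt.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-way mapping between integer node ids and primary keys. */
final class NodeIdMapping {
    // nodeId -> primary key: node ids are dense, so a list is enough.
    private final List<byte[]> keyByNodeId = new ArrayList<>();
    // primary key -> nodeId: ByteBuffer gives content-based equals/hashCode for byte[].
    private final Map<ByteBuffer, Integer> nodeIdByKey = new HashMap<>();

    /** Returns the node id already bound to this key, or assigns the next free one. */
    int assign(byte[] primaryKey) {
        byte[] copy = primaryKey.clone(); // defensive copy: callers may reuse their array
        ByteBuffer key = ByteBuffer.wrap(copy);
        Integer existing = nodeIdByKey.get(key);
        if (existing != null) {
            return existing;
        }
        int nodeId = keyByNodeId.size();
        keyByNodeId.add(copy);
        nodeIdByKey.put(key, nodeId);
        return nodeId;
    }

    /** Resolves a node id returned by a vector search back to the record's primary key. */
    byte[] primaryKeyOf(int nodeId) {
        return keyByNodeId.get(nodeId).clone();
    }

    /** Returns the node id for a key, or -1 if the key was never indexed. */
    int nodeIdOf(byte[] primaryKey) {
        Integer nodeId = nodeIdByKey.get(ByteBuffer.wrap(primaryKey));
        return nodeId == null ? -1 : nodeId;
    }
}
```

Every search result needs one `primaryKeyOf` lookup before the record can be fetched, which is the extra coordination (and potential disk access) mentioned above.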

To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.

I hereby declare this contribution to be licenced under the Apache License Version 2.0, January 2004

eolivelli · 2023-10-27T14:34:02Z

This is the PR that adds JVector to Cassandra:
https://github.com/apache/cassandra/pull/2673/files

eolivelli added 2 commits October 12, 2023 00:53

Build docker on m1

991fbea

Use JVector to index Vetors of floats - POC

88dab94

eolivelli mentioned this pull request Oct 13, 2023

Is there a way to build an index while keeping it on disk (GraphIndexBuilder + OnDiskGraphIndex ?) jbellis/jvector#125

Closed

Upgrade to JVector 1.0.2

b2627cb
