[RFC] Segments Free Vector Search in OpenSearch #2538
Labels: enhancement, Features, indexing-improvements, performance, RFC, search-improvements
Note: This RFC is still a work in progress and will be updated over the coming days.
Introduction
This issue lays out the high-level details of freeing Vector Search in OpenSearch from Lucene segments for native engines (Faiss). It first establishes the pain points of segments with vector search and then proposes a new architecture for the k-NN plugin (which is responsible for Vector Search). The document then lays out the integration with other OpenSearch processes, such as snapshots and recovery, to ensure the resiliency of this new architecture. These sections require more thought and will keep evolving over time.
Current Architecture
The k-NN plugin in OpenSearch supports 3 different engines for performing Approximate Nearest Neighbor (ANN) search. An engine is just an abstraction provided by the plugin over the downstream library used to do the nearest neighbor search. Currently the plugin has Lucene (Java implementation), Faiss (C++ implementation) and Nmslib (C++ implementation) as its 3 engines.
Every engine supports various algorithms to do the search. At a high level we support:
At a very high level, the data of an OpenSearch index (or vector index) is stored in shards. Shards are nothing but Lucene indices. Each shard is further divided into segments, which are created during ingestion. These segments are immutable once created. For indices that have k-NN fields, the architecture is the same at the OpenSearch and Lucene level, and the k-NN plugin uses that same architecture to support Approximate Nearest Neighbor search. During segment creation, apart from creating the data structures needed for the different fields (such as FSTs, BKD trees and DocValues), the k-NN plugin creates the vector-related data structures for each vector field. These files are written as segment files and tracked by Lucene.
While performing an ANN search, we load these vector data structure files into memory (not the JVM heap) if they are not already present, and then perform the search using the respective library.
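The load-if-absent behavior described above can be pictured with a small sketch. This is only an illustration under stated assumptions: `VectorIndexCache`, the `loader` callback and the file name used in the usage note are hypothetical and are not the plugin's actual classes or file names. The real plugin loads native (off-heap) structures, which a Python dictionary only approximates.

```python
from typing import Callable, Dict


class VectorIndexCache:
    """Hypothetical load-on-first-use cache, keyed by segment file path."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        # `loader` stands in for whatever reads a vector data structure
        # file into native memory; it is an assumption of this sketch.
        self._loader = loader
        self._loaded: Dict[str, object] = {}

    def get(self, file_path: str) -> object:
        # Load the file only if it is not already resident, then reuse it.
        if file_path not in self._loaded:
            self._loaded[file_path] = self._loader(file_path)
        return self._loaded[file_path]
```

For example, calling `cache.get("_0.faiss")` twice (a made-up file name) would invoke the loader once and return the same object both times. Because each segment has its own file, a shard with many segments needs one such load per segment.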
Pain Points with Lucene Segments
As of version 2.18 of the k-NN plugin in OpenSearch, vector data structures are built per Lucene segment. This comes from the Lucene library. Most of the time, to get better performance customers have to merge segments down, which is an expensive process. Features like Concurrent Segment Search improve performance, but it is still nowhere close to single-segment performance and throughput.
Why should we solve the above problems?
Given the competition and the innovations happening in the Vector Search space, this is the need of the hour. Moving to a new architecture is never easy, and in recent years we have given a fair shot to solving the segment problems with Concurrent Segment Search, greedy creation of vector data structures, etc. We moved the needle, but those were more patches to the problem than a complete solution.
Features of an Ideal Solution
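To make the per-segment cost concrete, here is a minimal sketch of how a shard-level query has to fan out when every segment owns its own vector data structure. It is an illustration, not the plugin's code: a brute-force scan stands in for the per-segment ANN search, the function names are hypothetical, and doc ids are treated as shard-wide for simplicity.

```python
import heapq
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Segment = List[Tuple[int, Vector]]  # (doc_id, vector) pairs


def squared_l2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_segment(segment: Segment, query: Vector, k: int) -> List[Tuple[float, int]]:
    # Brute-force stand-in for one ANN search over one segment's structure.
    scored = [(squared_l2(vec, query), doc_id) for doc_id, vec in segment]
    return heapq.nsmallest(k, scored)


def search_shard(segments: List[Segment], query: Vector, k: int) -> List[Tuple[float, int]]:
    # One search per segment, then a merge of the per-segment top-k lists.
    candidates: List[Tuple[float, int]] = []
    for segment in segments:
        candidates.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, candidates)
```

With N segments the shard performs N separate searches plus a merge, which is why merging down to a single segment helps and why running the per-segment searches concurrently only partly closes the gap.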
High Level Architecture
In the sections below I have only added the picture and not many details; I will keep adding more details in the coming days.
Flush Flow
Lucene DocId to VectorId Mapper: A small translation layer that maps a (segment_id, docId) pair to a VectorIndex DocId. It will be persisted with the segment file and used during queries to reverse-map a VectorIndex DocId back to its segment and docId.
Native Index Service/Component: A component that abstracts the logic for interacting with a single Faiss index at the shard level. It has multiple internal components, such as a write-ahead log. Details on those components will be added later.
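The Lucene DocId to VectorId Mapper described above could look roughly like the following sketch. It is an assumption-laden illustration: the class name, method names and the in-memory dictionaries are hypothetical, and the persistence alongside the segment file that the RFC mentions is deliberately left out.

```python
from typing import Dict, Tuple

LuceneDoc = Tuple[str, int]  # (segment_id, doc_id)


class DocIdToVectorIdMapper:
    """Hypothetical two-way map between Lucene docs and vector index ids."""

    def __init__(self) -> None:
        self._forward: Dict[LuceneDoc, int] = {}
        self._reverse: Dict[int, LuceneDoc] = {}

    def put(self, segment_id: str, doc_id: int, vector_id: int) -> None:
        key = (segment_id, doc_id)
        # Keep the mapping one-to-one so the reverse lookup is unambiguous.
        if key in self._forward or vector_id in self._reverse:
            raise ValueError("mapping already exists")
        self._forward[key] = vector_id
        self._reverse[vector_id] = key

    def to_vector_id(self, segment_id: str, doc_id: int) -> int:
        # Used at flush time: which vector index entry belongs to this doc?
        return self._forward[(segment_id, doc_id)]

    def to_lucene_doc(self, vector_id: int) -> LuceneDoc:
        # Used at query time: translate a vector index hit back to Lucene.
        return self._reverse[vector_id]
```

At query time, each hit returned by the shard-level vector index would go through `to_lucene_doc` so that the rest of the search can work with the segment and docId it already understands.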
Merge Flow
Search Flow
TBA
Snapshots
TBA
Shard Initialization and Recovery
TBA
Persisting Vector Index on File System
TBA