[RFC] Segments Free Vector Search in OpenSearch #2538

navneet1v · 2025-02-18T20:23:59Z

Note: This RFC is still a work in progress and will keep getting updates over the coming days.

Introduction

This issue lays out the high-level details of freeing vector search in OpenSearch from Lucene segments for the native engines (Faiss). It first establishes the pain points of segments for vector search and then proposes a new architecture for the k-NN plugin (which is responsible for vector search). The document will then describe how this architecture integrates with other OpenSearch processes, such as snapshots and recovery, to ensure its resiliency. These sections require more thought and will keep evolving over time.

Current Architecture

The k-NN plugin in OpenSearch supports 3 different types of engines for performing Approximate Nearest Neighbor (ANN) search. An engine is just an abstraction provided by the plugin over the downstream library that is used to do the nearest neighbor search. Currently the plugin has Lucene (Java implementation), Faiss (C++ implementation) and Nmslib (C++ implementation) as its 3 engines.
Every engine supports various algorithms to do the search. At a high level we support:

Lucene: HNSW algorithm
Nmslib (Native Engine): HNSW algorithm. Engine marked as deprecated in version 2.19 of OpenSearch.
Faiss (Native Engine): HNSW and IVF algorithms. Default engine since version 2.18 of OpenSearch.
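
To make the engine and algorithm choice concrete, here is a minimal sketch of how an index body for a vector field could be built. The helper name `knn_index_body`, the field name `my_vector`, the dimension and the `l2` space type are illustrative assumptions, and the mapping keys reflect my understanding of the k-NN plugin's `knn_vector` mapping; please check them against the documentation for your OpenSearch version.

```python
import json

# Engine -> algorithms, taken from the list above.
SUPPORTED_METHODS = {
    "lucene": {"hnsw"},
    "nmslib": {"hnsw"},   # deprecated in 2.19
    "faiss": {"hnsw", "ivf"},
}


def knn_index_body(field: str, dimension: int,
                   engine: str = "faiss", method: str = "hnsw") -> dict:
    """Build a create-index body for one k-NN vector field (hypothetical helper)."""
    if method not in SUPPORTED_METHODS.get(engine, set()):
        raise ValueError(f"engine {engine!r} does not support method {method!r}")
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": method,
                        "engine": engine,
                        "space_type": "l2",  # assumed distance metric
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent when creating the index.
    print(json.dumps(knn_index_body("my_vector", 4), indent=2))
```

The helper only builds the request body; sending it to a cluster is left out so the sketch does not depend on any client library.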

At a very high level, the data of an OpenSearch index (or vector index) is stored in shards. Shards are nothing but Lucene indices. Each shard is further divided into segments, which are created during ingestion and are immutable once created. For indices that contain k-NN fields, the architecture is the same at the OpenSearch and Lucene levels, and the k-NN plugin uses this same architecture to support Approximate Nearest Neighbor search. During segment creation, in addition to the data structures needed for other fields (such as FSTs, BKD trees and DocValues), the k-NN plugin creates the vector-related data structures for each vector field. These are written as segment files and tracked by Lucene.
When performing an ANN search, we load these vector data structure files into memory (not the JVM heap) if they are not already present, and then perform the search using the respective library.
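
The per-segment layout described above can be sketched as a toy model. This is not the plugin's code: an exact scan stands in for the per-segment ANN search, and the names `search_segment` and `search_shard` are made up for illustration. It only shows that a shard-level query runs one search per segment and then merges the per-segment top-k lists.

```python
import heapq
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_squared(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_segment(segment: List[Vector], query: Vector,
                   k: int) -> List[Tuple[float, int]]:
    """Stand-in for one per-segment ANN search; returns (distance, doc_id)."""
    scored = ((l2_squared(vec, query), doc_id)
              for doc_id, vec in enumerate(segment))
    return heapq.nsmallest(k, scored)


def search_shard(segments: List[List[Vector]], query: Vector,
                 k: int) -> List[Tuple[float, int, int]]:
    """Search every segment, then merge into one shard-level top-k list."""
    candidates = []
    for segment_id, segment in enumerate(segments):  # one search per segment
        for dist, doc_id in search_segment(segment, query, k):
            candidates.append((dist, segment_id, doc_id))
    # Each result is (distance, segment_id, doc_id), closest first.
    return heapq.nsmallest(k, candidates)
```

In this model the number of searches grows with the number of segments, which is one way to see why merging segments down tends to help query performance.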

![Image](a)

Pain Points with Lucene Segments

As of version 2.18 of the k-NN plugin in OpenSearch, vector data structures are built per Lucene segment. This comes from the Lucene library. Most of the time, to get better performance, customers have to merge the segments down, which is an expensive process. Features like Concurrent Segment Search improve performance, but the result is nowhere close to single-segment performance and throughput.

Why should we solve the above problems?

Given the competition and innovation happening in the vector search space, this is the need of the hour. Moving to a new architecture is never easy, and in recent years we have given a fair shot to solving these problems, for example by addressing the segment problems with Concurrent Segment Search, greedy creation of vector data structures, etc. We moved the needle, but those were patches to the problem rather than a complete solution.

Features of an Ideal Solution

The proposed solution should handle all current k-NN query-level use cases, including hybrid search, complex nested queries, efficient filtering and the different types of quantization.
The solution should support the basic OpenSearch operations, such as snapshots, recovery and replicas, for durability.

High Level Architecture

In the sections below I have added only the diagrams for now, without many details. I will keep adding details in the coming days.

Flush Flow

Lucene DocId to VectorId Mapper: A small translation layer that maps a (segment_id, docId) pair to a VectorIndex DocId. It will be persisted with the segment file and used during queries to map a VectorIndex DocId back to its segment and Lucene docId.
Native Index Service/Component: A component that abstracts the logic for interacting with a single Faiss index at the shard level. It has multiple internal components, such as a write-ahead log. Details on those components will be added later.
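
As a rough illustration of the mapper described above, the sketch below keeps a two-way mapping between (segment_id, docId) pairs and IDs in the shard-level vector index. The class name, the method names and the choice of dense, sequentially assigned vector IDs are all assumptions made for this example; the RFC does not yet specify the ID scheme, persistence format, or how deletes and merges are handled.

```python
from typing import Dict, List, Tuple

SegmentDoc = Tuple[str, int]  # (segment_id, Lucene docId)


class DocIdToVectorIdMapper:
    """Hypothetical two-way map between Lucene doc IDs and vector index IDs."""

    def __init__(self) -> None:
        self._forward: Dict[SegmentDoc, int] = {}
        self._reverse: List[SegmentDoc] = []  # list index == vector ID

    def assign(self, segment_id: str, doc_id: int) -> int:
        """Return the vector ID for a document, allocating one if needed."""
        key = (segment_id, doc_id)
        if key in self._forward:
            return self._forward[key]
        vector_id = len(self._reverse)  # assumption: dense sequential IDs
        self._forward[key] = vector_id
        self._reverse.append(key)
        return vector_id

    def resolve(self, vector_id: int) -> SegmentDoc:
        """Reverse lookup used at query time: vector ID -> (segment, docId)."""
        if not 0 <= vector_id < len(self._reverse):
            raise KeyError(f"unknown vector id {vector_id}")
        return self._reverse[vector_id]
```

At query time, the IDs returned by the shard-level Faiss index would be passed through `resolve` so that the rest of the query path can keep working with Lucene segments and doc IDs.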

Merge Flow

Search Flow

TBA

Snapshots

TBA

Shard Initialization and Recovery

TBA

Persisting Vector Index on File System

TBA

navneet1v self-assigned this Feb 18, 2025

opensearch-infra bot added this to OpenSearch Roadmap Feb 18, 2025

github-project-automation bot moved this to New in OpenSearch Roadmap Feb 18, 2025

github-actions bot added the untriaged label Feb 18, 2025

jmazanec15 removed the untriaged label Feb 19, 2025
