[RFC] Segments Free Vector Search in OpenSearch #2538

Open
navneet1v opened this issue Feb 18, 2025 · 0 comments
Assignees
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. performance Make it fast! RFC Request for comments search-improvements


Note: This RFC is still a work in progress and will be updated over the coming days.


Introduction

This issue lays out the high-level details of freeing vector search in OpenSearch from Lucene segments for the native engines (Faiss). It first establishes the pain points that segments create for vector search and then proposes a new architecture for the k-NN plugin, which is responsible for vector search. The document then describes how the new architecture integrates with other OpenSearch processes, such as snapshots and recovery, to ensure its resiliency. These sections require more thought and will keep evolving over time.

Current Architecture

The k-NN plugin in OpenSearch supports 3 types of engines for Approximate Nearest Neighbor (ANN) search. An engine is just an abstraction the plugin provides over the downstream library that performs the nearest neighbor search. Currently the plugin has 3 engines: Lucene (Java implementation), Faiss (C++ implementation), and Nmslib (C++ implementation).
Every engine supports one or more algorithms to perform the search. At a high level we support:

  1. Lucene: HNSW algorithm.
  2. Nmslib (native engine): HNSW algorithm. This engine is marked as deprecated as of OpenSearch 2.19.
  3. Faiss (native engine): HNSW and IVF algorithms. This is the default engine since OpenSearch 2.18.
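
The engine and algorithm are chosen per vector field when the index is created. The following is a minimal sketch, assuming the k-NN plugin's documented `knn_vector` mapping shape (`method.name`, `method.engine`, `method.space_type`); the helper name `hnsw_index_body` and the field name `my_vector` are hypothetical, and the exact mapping fields should be checked against the documentation for your OpenSearch version. It only builds HNSW mappings, since IVF additionally requires a training step.

```python
import json

# Engine -> algorithms, taken from the list above.
ENGINE_ALGORITHMS = {
    "lucene": {"hnsw"},
    "nmslib": {"hnsw"},         # deprecated as of OpenSearch 2.19
    "faiss": {"hnsw", "ivf"},   # default engine since OpenSearch 2.18
}


def hnsw_index_body(field: str, dimension: int, engine: str = "faiss",
                    space_type: str = "l2") -> dict:
    """Build a create-index request body with one HNSW-backed vector field.

    Hypothetical helper: the body layout is an assumption based on the
    k-NN plugin's knn_vector mapping, not something defined in this RFC.
    """
    if "hnsw" not in ENGINE_ALGORITHMS.get(engine, set()):
        raise ValueError(f"unknown engine or no HNSW support: {engine!r}")
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": engine,
                        "space_type": space_type,
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(hnsw_index_body("my_vector", 4), indent=2))
```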

On a very high level, the data of an OpenSearch index, including a vector index, is stored in shards. Each shard is a Lucene index. Each shard is further divided into segments, which are created during ingestion and are immutable once created. For indices that contain k-NN fields, the architecture is the same at the OpenSearch and Lucene level, and the k-NN plugin uses that same architecture to support ANN search. During segment creation, in addition to the data structures needed for the other fields (such as FSTs, BKD trees, and DocValues), the k-NN plugin creates the vector data structures for each vector field. These files are written as segment files and tracked by Lucene.
During an ANN search, we load these vector data structure files into memory (not the JVM heap) if they are not already loaded, and then perform the search using the respective library.
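
To make the per-segment model concrete, here is a rough sketch of a shard-level search that queries every segment independently and then merges the per-segment results into one top-k list. It is an illustration only: a brute-force scan stands in for the real ANN library call, and the names `search_segment` and `search_shard` are hypothetical, not plugin APIs.

```python
import heapq
import math


def search_segment(segment, query, k):
    """Stand-in for an ANN search over one segment's vector structure.

    `segment` maps a segment-local doc id to its vector. Returns up to k
    (distance, doc_id) pairs, nearest first. A real engine would walk an
    HNSW graph or IVF lists here instead of scanning every vector.
    """
    scored = ((math.dist(vec, query), doc_id) for doc_id, vec in segment.items())
    return heapq.nsmallest(k, scored)


def search_shard(segments, query, k):
    """Search each segment separately, then merge into a shard-level top k."""
    candidates = []
    for seg_id, segment in enumerate(segments):
        # One independent search per segment: work grows with segment count.
        for dist, doc_id in search_segment(segment, query, k):
            candidates.append((dist, seg_id, doc_id))
    return heapq.nsmallest(k, candidates)


if __name__ == "__main__":
    segments = [
        {0: (0.0, 0.0), 1: (5.0, 5.0)},
        {0: (1.0, 0.0)},
    ]
    print(search_shard(segments, (0.0, 0.0), k=2))
```

Because doc ids are local to a segment, each result has to carry its segment id, and the number of separate index searches grows with the number of segments in the shard.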

![Image](a)

Pain Points with Lucene Segments

As of version 2.18 of the k-NN plugin in OpenSearch, vector data structures are built per Lucene segment. This comes from the Lucene library. To get better performance, customers most often have to merge segments down, which is an expensive process. Features like Concurrent Segment Search improve performance, but the result is still nowhere close to single-segment performance and throughput.

Why should we solve the above problems?

Given the competition and innovation happening in the vector search space, this is the need of the hour. Moving to a new architecture is never easy, and in recent years we have made a fair attempt to solve the segment problems with features such as Concurrent Segment Search and greedy creation of vector data structures. We moved the needle, but those were patches to the problem rather than a complete solution.

Features of an Ideal Solution

  1. The proposed solution should handle all current k-NN query-level use cases, including hybrid search, complex nested queries, efficient filtering, and different types of quantization.
  2. The solution should support the basic OpenSearch operations such as snapshots, recovery, and replicas for durability.

High Level Architecture

The sections below currently contain only the diagrams and few details; I will keep adding more details in the upcoming days.

Flush Flow

  1. Lucene DocId to VectorId Mapper: A small translation layer that maps (segment_id, docId) to the VectorIndex DocId. It will be persisted with the segment file and used during queries to reverse-map a VectorIndex DocId to its segment and docId.

  2. Native Index Service/Component: A component that abstracts how to interact with a single Faiss index at the shard level. It has multiple internal components, such as a write-ahead log. Details on those components will be added later.
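
The mapper described in item 1 can be pictured as a two-way lookup table. The sketch below is one possible shape under stated assumptions, not the RFC's design: the class name `DocIdVectorIdMapper`, sequential id allocation, and the in-memory dictionaries are all hypothetical, and persistence alongside the segment file is left out.

```python
from typing import Dict, Tuple

SegmentDoc = Tuple[str, int]  # (segment_id, Lucene docId within that segment)


class DocIdVectorIdMapper:
    """Two-way map between (segment_id, docId) and a shard-level vector id.

    Hypothetical illustration: ids are handed out sequentially and kept in
    memory; the RFC does not specify allocation or storage format.
    """

    def __init__(self) -> None:
        self._to_vector: Dict[SegmentDoc, int] = {}
        self._to_doc: Dict[int, SegmentDoc] = {}
        self._next_id = 0

    def assign(self, segment_id: str, doc_id: int) -> int:
        """Return the vector id for a document, allocating one if needed."""
        key = (segment_id, doc_id)
        if key in self._to_vector:
            return self._to_vector[key]
        vector_id = self._next_id
        self._next_id += 1
        self._to_vector[key] = vector_id
        self._to_doc[vector_id] = key
        return vector_id

    def resolve(self, vector_id: int) -> SegmentDoc:
        """Reverse-map a vector id from a search hit to (segment_id, docId)."""
        return self._to_doc[vector_id]


if __name__ == "__main__":
    mapper = DocIdVectorIdMapper()
    vid = mapper.assign("_0", 7)
    print(vid, mapper.resolve(vid))
```

At query time, the shard-level index would return vector ids, and `resolve` would turn each hit back into the segment and docId that the rest of the Lucene query path expects.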

Merge Flow

Image

Search Flow

TBA

Snapshots

TBA

Shard Initialization and Recovery

TBA

Persisting Vector Index on File System

TBA

@navneet1v navneet1v added enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. performance Make it fast! RFC Request for comments search-improvements labels Feb 18, 2025
@navneet1v navneet1v self-assigned this Feb 18, 2025