[FEATURE] Reuse KNNVectorFieldData for reduce disk usage #1572

Open
luyuncheng opened this issue Mar 20, 2024 · 1 comment
Assignees
Labels
enhancement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label storage-improvements v2.19.0

Comments

@luyuncheng
Collaborator

luyuncheng commented Mar 20, 2024

Description

In some scenarios, we want to reduce the disk usage and I/O throughput caused by the _source field. To do that, we can exclude the k-NN fields from _source in the mapping so that they are not stored in the source (the downside is that the k-NN fields can then no longer be retrieved or rebuilt from _source):

"mappings": {
  "_source": {
    "excludes": [
      "target_field1",
      "target_field2"
    ]
  }
}

So I propose using doc values to return the vector fields, like:

POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
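As a rough illustration, the mapping fragment and the search body above can be built as plain Python dicts and serialized with the standard `json` module, which guarantees syntactically valid JSON. The helper names `build_mapping` and `build_search_body` are hypothetical and only exist for this sketch; the field names are the placeholders used in this issue.

```python
import json


def build_mapping(excluded_fields):
    """Mapping fragment that drops the given fields from _source."""
    return {"mappings": {"_source": {"excludes": list(excluded_fields)}}}


def build_search_body(docvalue_fields):
    """Search body that skips _source and reads the fields from doc values."""
    return {"docvalue_fields": list(docvalue_fields), "_source": False}


if __name__ == "__main__":
    # Placeholder field names taken from the issue text.
    print(json.dumps(build_mapping(["target_field1", "target_field2"]), indent=2))
    print(json.dumps(build_search_body(["vector_field1", "vector_field2"]), indent=2))
```

The first dict is the body for index creation and the second is the body for `POST some_index/_search`; sending them is left to whatever client is in use.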

Proposal

  1. Rewrite KNNVectorDVLeafFieldData to get data from doc values

I rewrote KNNVectorDVLeafFieldData so that KNN80BinaryDocValues can return the requested k-NN docvalue_fields, like this (vector_field1 is of the k-NN field type):

"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]

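In the sample response above, the vector components come back as strings. A client that wants numeric vectors would need to convert them. The following is a minimal sketch, assuming the response shape shown in this issue; the function name `vectors_from_hits` is hypothetical, and whether a final implementation returns strings or numbers is not settled here.

```python
import json


def vectors_from_hits(hits, field):
    """Map each hit's _id to the field's doc values parsed as floats."""
    result = {}
    for hit in hits:
        values = hit.get("fields", {}).get(field)
        if values is not None:
            result[hit["_id"]] = [float(v) for v in values]
    return result


# The "hits" array copied from the sample response in this issue.
SAMPLE_HITS = json.loads(
    '[{"_index":"test","_id":"1","_score":1.0,'
    '"fields":{"vector_field1":["1.5","2.5"]}},'
    '{"_index":"test","_id":"2","_score":1.0,'
    '"fields":{"vector_field1":["2.5","1.5"]}}]'
)

if __name__ == "__main__":
    print(vectors_from_hits(SAMPLE_HITS, "vector_field1"))
```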
Optimization result:
1M SIFT dataset, 1 shard:
with source stored: 1389 MB
without source stored: 1055 MB (-24%)
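The percentage follows directly from the two sizes reported above. A short check, using only those two numbers (the function name `relative_change` is just for this sketch):

```python
def relative_change(before_mb, after_mb):
    """Relative size change going from before_mb to after_mb."""
    return (after_mb - before_mb) / before_mb


WITH_SOURCE_MB = 1389     # index size with source stored
WITHOUT_SOURCE_MB = 1055  # index size without source stored

saved_mb = WITH_SOURCE_MB - WITHOUT_SOURCE_MB
change = relative_change(WITH_SOURCE_MB, WITHOUT_SOURCE_MB)

if __name__ == "__main__":
    print(f"saved {saved_mb} MB ({change:.1%})")
```

That is 334 MB saved on this dataset, or roughly a quarter of the index size.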

To dive further into the k-NN doc values fields: I think that with the faiss engine we could use the reconstruct_n interface to retrieve specific doc values, which would save the disk space used by BinaryDocValuesFormat. Alternatively, as suggested in this issue's comments, we could redesign this around a KnnVectorsFormat.

  2. Composite the vector fields into _source

I added KNNFetchSubPhase with a processor, similar to FetchSourcePhase#FetchSubPhaseProcessor, that combines the docvalue_fields into _source, much like synthetic source logic.
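To make the idea concrete, here is a client-side illustration of the merge step in Python. It is not the Java KNNFetchSubPhase from the PR, only a sketch of the same logic under simple assumptions: hits are plain dicts shaped like the sample response, vector values arrive as strings, and field names are top-level (no nested paths). The function name `merge_fields_into_source` is hypothetical.

```python
def merge_fields_into_source(hit, vector_fields):
    """Return a copy of hit whose _source also contains the vector fields.

    Values are taken from the hit's doc value "fields" section. Fields that
    are already present in _source are left untouched.
    """
    source = dict(hit.get("_source") or {})
    fields = hit.get("fields", {})
    for name in vector_fields:
        if name in fields and name not in source:
            source[name] = [float(v) for v in fields[name]]
    return {**hit, "_source": source}


if __name__ == "__main__":
    hit = {
        "_index": "test",
        "_id": "1",
        "_source": {"title": "example"},  # vector was excluded from _source
        "fields": {"vector_field1": ["1.5", "2.5"]},
    }
    print(merge_fields_into_source(hit, ["vector_field1"])["_source"])
```

The original hit is not modified, so the doc value section stays available to callers that still want it.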

Do you have any additional context?
This was discussed in issue #1087, which also contains some other ideas.
My PR is #1571

To dive further into the k-NN doc values fields: I think that with the faiss engine we could use the reconstruct_n interface to retrieve specific doc values and save the disk space used by BinaryDocValuesFormat. Or, as in #1087, we could use KnnVectorsFormat.

But the idea I want to show here is simply to reduce disk usage (there is an issue, opensearch-project/OpenSearch#6356, that discusses this) while keeping, as far as possible, the source that reindex needs. I think PR #1571 reduces disk usage and keeps the source in a synthetic way.

@vamshin vamshin moved this from Backlog to 2.14.0 in Vector Search RoadMap Mar 22, 2024
@vamshin vamshin added v2.14.0 and removed untriaged labels Mar 27, 2024
@navneet1v navneet1v added the indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. label Apr 9, 2024
@navneet1v navneet1v changed the title [FEATURE]Reuse KNNVectorFieldData for reduce disk usage [FEATURE] Reuse KNNVectorFieldData for reduce disk usage Apr 9, 2024
@jmazanec15
Member

I think we are going to need to push this to 2.15.

@vamshin vamshin moved this from 2.14.0 to 2.17.0 in Vector Search RoadMap Jul 25, 2024
@naveentatikonda naveentatikonda moved this from 2.17.0 to Now(This Quarter) in Vector Search RoadMap Aug 20, 2024
@vamshin vamshin added indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. Roadmap:Vector Database/GenAI Project-wide roadmap label and removed indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. labels Oct 4, 2024
@vamshin vamshin added the v2.19.0 label Nov 1, 2024
Projects
Status: 2.19.0
Development

No branches or pull requests

5 participants