[FEATURE] Reuse KNNVectorFieldData to reduce disk usage #1572
Labels
enhancement
indexing-improvements
Roadmap:Vector Database/GenAI
storage-improvements
v2.19.0
Description
In some scenarios we want to reduce the disk usage and I/O throughput for the source field. To do that, we would exclude the knn fields in the mapping so that they are not stored in `_source` (this means the knn field can no longer be retrieved or rebuilt). So I propose using a doc_values field for the vector fields.
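As an illustration of the mapping change described above, here is a minimal sketch that builds an index body whose `_source` excludes a knn field. It only constructs and prints the JSON; the helper name `build_index_body` and the dimension are assumptions for illustration (128 matches SIFT vectors), and `vector_field1` is the field name used later in this issue.

```python
import json


def build_index_body(vector_field: str, dimension: int) -> dict:
    """Build an index body that keeps the knn field out of _source.

    Hypothetical helper for illustration; it is not part of the k-NN plugin.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            # The vector is indexed but not stored in _source, which saves disk
            # space but means it is not returned with the hit's _source.
            "_source": {"excludes": [vector_field]},
            "properties": {
                vector_field: {"type": "knn_vector", "dimension": dimension},
            },
        },
    }


index_body = build_index_body("vector_field1", 128)
print(json.dumps(index_body, indent=2))
```

The printed body could be sent when creating the index; the trade-off is exactly the one described above, since the excluded field is gone from `_source`.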
Proposal
`KNNVectorDVLeafFieldData` gets its data from doc values. I rewrote `KNNVectorDVLeafFieldData` and made `KNN80BinaryDocValues` able to return the requested knn `docvalue_fields` (here `vector_field1` is a knn field type).
Optimization result:
1M SIFT dataset, 1 shard:
with source stored: 1389 MB
without source stored: 1055 MB (-24%)
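To make the `docvalue_fields` retrieval from the proposal concrete, the following sketch builds a search body that runs a knn query and asks for the vector through `docvalue_fields`. Whether a knn field is actually returned this way depends on the change proposed in this issue; the helper name `build_search_body` and the query vector are assumptions for illustration.

```python
import json


def build_search_body(vector_field: str, query_vector: list, k: int) -> dict:
    """Build a knn search body that requests the vector via docvalue_fields.

    Hypothetical helper for illustration; it is not part of the k-NN plugin.
    """
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {"vector": query_vector, "k": k},
            }
        },
        # Ask for the vector from doc values instead of from _source.
        "docvalue_fields": [vector_field],
    }


search_body = build_search_body("vector_field1", [0.1] * 128, 2)
print(json.dumps(search_body)[:120])
```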
For a deeper dive into knn `docvalues` fields: I think that when the faiss engine is used, we could use the `reconstruct_n` interface to retrieve the specific doc values and save the disk usage of `BinaryDocValuesFormat`. Or, as in the linked issue comments, we could redesign a `KnnVectorsFormat`.
I added `KNNFetchSubPhase`, with a processor like `FetchSourcePhase#FetchSubPhaseProcessor`, to combine the `docvalue_fields` into `_source`, something like `synthetic` logic.
Do you have any additional context?
This was discussed in issue #1087, and there are some other ideas there.
My PR is #1571
For a deeper dive into knn `docvalues` fields: I think that when the faiss engine is used, we could use the `reconstruct_n` interface to retrieve the specific doc values and save the disk usage of `BinaryDocValuesFormat`. Or, as in #1087, we could use `KnnVectorsFormat`.
But the idea I want to show is simply to reduce the disk usage (issue opensearch-project/OpenSearch#6356 discusses this) while keeping, as far as possible, the source that reindex needs. I think PR #1571 does just that: it reduces the disk usage and keeps the source in a `synthetic` way.
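The idea of combining `docvalue_fields` into `_source` can be pictured with a small sketch. This is not the Java code from PR #1571; it is an assumed, simplified model in which a hit's stored `_source` and its doc-value vectors are plain dictionaries, and the function name `merge_docvalues_into_source` is hypothetical.

```python
from copy import deepcopy


def merge_docvalues_into_source(source: dict, docvalue_fields: dict) -> dict:
    """Return a copy of `source` with doc-value fields filled in.

    Fields already present in `source` are left untouched, so only the
    fields that were excluded from the stored source are rebuilt.
    """
    merged = deepcopy(source)
    for name, value in docvalue_fields.items():
        if name not in merged:
            merged[name] = value
    return merged


stored_source = {"title": "doc"}                 # vector was excluded from _source
doc_values = {"vector_field1": [1.0, 2.0, 3.0]}  # vector read back from doc values
merged_source = merge_docvalues_into_source(stored_source, doc_values)
print(merged_source)
```

Under this model the caller sees a complete source again, which is what reindex needs, while the vector is stored only once on disk.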