Move away from custom compoundformat #2536

Open
jmazanec15 opened this issue Feb 18, 2025 · 0 comments
Labels
Refactoring Improve the design, structure, and implementation while preserving its functionality v3.0.0

jmazanec15 commented Feb 18, 2025

Description

Recently, we changed the native libraries to interact with index files via Lucene's IndexInput and IndexOutput. With this change, we should be able to remove the custom CompoundFormat from our codec (see #2185).

However, when removing it, we get an error like:

Caused by: org.apache.lucene.index.CorruptIndexException: compound sub-files must have a valid codec header and footer: codec header mismatch: actual header=1232620912 vs expected header=1071082519 (resource=BufferedChecksumIndexInput(MemorySegmentIndexInput(path="/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/6zG12XzjQLaWyWiAL_OFaQ/0/index/_0_165_test_nested.test_vector.faiss")))
        at org.apache.lucene.codecs.CodecUtil.verifyAndCopyIndexHeader(CodecUtil.java:287) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]

This happens because, for the native index files, we write a footer but no header (see https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexWriter.java#L141-L150), while Lucene's compound format expects every sub-file to have a valid codec header and footer (see the CompoundFormat interface: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/CompoundFormat.java#L41-L45).
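To make the mismatch concrete, here is a small sketch that decodes the two header values from the exception. `HeaderMagicDecoder` is a hypothetical helper written for this issue, not part of k-NN or Lucene. The expected value equals Lucene's `CodecUtil.CODEC_MAGIC`, while the actual value is four printable ASCII bytes, which suggests Lucene is reading the first bytes of the native library's own payload (presumably a Faiss type tag, given the `.faiss` extension) where it expects a codec header.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of k-NN or Lucene.
final class HeaderMagicDecoder {
    // Same value as Lucene's CodecUtil.CODEC_MAGIC, the first int of a codec header.
    static final int CODEC_MAGIC = 0x3fd76c17;

    // Interprets an int as four big-endian bytes and renders them as ASCII.
    static String asAscii(int value) {
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int expected = 1071082519; // "expected header" from the exception
        int actual = 1232620912;   // "actual header" from the exception

        System.out.println(expected == CODEC_MAGIC);     // true
        System.out.println(Integer.toHexString(actual)); // 49784d70
        System.out.println(asAscii(actual));             // IxMp
    }
}
```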

We should get rid of the custom CompoundFormat so we can move towards just extending Lucene's per-field vector format (PerFieldKnnVectorsFormat). To do this, we need to write the codec header when creating each native index file, and on the read path consume the header before handing the rest of the file to the underlying libraries.
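The write-then-skip idea can be sketched without Lucene on the classpath. The following is a simplified stand-in, not the actual k-NN change: `WrappedNativeFile` and its methods are hypothetical, the header is reduced to the magic int alone, and the footer omits the checksum. In the real codec, the equivalent steps would presumably go through `CodecUtil.writeIndexHeader`, `CodecUtil.checkIndexHeader` and `CodecUtil.writeFooter`, whose headers also carry the codec name, version, segment ID and suffix.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical, simplified stand-in for a header/footer-wrapped native index file.
final class WrappedNativeFile {
    static final int HEADER_MAGIC = 0x3fd76c17;    // same value as Lucene's CODEC_MAGIC
    static final int FOOTER_MAGIC = ~HEADER_MAGIC; // Lucene's FOOTER_MAGIC is ~CODEC_MAGIC

    // Write path: header first, then the bytes produced by the native library, then the footer.
    static byte[] write(byte[] nativePayload) {
        return ByteBuffer.allocate(Integer.BYTES + nativePayload.length + Integer.BYTES)
            .putInt(HEADER_MAGIC)
            .put(nativePayload)
            .putInt(FOOTER_MAGIC)
            .array();
    }

    // Read path: validate and consume the header, then return only the bytes
    // that should be forwarded to the native library.
    static byte[] readPayload(byte[] file) {
        if (file.length < 2 * Integer.BYTES) {
            throw new IllegalStateException("file too short for header and footer");
        }
        int header = ByteBuffer.wrap(file).getInt();
        if (header != HEADER_MAGIC) {
            throw new IllegalStateException(
                "header mismatch: actual=" + header + " vs expected=" + HEADER_MAGIC);
        }
        return Arrays.copyOfRange(file, Integer.BYTES, file.length - Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] payload = {'I', 'x', 'M', 'p', 1, 2, 3, 4};
        byte[] file = write(payload);
        System.out.println(Arrays.equals(payload, readPayload(file))); // true
    }
}
```

Passing a file that starts directly with the native payload to `readPayload` fails with a header mismatch, which mirrors the situation described above.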

From a backward-compatibility (bwc) perspective, we will need to keep the old CompoundFormat around for old codecs. But we should be able to remove it for new codecs.
