Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support custom codecs via stored fields #2547

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jmazanec15
Copy link
Member

Description

Following up on discussion in #2414. @bugmakerrrrrr had idea to serialize info about delegate as segment attribute. Something similar is done in https://github.com/opensearch-project/custom-codecs/blob/2.19.0.0/src/main/java/org/opensearch/index/codec/customcodecs/Lucene912CustomStoredFieldsFormat.java#L95-L100.

One short coming with this approach is that it only works for stored fields formats. We could generalize to write in segmentinfoformat. However, with that, I think we would need to create a custom format for all formats, which seems a bit heavy. So, just focusing on segmentinfoformat for now.

Still need to write some tests, but wanted to see what people thought early.

Related Issues

Resolves #2414

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Adds a segment info attribute that stores the delegate codec name as a
segment info attribute so that custom codecs can be used in conjunction
with the knn codec.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 changed the title Support custom codecs via stored fields. Support custom codecs via stored fields Feb 20, 2025
@navneet1v
Copy link
Collaborator

navneet1v commented Feb 20, 2025

@jmazanec15 and @bugmakerrrrrr this looks like a very small change, but I want us to think little bit on the scenario when k-NN doesn't provide its own codec. This I am envisioning will happen once we move KNNVector field to core: #1467. Once the KNNVectorField move to core the codec will also move there and if Core is giving the codec then I am little bit concerned if this approach will acceptable in core? and also how the whole thing will play out?

I am thinking about this because since this change will change the segmentInfo which is read by multiple different systems.

@jmazanec15
Copy link
Member Author

As long as we have a custom codec, we need to persist delegate information in order to properly create delegate on case of loading from SPI. I dont think there is a better way.

If there is a per field vector format extension point, we will still need the custom stored fields format to support derived source, so we will still need the custom codec. Further, I think for all of 3.x we will need the custom codec anyway to support upgrades from 2.x.

@bugmakerrrrrr
Copy link
Contributor

Once the KNNVectorField move to core the codec will also move there and if Core is giving the codec then I am little bit concerned if this approach will acceptable in core? and also how the whole thing will play out?

Even if knn field is moved to the core, the core will necessarily need to provide a way for the knn field to work with other codecs. And as long as the core provides the user with functionality consistent with what the current PR provides, it should be fine for the user. And for the existing indices, since the relevant information has been persisted, it can also be read normally.

@bugmakerrrrrr
Copy link
Contributor

As long as we have a custom codec, we need to persist delegate information in order to properly create delegate on case of loading from SPI. I dont think there is a better way.

If there is a per field vector format extension point, we will still need the custom stored fields format to support derived source, so we will still need the custom codec. Further, I think for all of 3.x we will need the custom codec anyway to support upgrades from 2.x.

+1

Copy link
Contributor

@bugmakerrrrrr bugmakerrrrrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancements Increases software capabilities beyond original client specifications skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Using KNN codec in conjunction with custom codec
3 participants