-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support custom codecs via stored fields #2547
base: main
Are you sure you want to change the base?
Conversation
Adds a segment info attribute that stores the delegate codec name as a segment info attribute so that custom codecs can be used in conjunction with the knn codec. Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
f0a57c4
to
2a83f33
Compare
@jmazanec15 and @bugmakerrrrrr this looks like a very small change, but I want us to think little bit on the scenario when k-NN doesn't provide its own codec. This I am envisioning will happen once we move KNNVector field to core: #1467. Once the KNNVectorField move to core the codec will also move there and if Core is giving the codec then I am little bit concerned if this approach will acceptable in core? and also how the whole thing will play out? I am thinking about this because since this change will change the segmentInfo which is read by multiple different systems. |
As long as we have a custom codec, we need to persist delegate information in order to properly create delegate on case of loading from SPI. I dont think there is a better way. If there is a per field vector format extension point, we will still need the custom stored fields format to support derived source, so we will still need the custom codec. Further, I think for all of 3.x we will need the custom codec anyway to support upgrades from 2.x. |
Even if knn field is moved to the core, the core will necessarily need to provide a way for the knn field to work with other codecs. And as long as the core provides the user with functionality consistent with what the current PR provides, it should be fine for the user. And for the existing indices, since the relevant information has been persisted, it can also be read normally. |
+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Following up on discussion in #2414. @bugmakerrrrrr had idea to serialize info about delegate as segment attribute. Something similar is done in https://github.com/opensearch-project/custom-codecs/blob/2.19.0.0/src/main/java/org/opensearch/index/codec/customcodecs/Lucene912CustomStoredFieldsFormat.java#L95-L100.
One short coming with this approach is that it only works for stored fields formats. We could generalize to write in segmentinfoformat. However, with that, I think we would need to create a custom format for all formats, which seems a bit heavy. So, just focusing on segmentinfoformat for now.
Still need to write some tests, but wanted to see what people thought early.
Related Issues
Resolves #2414
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.