Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace need for KnnVectorValues.copy() with a dictionary interface #13831

Open
msokolov opened this issue Sep 28, 2024 · 6 comments
Open

Replace need for KnnVectorValues.copy() with a dictionary interface #13831

msokolov opened this issue Sep 28, 2024 · 6 comments

Comments

@msokolov
Copy link
Contributor

Description

The idea is to get rid of copy() since it does more than is needed. What's needed is an independent vector accessor over the same values source. Adrien had suggested a dictionary() interface that would take over the random-access API from the values source (or in addition to it?)
See discussion in #13779 for more details.

@msokolov
Copy link
Contributor Author

msokolov commented Oct 1, 2024

I started looking at this and realized that we would really want to wait for another major release for it. Otherwise we'd have a deprecated method and have to carry two parallel implementations on 10.x which would be kind of annoying. This shouldn't be so hard but because we copy our codec code every time we have a minor release of the codec we end up having to do a lot of copies of the same change

@ChrisHegarty
Copy link
Contributor

Unless I'm mistaken, then KnnVectorValues and friends are "lucene.experimental" APIs. So, while maybe a little annoying, can be changed in a future 10.x release. No need to wait for a major release.

To make this more concrete, we can remove the copy method and add the Dictionary interface in, say 10.1.0, if we choose to do so.

@msokolov
Copy link
Contributor Author

msokolov commented Oct 3, 2024

ah, thanks good point, I'm forging ahead now

@msokolov
Copy link
Contributor Author

msokolov commented Oct 7, 2024

I worked this up and I think it will be an improvement - the basic idea is don't copy the vectorvalues, but copy a smaller wrapper around any shared scratch vector data (and clone any underlying IndexInput slices). I have it all working except for the MemorySegment-based scorers. I get TestVectorScorer.copiesAcrossThreads failures. I've checked carefully that the two scorers are different, each with its own clone of the underlying memory-segment based IndexInput slices and I'm scratching my head trying to figure out what I might have broken. I do see that the MemorySegmentIndexInput clones share their underlying MemorySegments - I hope that's OK? Any surmises would be helpful. If it helps I can post a PR with bugs - it's kind of big though and I still have some doubts about naming of various things

@benwtrent
Copy link
Member

but copy a smaller wrapper around any shared scratch vector data

The scratch arrays also need new instances. You are doing this as well correct?

@msokolov
Copy link
Contributor Author

msokolov commented Oct 7, 2024

yes I had to restructure Lucene99MemorySegmentByteVectorScorerSupplier a bit so it would create the scratch arrays more granularly .. but I think the test is not actually using that class but DefaultFlatVectorScorer.ByteScoringSupplier. Oh in fact I think see my error - basically I made the copies in the ScorerSupplier rather than in the Scorer, so they were being "over shared". Phew, thanks! Will post a PR soon maybe in breaks from Apache CoC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants