-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature branch] Lower bounds for min-max normalization in hybrid query #1195
Conversation
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: Martin Gaievski <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/lower_bounds_for_minmax_normalization #1195 +/- ##
===================================================================================
+ Coverage 81.74% 81.79% +0.05%
- Complexity 2509 2606 +97
===================================================================================
Files 190 190
Lines 8560 8922 +362
Branches 1436 1520 +84
===================================================================================
+ Hits 6997 7298 +301
- Misses 1006 1034 +28
- Partials 557 590 +33 ☔ View full report in Codecov by Sentry. |
dbf2790
to
89999f9
Compare
@@ -57,7 +57,7 @@ public final class HybridQueryBuilder extends AbstractQueryBuilder<HybridQueryBu | |||
|
|||
private Integer paginationDepth; | |||
|
|||
static final int MAX_NUMBER_OF_SUB_QUERIES = 5; | |||
public static final int MAX_NUMBER_OF_SUB_QUERIES = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making this public to read and make validation in other places down the stream
c1afd64
to
199798d
Compare
Signed-off-by: Martin Gaievski <[email protected]>
199798d
to
70a59c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed 1st round of review
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
if (firstDoc.doc != secondDoc.doc || Float.compare(firstDoc.score, secondDoc.score) != 0) { | ||
return false; | ||
} | ||
if (firstDoc instanceof FieldDoc != secondDoc instanceof FieldDoc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we extract fielddocs validation seperately in method compareFieldDocs? and at root level itself check for instanceOf and call respective methid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I not sure why that's needed
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/processor/normalization/ScoreNormalizationUtil.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Martin Gaievski <[email protected]>
… conditions Signed-off-by: Martin Gaievski <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look except the test file and rest LGTM
.../java/org/opensearch/neuralsearch/processor/normalization/L2ScoreNormalizationTechnique.java
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Outdated
Show resolved
Hide resolved
public float normalize(float score, float minScore, float maxScore, float lowerBoundScore) { | ||
// if we apply the lower bound this mean we use actual score in case it's less then the lower bound min score | ||
// same applied to case when actual max_score is less than lower bound min score | ||
if (maxScore < lowerBoundScore || score < lowerBoundScore) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we checking maxScore
here? Shouldn't it be minScore
that needs to be compared with lowerBoundScore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an edge case I bumped into while working on implementation. Imagine most of your docs are not very relevant to the query and the lower bound min_score is relatively high - this is the scenario we're catching here. We effectively falling back to traditional min-max for the lack of better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also check for the case if minScore < lowerBoundScore? In that case we could use min_score itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be common case, minScore should be lower then the lowerBound score in general. In this case we mainly case how actual score compares with lowerBound score, not minScore with lowerBound score. In case doc score is >= then lowerBound score we use lowerBound as min score:
(score - lowerBoundScore) / (maxScore - lowerBoundScore);
in case doc score < lowerBound score then we keep raw min-max formula:
(score - minScore) / (maxScore - minScore)
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
|
||
private float extractAndValidateMinScore(Map<String, Object> lowerBound) { | ||
Object minScoreObj = lowerBound.get(PARAM_NAME_LOWER_BOUND_MIN_SCORE); | ||
if (minScoreObj == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question here. If a user is defining lower bounds then shouldn't we mandate them to provide a min score otherwise is there any point to define lower bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was one of the asks when we reviewed the design/RFC, we can use default of 0.0 that is giving reasonable results and simplify the interface for end user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of defining a lower bound if we are directing it to the min score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what do you mean. I was referring to scenario when user sets just lower bound, in this case we use 0.0 score as min_score for that lower bound. Reason for it is a simpler request syntax and relatively good defaults for entry level users. User can be not aware of optimal min_score for lower bound, or can be fine with our defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the scenario where user just define a lower bound but doesn't pass any min score then it will be directed to default value, right?
Wouldn't in such scenario rather than using the default min value we should have used the original min score we get from the sub query itself?
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
if (score < minScore) { | ||
return 0.0f; | ||
} | ||
if (maxScore < lowerBoundScore) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please check my previous response, it's the case when user has the lower bound that is too high
Signed-off-by: Martin Gaievski <[email protected]>
...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java
Show resolved
Hide resolved
src/test/java/org/opensearch/neuralsearch/query/HybridQueryExplainIT.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
…ual max score Signed-off-by: Martin Gaievski <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall with few questions
3a6eab3
into
opensearch-project:feature/lower_bounds_for_minmax_normalization
Description
I'm adding
lower_bounds
parameter for min-max normalization technique of hybrid query.User will be able to create search pipeline with
lower_bounds
using following request:I'm planning to merge this into feature branch in main repo until app sec team give a signoff.
Related Issues
#150
Implemented design described in RFC #1189
PR for adding documentation: opensearch-project/documentation-website#9337
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.