[Feature branch] Lower bounds for min-max normalization in hybrid query #1195

Conversation

martin-gaievski
Member

@martin-gaievski martin-gaievski commented Feb 24, 2025

Description

This PR adds a lower_bounds parameter to the min-max normalization technique for hybrid queries.
Users can create a search pipeline with lower_bounds using the following request:

{
  "description": "Normalization processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max",
          "parameters": {
            "lower_bounds": [
              {
                "mode": "apply",
                "min_score": 0.1
              },
              {
                "mode": "clip",
                "min_score": 0.1
              },
              {
                "mode": "ignore"
              }
            ]
          }
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}
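
As a rough illustration of how the three mode values in the request above might be parsed, here is a minimal Java sketch. The enum name LowerBoundModeSketch and the fromString helper are hypothetical and are not taken from this PR; only the strings apply, clip, and ignore come from the request body.

```java
import java.util.Locale;

// Hypothetical sketch: models the "mode" values shown in the request body.
// The type and method names are assumptions, not the PR's actual code.
enum LowerBoundModeSketch {
    APPLY,
    CLIP,
    IGNORE;

    // Parses a user-supplied mode string, ignoring case and surrounding whitespace.
    static LowerBoundModeSketch fromString(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("mode must not be empty");
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unsupported lower bound mode: " + value, e);
        }
    }
}
```

Rejecting unknown values up front keeps a typo in the pipeline definition from silently changing how scores are normalized.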

I'm planning to merge this into a feature branch in the main repo and keep it there until the app sec team gives a sign-off.

Related Issues

#150
Implemented design described in RFC #1189
PR for adding documentation: opensearch-project/documentation-website#9337

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski changed the title Lower bounds for min-max normalization in hybrid query [To Feature branch] Lower bounds for min-max normalization in hybrid query Feb 24, 2025

codecov bot commented Feb 25, 2025

Codecov Report

Attention: Patch coverage is 83.85417% with 31 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (c36ca15) to head (c6d179e).
Report is 3 commits behind head on feature/lower_bounds_for_minmax_normalization.

Files with missing lines Patch % Lines
...search/neuralsearch/processor/CompoundTopDocs.java 55.55% 8 Missing and 12 partials ⚠️
...rmalization/MinMaxScoreNormalizationTechnique.java 92.92% 2 Missing and 5 partials ⚠️
...rocessor/normalization/ScoreNormalizationUtil.java 89.47% 1 Missing and 3 partials ⚠️
Additional details and impacted files
@@                                 Coverage Diff                                 @@
##             feature/lower_bounds_for_minmax_normalization    #1195      +/-   ##
===================================================================================
+ Coverage                                            81.74%   81.79%   +0.05%     
- Complexity                                            2509     2606      +97     
===================================================================================
  Files                                                  190      190              
  Lines                                                 8560     8922     +362     
  Branches                                              1436     1520      +84     
===================================================================================
+ Hits                                                  6997     7298     +301     
- Misses                                                1006     1034      +28     
- Partials                                               557      590      +33     


@martin-gaievski martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch from dbf2790 to 89999f9 Compare February 25, 2025 01:17
@@ -57,7 +57,7 @@ public final class HybridQueryBuilder extends AbstractQueryBuilder<HybridQueryBu

private Integer paginationDepth;

- static final int MAX_NUMBER_OF_SUB_QUERIES = 5;
+ public static final int MAX_NUMBER_OF_SUB_QUERIES = 5;
Member Author

@martin-gaievski martin-gaievski Feb 25, 2025

Making this public so it can be read and used for validation in other places downstream.
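
A downstream check that reuses the constant might look roughly like the sketch below. This is an assumption about the kind of validation meant here: the class name, the method name, and the rule that lower_bounds may not have more entries than the sub-query limit are all hypothetical and not confirmed by this thread.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a downstream check; not the PR's actual code.
final class LowerBoundsSizeValidationSketch {
    // Mirrors HybridQueryBuilder.MAX_NUMBER_OF_SUB_QUERIES from the diff above.
    static final int MAX_NUMBER_OF_SUB_QUERIES = 5;

    // Assumption: there is at most one lower_bounds entry per sub-query,
    // so the list can never be longer than the sub-query limit.
    static void validateLowerBoundsSize(List<Map<String, Object>> lowerBounds) {
        if (lowerBounds.size() > MAX_NUMBER_OF_SUB_QUERIES) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "lower_bounds has %d entries, maximum allowed is %d",
                    lowerBounds.size(),
                    MAX_NUMBER_OF_SUB_QUERIES
                )
            );
        }
    }
}
```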

@martin-gaievski martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch 8 times, most recently from c1afd64 to 199798d Compare February 26, 2025 01:49
@martin-gaievski martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch from 199798d to 70a59c0 Compare February 26, 2025 18:33
@martin-gaievski martin-gaievski changed the title [To Feature branch] Lower bounds for min-max normalization in hybrid query [Feature branch] Lower bounds for min-max normalization in hybrid query Feb 26, 2025
@martin-gaievski martin-gaievski marked this pull request as ready for review February 26, 2025 22:46
Member

@vibrantvarun vibrantvarun left a comment

Completed 1st round of review

if (firstDoc.doc != secondDoc.doc || Float.compare(firstDoc.score, secondDoc.score) != 0) {
return false;
}
if (firstDoc instanceof FieldDoc != secondDoc instanceof FieldDoc) {
Member

Shall we extract the FieldDoc validation into a separate method, compareFieldDocs, and have the root level just check instanceof and call the respective method?

Member Author

I'm not sure why that's needed.

Member

@vibrantvarun vibrantvarun left a comment

Looks good to me.

@martin-gaievski martin-gaievski added the v3.0.0 v3.0.0 label Feb 27, 2025
Member

@owaiskazi19 owaiskazi19 left a comment

Took a look at everything except the test file; the rest LGTM.

public float normalize(float score, float minScore, float maxScore, float lowerBoundScore) {
// if we apply the lower bound this mean we use actual score in case it's less then the lower bound min score
// same applied to case when actual max_score is less than lower bound min score
if (maxScore < lowerBoundScore || score < lowerBoundScore) {
Member

Why are we checking maxScore here? Shouldn't it be minScore that needs to be compared with lowerBoundScore

Member Author

This is an edge case I bumped into while working on the implementation. Imagine most of your docs are not very relevant to the query and the lower bound min_score is relatively high; this is the scenario we're catching here. We're effectively falling back to traditional min-max for lack of a better option.

Member

Shouldn't we also check the case where minScore < lowerBoundScore? In that case we could use min_score itself.

Member Author

This should be the common case: minScore should be lower than the lowerBound score in general. Here we mainly care how the actual score compares with the lowerBound score, not how minScore compares with it. If the doc score is >= the lowerBound score, we use lowerBound as the min score:

(score - lowerBoundScore) / (maxScore - lowerBoundScore);

If the doc score is < the lowerBound score, we keep the raw min-max formula:

(score - minScore) / (maxScore - minScore)
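
The two branches described in this reply, together with the maxScore check from the quoted condition, can be sketched in Java as follows. This is a minimal sketch, not the PR's actual method body: the class name is hypothetical, and the guard for a zero-width score range (returning 1.0f) is an assumption added only to avoid division by zero.

```java
// Hypothetical sketch of the "apply" behaviour discussed in this thread.
final class LowerBoundApplySketch {

    static float normalize(float score, float minScore, float maxScore, float lowerBoundScore) {
        // Fall back to plain min-max when the bound sits above the best score,
        // or when this document scores below the bound.
        boolean useRawMinMax = maxScore < lowerBoundScore || score < lowerBoundScore;
        float effectiveMin = useRawMinMax ? minScore : lowerBoundScore;

        // Assumption (not from the PR): treat a zero-width range as a perfect score.
        if (Float.compare(maxScore, effectiveMin) == 0) {
            return 1.0f;
        }
        return (score - effectiveMin) / (maxScore - effectiveMin);
    }
}
```

For example, with minScore 0.2, maxScore 1.0, and a lower bound of 0.1, a score of 0.5 is normalized against the bound, giving (0.5 - 0.1) / (1.0 - 0.1). With minScore 0.0 and the same bound, a score of 0.05 is below the bound and is normalized against minScore instead.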


private float extractAndValidateMinScore(Map<String, Object> lowerBound) {
Object minScoreObj = lowerBound.get(PARAM_NAME_LOWER_BOUND_MIN_SCORE);
if (minScoreObj == null) {
Member

A question here. If a user defines lower bounds, shouldn't we require them to provide a min score? Otherwise, is there any point in defining a lower bound?

Member Author

I think that was one of the asks when we reviewed the design/RFC. We can use a default of 0.0, which gives reasonable results and simplifies the interface for the end user.

Member

What's the point of defining a lower bound if we are directing it to the min score?

Member Author

Not sure what you mean. I was referring to the scenario where the user sets just the lower bound; in this case we use a score of 0.0 as min_score for that lower bound. The reasons are a simpler request syntax and relatively good defaults for entry-level users. A user may not know the optimal min_score for the lower bound, or may be fine with our defaults.
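
The default described here could be sketched in Java as below. The constant name PARAM_NAME_LOWER_BOUND_MIN_SCORE appears in the diff above; its value "min_score" is inferred from the request body, and the class name, the DEFAULT_LOWER_BOUND_SCORE constant, and the parsing details are assumptions rather than the PR's actual code.

```java
import java.util.Map;

// Hypothetical sketch of falling back to 0.0 when min_score is omitted.
final class LowerBoundMinScoreSketch {
    // Value inferred from the request body in the PR description.
    static final String PARAM_NAME_LOWER_BOUND_MIN_SCORE = "min_score";
    // Default mentioned in this reply; the constant name is an assumption.
    static final float DEFAULT_LOWER_BOUND_SCORE = 0.0f;

    static float extractMinScore(Map<String, Object> lowerBound) {
        Object minScoreObj = lowerBound.get(PARAM_NAME_LOWER_BOUND_MIN_SCORE);
        if (minScoreObj == null) {
            // The user supplied only a mode, so use the default bound.
            return DEFAULT_LOWER_BOUND_SCORE;
        }
        if (minScoreObj instanceof Number) {
            return ((Number) minScoreObj).floatValue();
        }
        try {
            return Float.parseFloat(String.valueOf(minScoreObj));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid min_score: " + minScoreObj, e);
        }
    }
}
```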

Member

So in the scenario where the user just defines a lower bound but doesn't pass any min score, it will fall back to the default value, right?
In such a scenario, rather than using the default min value, shouldn't we use the original min score we get from the sub-query itself?

if (score < minScore) {
return 0.0f;
}
if (maxScore < lowerBoundScore) {
Member

Same here

Member Author

Please check my previous response; this is the case when the user's lower bound is too high.

Signed-off-by: Martin Gaievski <[email protected]>
Member

@junqiu-lei junqiu-lei left a comment

Overall LGTM.

Member

@owaiskazi19 owaiskazi19 left a comment

Looks good overall, with a few questions.

@martin-gaievski martin-gaievski merged commit 3a6eab3 into opensearch-project:feature/lower_bounds_for_minmax_normalization Feb 28, 2025
51 checks passed
Labels
v3.0.0 v3.0.0
4 participants