[Feature branch] Lower bounds for min-max normalization in hybrid query #1195

martin-gaievski · 2025-02-24T23:58:28Z

Description

This PR adds a lower_bounds parameter to the min-max normalization technique for hybrid queries.
Users will be able to create a search pipeline with lower_bounds using the following request:

{
  "description": "Normalization processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max",
          "parameters": {
            "lower_bounds": [
              {
                "mode": "apply",
                "min_score": 0.1
              },
              {
                "mode": "clip",
                "min_score": 0.1
              },
              {
                "mode": "ignore"
              }
            ]
          }
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}

I'm planning to merge this into a feature branch in the main repo and keep it there until the AppSec team gives sign-off.

Related Issues

#150
Implemented design described in RFC #1189
PR for adding documentation: opensearch-project/documentation-website#9337

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Martin Gaievski <[email protected]>

codecov · 2025-02-25T00:09:44Z

Codecov Report

Attention: Patch coverage is 83.85417% with 31 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (c36ca15) to head (c6d179e).
Report is 3 commits behind head on feature/lower_bounds_for_minmax_normalization.

Files with missing lines	Patch %	Lines
...search/neuralsearch/processor/CompoundTopDocs.java	55.55%	8 Missing and 12 partials ⚠️
...rmalization/MinMaxScoreNormalizationTechnique.java	92.92%	2 Missing and 5 partials ⚠️
...rocessor/normalization/ScoreNormalizationUtil.java	89.47%	1 Missing and 3 partials ⚠️

Additional details and impacted files

@@                                 Coverage Diff                                 @@
##             feature/lower_bounds_for_minmax_normalization    #1195      +/-   ##
===================================================================================
+ Coverage                                            81.74%   81.79%   +0.05%     
- Complexity                                            2509     2606      +97     
===================================================================================
  Files                                                  190      190              
  Lines                                                 8560     8922     +362     
  Branches                                              1436     1520      +84     
===================================================================================
+ Hits                                                  6997     7298     +301     
- Misses                                                1006     1034      +28     
- Partials                                               557      590      +33

martin-gaievski · 2025-02-25T01:19:18Z

src/main/java/org/opensearch/neuralsearch/query/HybridQueryBuilder.java

@@ -57,7 +57,7 @@ public final class HybridQueryBuilder extends AbstractQueryBuilder<HybridQueryBu

    private Integer paginationDepth;

-    static final int MAX_NUMBER_OF_SUB_QUERIES = 5;
+    public static final int MAX_NUMBER_OF_SUB_QUERIES = 5;


Making this public so it can be read and used for validation in other places downstream.
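As an illustration of the kind of downstream validation this enables, here is a minimal sketch, not the code in this PR. It rejects a lower_bounds array that has more entries than the allowed number of sub-queries. The class name LowerBoundsValidationSketch, the validate method, the error messages and the mode check are assumptions made for illustration; only the limit of 5 and the mode names come from this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the plugin.
final class LowerBoundsValidationSketch {
    // Mirrors HybridQueryBuilder.MAX_NUMBER_OF_SUB_QUERIES from the diff above.
    static final int MAX_NUMBER_OF_SUB_QUERIES = 5;
    // Mode names taken from the request example in the PR description.
    static final Set<String> KNOWN_MODES = Set.of("apply", "clip", "ignore");

    static void validate(List<Map<String, Object>> lowerBounds) {
        // A lower bound is given per sub-query, so the array cannot be longer than the sub-query limit.
        if (lowerBounds.size() > MAX_NUMBER_OF_SUB_QUERIES) {
            throw new IllegalArgumentException(
                "lower_bounds has " + lowerBounds.size() + " entries, at most " + MAX_NUMBER_OF_SUB_QUERIES + " are allowed"
            );
        }
        for (Map<String, Object> lowerBound : lowerBounds) {
            Object mode = lowerBound.get("mode");
            // Assumption: an entry without a mode is left to the caller's default handling.
            if (mode != null && !KNOWN_MODES.contains(mode.toString())) {
                throw new IllegalArgumentException("unknown lower bound mode: " + mode);
            }
        }
    }
}
```

Reading the limit from a single public constant keeps the query builder and the normalization validation from drifting apart.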

vibrantvarun

Completed 1st round of review

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

vibrantvarun · 2025-02-26T22:59:48Z

src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java

+            if (firstDoc.doc != secondDoc.doc || Float.compare(firstDoc.score, secondDoc.score) != 0) {
+                return false;
+            }
+            if (firstDoc instanceof FieldDoc != secondDoc instanceof FieldDoc) {


Shall we extract the FieldDoc validation into a separate method, compareFieldDocs? Then we can check instanceof at the top level and call the respective method.

I'm not sure why that's needed.
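For reference, the extraction the reviewer proposes could look roughly like the sketch below. It is not the code in this PR. ScoreDoc and FieldDoc here are minimal stand-ins for the Lucene classes of the same name so that the example is self-contained, and the sameDoc and compareFieldDocs bodies, including the assumption that FieldDocs are compared by their fields array, are illustrative only.

```java
import java.util.Arrays;

// Hypothetical sketch; the nested classes are stand-ins for Lucene's ScoreDoc and FieldDoc.
final class ScoreDocComparisonSketch {
    static class ScoreDoc {
        final int doc;
        final float score;

        ScoreDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static class FieldDoc extends ScoreDoc {
        final Object[] fields;

        FieldDoc(int doc, float score, Object[] fields) {
            super(doc, score);
            this.fields = fields;
        }
    }

    // Top-level check: doc id and score first, then the type, then the type-specific part.
    static boolean sameDoc(ScoreDoc first, ScoreDoc second) {
        if (first.doc != second.doc || Float.compare(first.score, second.score) != 0) {
            return false;
        }
        boolean firstIsFieldDoc = first instanceof FieldDoc;
        boolean secondIsFieldDoc = second instanceof FieldDoc;
        if (firstIsFieldDoc != secondIsFieldDoc) {
            return false;
        }
        return firstIsFieldDoc == false || compareFieldDocs((FieldDoc) first, (FieldDoc) second);
    }

    // Assumption: two FieldDocs match when their sort field values match element by element.
    static boolean compareFieldDocs(FieldDoc first, FieldDoc second) {
        return Arrays.equals(first.fields, second.fields);
    }
}
```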

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

src/main/java/org/opensearch/neuralsearch/processor/normalization/ScoreNormalizationUtil.java

vibrantvarun

Looks good to me.

owaiskazi19

Took a look at everything except the test file; the rest LGTM.

.../java/org/opensearch/neuralsearch/processor/normalization/L2ScoreNormalizationTechnique.java

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

owaiskazi19 · 2025-02-27T19:08:16Z

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

+                public float normalize(float score, float minScore, float maxScore, float lowerBoundScore) {
+                    // if we apply the lower bound this mean we use actual score in case it's less then the lower bound min score
+                    // same applied to case when actual max_score is less than lower bound min score
+                    if (maxScore < lowerBoundScore || score < lowerBoundScore) {


Why are we checking maxScore here? Shouldn't it be minScore that is compared with lowerBoundScore?

This is an edge case I bumped into while working on the implementation. Imagine most of your docs are not very relevant to the query and the lower bound min_score is relatively high; this is the scenario we're catching here. We effectively fall back to traditional min-max for lack of a better option.

Shouldn't we also check for the case where minScore < lowerBoundScore? In that case we could use min_score itself.

This should be the common case: minScore should generally be lower than the lowerBound score. Here we mainly care how the actual score compares with the lowerBound score, not how minScore compares with it. If the doc score is >= the lowerBound score, we use lowerBound as the min score:

(score - lowerBoundScore) / (maxScore - lowerBoundScore);

If the doc score is < the lowerBound score, we keep the raw min-max formula:

(score - minScore) / (maxScore - minScore)
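The two branches described in this thread can be sketched as a single method. This is a simplified illustration and not the implementation in MinMaxScoreNormalizationTechnique: the class name ApplyLowerBoundSketch, the method name normalizeApply and the value returned for a zero-width score range are assumptions.

```java
// Hypothetical sketch of the "apply" mode discussed above; not the plugin's actual code.
final class ApplyLowerBoundSketch {
    // Assumption: value returned when the range collapses to a single point, to avoid dividing by zero.
    static final float SINGLE_POINT_SCORE = 1.0f;

    static float normalizeApply(float score, float minScore, float maxScore, float lowerBoundScore) {
        // Edge case from the thread: the lower bound is higher than every score in the sub-query.
        boolean boundAboveAllScores = maxScore < lowerBoundScore;
        // The document itself scores below the lower bound.
        boolean scoreBelowBound = score < lowerBoundScore;
        // In both cases fall back to the actual min score, otherwise use the lower bound as the minimum.
        float effectiveMin = (boundAboveAllScores || scoreBelowBound) ? minScore : lowerBoundScore;
        if (Float.compare(maxScore, effectiveMin) == 0) {
            return SINGLE_POINT_SCORE;
        }
        return (score - effectiveMin) / (maxScore - effectiveMin);
    }
}
```

With minScore = 0.2, maxScore = 1.0 and a lower bound of 0.5, a score of 0.6 normalizes to about 0.2 and a score of 0.3 to about 0.125. With a lower bound of 2.0, the score 0.6 falls back to plain min-max and gives about 0.5.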

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

owaiskazi19 · 2025-02-27T19:16:36Z

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

+
+    private float extractAndValidateMinScore(Map<String, Object> lowerBound) {
+        Object minScoreObj = lowerBound.get(PARAM_NAME_LOWER_BOUND_MIN_SCORE);
+        if (minScoreObj == null) {


A question here: if a user defines lower bounds, shouldn't we require them to provide a min score? Otherwise, is there any point in defining a lower bound?

I think that was one of the asks when we reviewed the design/RFC. We can use a default of 0.0, which gives reasonable results and simplifies the interface for the end user.

What's the point of defining a lower bound if we are directing it to the min score?

Not sure what you mean. I was referring to the scenario where the user sets just a lower bound; in that case we use a score of 0.0 as the min_score for that lower bound. The reasons are a simpler request syntax and relatively good defaults for entry-level users. A user may not know the optimal min_score for a lower bound, or may be fine with our defaults.

So in the scenario where the user defines a lower bound but doesn't pass a min score, it will fall back to the default value, right?
In that scenario, rather than using the default min value, shouldn't we use the original min score we get from the sub-query itself?
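The default behavior described by the author can be sketched as follows. This is an illustration under assumptions, not the body of extractAndValidateMinScore from the diff: the class and method names, the parsing approach and the error messages are hypothetical, while the min_score key and the 0.0 default come from this thread and the PR description.

```java
import java.util.Map;

// Hypothetical sketch of reading min_score with a default; not the plugin's actual code.
final class LowerBoundMinScoreSketch {
    static final String PARAM_MIN_SCORE = "min_score"; // key used in the request example
    static final float DEFAULT_MIN_SCORE = 0.0f;       // default mentioned in the thread

    static float extractMinScore(Map<String, Object> lowerBound) {
        Object minScoreObj = lowerBound.get(PARAM_MIN_SCORE);
        if (minScoreObj == null) {
            // The user defined a lower bound without a min score: use the default.
            return DEFAULT_MIN_SCORE;
        }
        final float minScore;
        try {
            minScore = Float.parseFloat(String.valueOf(minScoreObj));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("min_score must be a valid float value", e);
        }
        // Float.parseFloat accepts "NaN" and "Infinity", so reject them explicitly.
        if (Float.isNaN(minScore) || Float.isInfinite(minScore)) {
            throw new IllegalArgumentException("min_score must be a finite float value");
        }
        return minScore;
    }
}
```

The alternative raised in the last question, falling back to the sub-query's own min score, would replace the DEFAULT_MIN_SCORE branch with a value supplied by the caller.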

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

owaiskazi19 · 2025-02-27T19:19:42Z

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

+                    if (score < minScore) {
+                        return 0.0f;
+                    }
+                    if (maxScore < lowerBoundScore) {


Please check my previous response; this is the case where the user's lower bound is too high.

...a/org/opensearch/neuralsearch/processor/normalization/MinMaxScoreNormalizationTechnique.java

src/test/java/org/opensearch/neuralsearch/query/HybridQueryExplainIT.java

junqiu-lei

Overall LGTM.

owaiskazi19

Looks good overall, with a few questions.

martin-gaievski added 3 commits February 17, 2025 14:30

Working draft with unit tests

9540799

Signed-off-by: Martin Gaievski <[email protected]>

Added integ test, adjust some calculations

b63f34f

Signed-off-by: Martin Gaievski <[email protected]>

Added check for number of elements in lower_bounds array

a983cfd

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski changed the title ~~Lower bounds for min-max normalization in hybrid query~~ [To Feature branch] Lower bounds for min-max normalization in hybrid query Feb 24, 2025

martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch from dbf2790 to 89999f9 Compare February 25, 2025 01:17

martin-gaievski commented Feb 25, 2025

martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch 8 times, most recently from c1afd64 to 199798d Compare February 26, 2025 01:49

Added more validations and unit tests

70a59c0

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the feature/lower_bounds_for_minmax_normalization branch from 199798d to 70a59c0 Compare February 26, 2025 18:33

martin-gaievski changed the title ~~[To Feature branch] Lower bounds for min-max normalization in hybrid query~~ [Feature branch] Lower bounds for min-max normalization in hybrid query Feb 26, 2025

martin-gaievski marked this pull request as ready for review February 26, 2025 22:46

martin-gaievski requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, sean-zheng-amazon, model-collapse, zane-neo and vibrantvarun as code owners February 26, 2025 22:46

martin-gaievski requested review from zhichao-aws, yuye-aws and minalsha as code owners February 26, 2025 22:46

vibrantvarun reviewed Feb 26, 2025

martin-gaievski added 2 commits February 26, 2025 18:29

Refactor getter for lowerBounds after code review comments

de3f9d4

Signed-off-by: Martin Gaievski <[email protected]>

Changed syntax from negation to explicit comparision with false in if…

adaf0eb

… conditions Signed-off-by: Martin Gaievski <[email protected]>

vibrantvarun approved these changes Feb 27, 2025

martin-gaievski added the v3.0.0 v3.0.0 label Feb 27, 2025

owaiskazi19 reviewed Feb 27, 2025

Addressing review comments, run 2

87c2d3e

Signed-off-by: Martin Gaievski <[email protected]>

junqiu-lei reviewed Feb 27, 2025

junqiu-lei reviewed Feb 27, 2025

junqiu-lei approved these changes Feb 27, 2025

Adding integ test for case when lower_bound score is greater then act…

c6d179e

…ual max score Signed-off-by: Martin Gaievski <[email protected]>

owaiskazi19 reviewed Feb 28, 2025

owaiskazi19 approved these changes Feb 28, 2025

martin-gaievski merged commit 3a6eab3 into opensearch-project:feature/lower_bounds_for_minmax_normalization Feb 28, 2025
51 checks passed
