Introduction
This document describes the design details of a lower bound for the min-max normalization technique in hybrid query. This feature has been requested through GitHub issues #150 and #299.
Overview
Hybrid search combines multiple query types, like keyword and neural search, to improve search relevance. In version 2.11 the team released hybrid query as part of the neural-search plugin. The main responsibility of the hybrid query is to return combined scores from multiple queries. In the common scenario those queries represent different search types, like lexical and semantic.
Hybrid query uses multiple techniques for preparing the final list of matched documents; the two main types are score-based normalization and rank-based combination. For score-based normalization, the most effective technique is min-max normalization. In the scope of this proposal we want to improve the search relevance of min-max normalization by allowing a lower bound to be set.
Problem Statement
The min-max normalization technique is based on the maximum and minimum scores from all matched documents, using the following formula:
normalizedScore = (score - minScore) / (maxScore - minScore);
In the context of OpenSearch, finding the minimum score relies on an assumption that may not be the most effective one. While handling a search request, the system retrieves a limited number of matching documents from each shard; this limit is defined by the query parameter size. The minimum score is identified as the minimum across all scores of all collected documents. In case the overall number of matching documents is much higher than the number of retrieved documents, the delta between the real and the retrieved minimum scores can be significant. This will negatively influence the final normalized score.
The following graphs illustrate the described scenario: in shard 1 the retrieved min score is 4.0, while the actual lower bound is 0.0. Similarly, for shard 2 the retrieved and actual lower bound scores are 2.0 and 1.0.
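To make the effect concrete, here is a small sketch with hypothetical scores mirroring the shard 1 numbers above (retrieved minimum 4.0 versus actual minimum 0.0); the class and values are illustrative, not part of the plugin:

```java
public final class MinMaxExample {
    // Plain min-max normalization, as used by the normalization processor.
    static float normalize(float score, float minScore, float maxScore) {
        return (score - minScore) / (maxScore - minScore);
    }

    public static void main(String[] args) {
        // The same document score of 5.0 (max score 10.0) normalizes very
        // differently depending on which minimum is used.
        System.out.println(normalize(5.0f, 0.0f, 10.0f)); // 0.5 with the actual minimum
        System.out.println(normalize(5.0f, 4.0f, 10.0f)); // ~0.167 with the retrieved minimum
    }
}
```

The second value is pushed toward the bottom of the range only because the true minimum was never collected.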
Requirements
Functional Requirements
We want to introduce a lower limit that reflects the actual minimum score for matching documents.
The user defines the lower bound score for each sub-query. It should be possible to have a mixed configuration, where some sub-queries use the lower bound and some still use today's approach.
The lower bound score applies to the existing min-max normalization technique in the hybrid query/normalization processor. For other techniques we should block this feature.
All interface changes should be minimal, backward compatible, and aligned with existing conventions. The new change should not replace the default behavior.
Should support, and not collide with, existing hybrid query features, e.g. pagination, explain, etc.
Non-functional requirements
Minimal performance regression in hybrid query: no more than 2% in latency and 2% in RAM consumption for a coordinator-role node.
Current state
Today normalization in hybrid query is performed by the normalization processor, which is a phase results processor. A search pipeline with this processor can be defined with the following request; see more details on how the normalization processor can be configured here.
The min-max technique uses the following formula:
normalizedScore = (score - minScore) / (maxScore - minScore);
where minScore is min(scores[0 ... size)) and maxScore is max(scores[0 ... size))
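For reference, a sketch of such a pipeline definition, based on the documented normalization-processor API (the pipeline name and weight values are illustrative):

```json
PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.3, 0.7] }
        }
      }
    }
  ]
}
```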
The min-max technique uses all matched documents from all shards to find maximum and minimum scores. OpenSearch retrieves up to size documents from each shard; in most scenarios they are sorted in descending order, and documents with lower scores are dropped. While the maximum score used by the processor belongs to one of the retrieved documents, the real minimum score can be outside of the retrieved subset of documents.
The following table shows one example of such a scenario, where size for the query is set to 5 and there are more than size matching documents in each shard.
| Query params | Nodes | DocIds | BM-25 scores | K-NN scores | BM25 collected docIds/scores | K-NN collected docIds/scores |
|---|---|---|---|---|---|---|
| size = 5 | Data Node-1 | d1 | 30 | 1.5 | d5 - 80 | d3 - 5 |
|  |  | d2 | 25 | 2.5 | d1 - 30 | d5 - 3 |
|  |  | d3 |  | 5 | d2 - 25 | d2 - 2.5 |
|  |  | d4 | 1 |  | d4 - 1 | d1 - 1.5 |
|  |  | d5 | 80 | 3 |  |  |
|  | Data Node-2 | d6 |  | 2 | d10 - 100 | d8 - 4.2 |
|  |  | d7 | 70 | 1.2 | d7 - 70 | d9 - 3.3 |
|  |  | d8 |  | 4.2 |  | d10 - 2.7 |
|  |  | d9 |  | 3.3 |  | d6 - 2 |
|  |  | d10 | 100 | 2.7 |  | d7 - 1.2 |
Coordinator Node

| Global BM-25 | Global Norm BM-25 | Global KNN | Global Norm KNN | Global combined results | Global combined sorted results |
|---|---|---|---|---|---|
| d10 - 100 | 1.00 | d3 - 5 | 1.000 | d1: 0.0975 | d10: 0.7125 |
| d5 - 80 | 0.73 | d8 - 4.2 | 0.800 | d2: 0.1875 | d5: 0.6150 |
| d7 - 70 | 0.60 | d9 - 3.3 | 0.575 | d3: 0.5 | d3: 0.5 |
| d1 - 30 | 0.07 | d5 - 3 | 0.500 | d4: 0.0 | d8: 0.4 |
| d2 - 25 | 0.00 | d10 - 2.7 | 0.425 | d5: 0.6150 | d7: 0.3250 |
|  |  | d2 - 2.5 | 0.375 | d6: 0.125 | d9: 0.2875 |
|  |  | d6 - 2.0 | 0.250 | d7: 0.3250 | d2: 0.1875 |
|  |  | d1 - 1.5 | 0.125 | d8: 0.4 | d6: 0.125 |
|  |  | d7 - 1.2 | 0.050 | d9: 0.2875 | d1: 0.0975 |
|  |  | d4 - 1.0 | 0.000 | d10: 0.7125 | d4: 0.0 |
| min/max 25/100 |  | min/max 1.0/5 |  |  |  |
Two major issues are:
documents with the min score (e.g. d4) are dropped from the final result despite having a non-zero score. In a relaxed version of the same problem, a document receives a low score even when it matches multiple different sub-queries (e.g. d2)
the normalized score does not reflect the scale of the actual difference between min and max scores. In the example table the BM25 min score is 1/4 of the max score, and for KNN it's 1/5. After normalization both are mapped onto the same 0.0 to 1.0 scale. For the user such behavior may be counterintuitive, because the min score document in the KNN results is "closer" to the top document than the min score document in the BM25 results.
Challenges
Retrieve actual lower bound score for the query
This is a problem when the number of matched documents at the shard level is greater than size. In this case the actual min score will not be part of the document set, and if the number of matched documents is higher than the max size limit it will not even be collected. An example is the k-NN query, where for a typical configuration any document will have some positive score.
We can perform an exhaustive search and retrieve all matching documents at the shard level. The problem with this approach is performance degradation: for big datasets latency can degrade drastically, and memory consumption can be high. Based on these considerations we recommend avoiding exhaustive retrieval.
How to deal with documents that have a score lower than the lower bound
If we use any lower bound value that is not based on actual data, a document may end up with a score that is lower than the lower bound.
There are multiple ways to address this; we can:
use the lower bound for all scores that are lower
drop such documents
keep the document scores but penalize them using certain techniques like decay
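The three options above can be sketched as follows; the helper, enum names, and decayRate parameter are illustrative, not part of the proposed API:

```java
import java.util.OptionalDouble;

public final class BelowBoundHandling {
    enum Mode { CLIP, DROP, DECAY }

    // Sketch of the three options for a score that falls below the lower bound.
    // Scores at or above the bound pass through untouched.
    static OptionalDouble handle(float score, float lowerBound, Mode mode, float decayRate) {
        if (score >= lowerBound) {
            return OptionalDouble.of(score);
        }
        switch (mode) {
            case CLIP:
                return OptionalDouble.of(lowerBound);        // raise the score to the bound
            case DROP:
                return OptionalDouble.empty();               // remove the document entirely
            default:
                return OptionalDouble.of(score * decayRate); // penalize, e.g. a fixed decay rate
        }
    }
}
```

For example, a score of 1.0 against a bound of 2.0 becomes 2.0 under CLIP, disappears under DROP, and shrinks to 0.5 under DECAY with a rate of 0.5.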
Solution Overview
Implementation of the lower bound score is straightforward in the context of calculations; with our change the calculation will look like this:
float normalizedScore = (score - customMinScore) / (maxScore - customMinScore);
Essentially we replace the actual minScore with a user-provided number. The value for the minimum score can be set as a new parameter of the normalization technique. We can use a format that is based on the position (index) of the sub-query, similar to the existing weights parameter.
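A sketch of that positional convention: the weights array already maps entries to sub-queries by index, and the proposed lower_bounds array would follow the same pattern (values are illustrative):

```json
"normalization": {
  "technique": "min_max",
  "parameters": {
    "lower_bounds": [
      { "mode": "apply", "min_score": 0.1 },
      { "mode": "clip",  "min_score": 0.0 }
    ]
  }
},
"combination": {
  "technique": "arithmetic_mean",
  "parameters": { "weights": [0.3, 0.7] }
}
```

Here the first lower_bounds entry applies to the first sub-query, the second to the second, just like weights.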
These changes will be implemented at the phase result processor level. This component is responsible for running computations when the min-max technique is set up by the user. The following diagram shows the high-level components and the specific location where the change needs to be made.
To avoid confusion with the existing OpenSearch feature of the same name, min_score, I suggest we pick a different name. The recommended name is lower_bound.
Expert level configuration for lower bound
The configuration of the lower bound for the min-max normalization technique is considered an expert-level task. This feature is part of the search pipeline setup and should be used with caution, as improper configuration can lead to less relevant hybrid scores. Determining the optimal lower bound value requires a deep understanding of the data distribution and the specific search use case. It involves analyzing the score distribution and experimenting with different values to find the most effective lower bound. Incorrect configuration can significantly impact the relevance of the search results, and the computation of the lower bound may introduce additional latency and resource consumption. Users should be aware of these potential impacts and monitor their systems accordingly.
To give users maximum flexibility we are going to allow configuring lower_bounds at the sub-query level, and skipping it where it's not needed.
Option 1: Configurable score retention or clipping [Recommended]
Pros:
better relevancy metrics (NDCG)
simple and efficient in terms of resource utilization
Cons:
more new parameters in processor definition
It's possible that actual shard-level scores are lower than the lower bound score we defined. In this case we can do one of the following actions:
return the actual score
drop the actual score and return the score defined as the lower bound (vanilla clipping)
The following graphs illustrate how the lower bound works when:
all scores are greater than the lower bound
some scores are lower than the lower bound, and how clipping can be applied
without penalizing scores
with score penalization
I have done a POC and collected NDCG metric values for several datasets; the following table shows these results.
Summary of experiments

baseline

| dataset | NDCG@5 | NDCG@10 | NDCG@100 |
|---|---|---|---|
| trec-covid | 0.6602 | 0.6002 | 0.3427 |
| nfcorpus | 0.3627 | 0.3285 | 0.2946 |
| arguana | 0.4504 | 0.4987 | 0.4683 |
| fiqa | 0.2645 | 0.292 | 0.3565 |

min score with clipping (Δ vs baseline)

| dataset | NDCG@5 | NDCG@10 | NDCG@100 | Δ@5 | Δ@10 | Δ@100 |
|---|---|---|---|---|---|---|
| trec-covid | 0.7279 | 0.6727 | 0.4876 | 0.0677 | 0.0725 | 0.1449 |
| nfcorpus | 0.365 | 0.329 | 0.2881 | 0.0023 | 0.0005 | -0.0065 |
| arguana | 0.4306 | 0.4801 | 0.5274 | -0.0198 | -0.0186 | 0.0591 |
| fiqa | 0.2519 | 0.2726 | 0.3292 | -0.0126 | -0.0194 | -0.0273 |

AVG Δ: 0.02023

min score with fixed decay penalty (Δ vs baseline)

| dataset | NDCG@5 | NDCG@10 | NDCG@100 | Δ@5 | Δ@10 | Δ@100 |
|---|---|---|---|---|---|---|
| trec-covid | 0.7372 | 0.6798 | 0.4916 | 0.077 | 0.0796 | 0.1489 |
| nfcorpus | 0.348 | 0.3142 | 0.2836 | -0.0147 | -0.0143 | -0.011 |
| arguana | 0.441 | 0.4877 | 0.5323 | -0.0094 | -0.011 | 0.064 |
| fiqa | 0.2547 | 0.2752 | 0.3336 | -0.0098 | -0.0168 | -0.0229 |

AVG Δ: 0.02163

min score with adaptive decay penalty (Δ vs baseline)

| dataset | NDCG@5 | NDCG@10 | NDCG@100 | Δ@5 | Δ@10 | Δ@100 |
|---|---|---|---|---|---|---|
| trec-covid | 0.7408 | 0.6922 | 0.5052 | 0.0806 | 0.092 | 0.1625 |
| nfcorpus | 0.3653 | 0.3318 | 0.2965 | 0.0026 | 0.0033 | 0.0019 |
| arguana | 0.4523 | 0.5004 | 0.544 | 0.0019 | 0.0017 | 0.0757 |
| fiqa | 0.271 | 0.2946 | 0.3573 | 0.0065 | 0.0026 | 0.0008 |

AVG Δ: 0.03601

min score with unaltered score retention (Δ vs baseline)

| dataset | NDCG@5 | NDCG@10 | NDCG@100 | Δ@5 | Δ@10 | Δ@100 |
|---|---|---|---|---|---|---|
| trec-covid | 0.7405 | 0.693 | 0.5054 | 0.0803 | 0.0928 | 0.1627 |
| nfcorpus | 0.3674 | 0.3346 | 0.297 | 0.0047 | 0.0061 | 0.0024 |
| arguana | 0.4533 | 0.501 | 0.5451 | 0.0029 | 0.0023 | 0.0768 |
| fiqa | 0.271 | 0.2946 | 0.3573 | 0.0065 | 0.0026 | 0.0008 |

AVG Δ: 0.03674
See appendix for detailed dataset statistics.
Based on the data from the POC, we recommend the solution that uses the lower bound for normalization and keeps the actual score when it is lower than the lower bound. The recommendation is to make this approach the default, with the clipping mode as an option.
The solution with a decay function based on IQR gives similar results, but is more computationally intense, so it's not recommended.
API changes
The new feature needs to be configurable by the user. That can be done via the technique parameters for the processor as part of the search pipeline definition.
Parameter details

| Parameter | Type | Description | Default |
|---|---|---|---|
| lower_bounds | array of objects | Structure that holds the lower bound details for each sub-query. The number of items must match the number of sub-queries. | empty array |
| mode | string | Controls how the lower bound should be applied to the sub-query. Possible values are: apply - use min_score for normalization but do not replace original scores; clip - replace scores below the lower bound with the min_score value; ignore - do not apply lower_bound at all for this sub-query. | apply |
| min_score | float | Sets the actual value for the score lower bound. Allowed values are from -10,000 to 10,000. | 0.0 |
For completeness, the following request shows the processor with all defaults, meaning we will apply the lower bound with a min_score of 0.0 for all sub-queries.
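A sketch of what that all-defaults request could look like, assuming two sub-queries (the pipeline name is illustrative; the lower_bounds shape follows the proposed parameters):

```json
PUT /_search/pipeline/nlp-search-pipeline
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max",
          "parameters": {
            "lower_bounds": [
              { "mode": "apply", "min_score": 0.0 },
              { "mode": "apply", "min_score": 0.0 }
            ]
          }
        },
        "combination": { "technique": "arithmetic_mean" }
      }
    }
  ]
}
```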
Option 2: Clipping
We can just clip the low scores, meaning return the lower bound score if the actual score is less than the lower bound.
Pros:
simple, fewer changes in the interface and a smaller learning curve for users (but they still need to provide the value for the lower bound score)
less compute intense (need to run benchmarks for exact numbers, presumably within 1-2%)
Cons:
lower relevancy metrics (NDCG) because it ignores the actual data distribution
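Under this option every sub-query would use clip mode; a sketch of the corresponding technique configuration, assuming two sub-queries (parameter shape follows the proposed lower_bounds format):

```json
"normalization": {
  "technique": "min_max",
  "parameters": {
    "lower_bounds": [
      { "mode": "clip", "min_score": 0.0 },
      { "mode": "clip", "min_score": 0.0 }
    ]
  }
}
```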
Option 3: Clipping with IQR based decay
We can address low scores by keeping them but applying a penalty to scores that fall below the lower bound.
There are many options for applying a penalty to the low scores; some of the most popular and promising are:
fixed rate of decay
pros:
simple and computationally efficient
cons:
ignores the actual scores distribution
based on standard deviation
pros:
depends on actual scores distribution
cons:
sensitive to outliers
extra compute
based on interquartile range (IQR)
pros:
depends on the actual score distribution
robust to outliers, skewed, and otherwise unusual distributions
intuitive (represents the middle 50% of the data)
cons:
extra compute
based on median absolute deviation (MAD)
pros:
depends on actual scores distribution
robust to outliers and unusual distributions
cons:
more computationally intense compared to IQR
I have done a POC and collected NDCG metric values for several datasets; see the summary of experiments table above for exact numbers. The most promising approach is the one based on IQR.
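An IQR-based penalty could look like the following sketch; the quartile method (simple nearest-rank) and the exact penalty shape are illustrative assumptions, since the proposal does not fix a formula:

```java
import java.util.Arrays;

public final class IqrDecay {
    // Nearest-rank quartile over a pre-sorted array, e.g. q = 0.25 or 0.75.
    static float quartile(float[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    // Penalizes a below-bound score proportionally to how far it sits
    // below the bound, measured in units of the interquartile range.
    static float penalize(float score, float lowerBound, float[] scores) {
        if (score >= lowerBound) {
            return score; // in range, no penalty
        }
        float[] sorted = scores.clone();
        Arrays.sort(sorted);
        float iqr = quartile(sorted, 0.75) - quartile(sorted, 0.25);
        if (iqr == 0.0f) {
            return lowerBound; // degenerate distribution: fall back to clipping
        }
        float distance = (lowerBound - score) / iqr; // gap below the bound, in IQR units
        return score / (1.0f + distance);            // larger gaps decay more strongly
    }
}
```

Because the penalty scales with the IQR, the same absolute gap is punished less in a widely spread score distribution than in a tight one.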
Pros:
good relevancy metrics (NDCG), takes into account data distribution
Cons:
more compute intense (need to run benchmarks for exact numbers, presumably within 1-2%)
less intuitive for customers, which can set a higher barrier for feature adoption
User scenarios
The following data show how every solution option affects the final score. Initial score values are taken from the table above that shows how scores are calculated today.
Lower bound is lower than the actual minimal score

Lower bound [0.0, 0.0]

Coordinator Node

| Global BM-25 | Global Norm BM-25 | Global KNN | Global Norm KNN | Global combined results | Global combined sorted results | No lower bound (reference) |
|---|---|---|---|---|---|---|
| d10 - 100 | 1.00 | d3 - 5 | 1.00 | d1: 0.3 | d10: 0.77 | d10: 0.7125 |
| d5 - 80 | 0.80 | d8 - 4.2 | 0.84 | d2: 0.375 | d5: 0.7 | d5: 0.6150 |
| d7 - 70 | 0.70 | d9 - 3.3 | 0.66 | d3: 0.5 | d3: 0.5 | d3: 0.5 |
| d1 - 30 | 0.30 | d5 - 3 | 0.60 | d4: 0.1 | d7: 0.47 | d8: 0.4 |
| d2 - 25 | 0.25 | d10 - 2.7 | 0.54 | d5: 0.7 | d8: 0.42 | d7: 0.3250 |
|  |  | d2 - 2.5 | 0.50 | d6: 0.2 | d2: 0.375 | d9: 0.2875 |
|  |  | d6 - 2.0 | 0.40 | d7: 0.47 | d9: 0.33 | d2: 0.1875 |
|  |  | d1 - 1.5 | 0.30 | d8: 0.42 | d1: 0.3 | d6: 0.125 |
|  |  | d7 - 1.2 | 0.24 | d9: 0.33 | d6: 0.2 | d1: 0.0975 |
|  |  | d4 - 1.0 | 0.20 | d10: 0.77 | d4: 0.1 | d4: 0.0 |
| min/max 0/100 |  | min/max 0.0/5 |  |  |  |  |
Lower bound is higher than the min score, clipping enabled

Lower bound [30.0, 2.0]

Coordinator Node

| Global BM-25 | Global Norm BM-25 | Global KNN | Global Norm KNN | Global combined results | Global combined sorted results | No lower bound (reference) |
|---|---|---|---|---|---|---|
| d10 - 100 | 1.00 | d3 - 5 | 1.00 | d1: 0.0 | d10: 0.6165 | d10: 0.7125 |
| d5 - 80 | 0.71 | d8 - 4.2 | 0.73 | d2: 0.0835 | d5: 0.5215 | d5: 0.6150 |
| d7 - 70 | 0.57 | d9 - 3.3 | 0.43 | d3: 0.5 | d3: 0.5 | d3: 0.5 |
| d1 - 30 | 0.00 | d5 - 3 | 0.33 | d4: 0.0 | d8: 0.3665 | d8: 0.4 |
| d2 - 25 | 0.00 | d10 - 2.7 | 0.23 | d5: 0.5215 | d7: 0.2850 | d7: 0.3250 |
|  |  | d2 - 2.5 | 0.17 | d6: 0.0 | d9: 0.2165 | d9: 0.2875 |
|  |  | d6 - 2.0 | 0.00 | d7: 0.2850 | d2: 0.0835 | d2: 0.1875 |
|  |  | d1 - 1.5 | 0.00 | d8: 0.3665 | d1: 0.0 | d6: 0.125 |
|  |  | d7 - 1.2 | 0.00 | d9: 0.2165 | d6: 0.0 | d1: 0.0975 |
|  |  | d4 - 1.0 | 0.00 | d10: 0.6165 | d4: 0.0 | d4: 0.0 |
| min/max 30/100 |  | min/max 2.0/5 |  |  |  |  |
Lower bound is higher than the min score, penalize score with decay function

Lower bound [30.0, 2.0]

Decay rate based on standard deviation

Coordinator Node

| Global BM-25 | Global Norm BM-25 | Global KNN | Global Norm KNN | Global combined results | Global combined sorted results | No lower bound (reference) |
|---|---|---|---|---|---|---|
| d10 - 100 | 1.0 | d3 - 5 | 1.00 | d1: 0.287 | d10: 0.77 | d10: 0.7125 |
| d5 - 80 | 0.8 | d8 - 4.2 | 0.84 | d2: 0.3732 | d5: 0.7 | d5: 0.6150 |
| d7 - 70 | 0.7 | d9 - 3.3 | 0.66 | d3: 0.5 | d3: 0.5 | d3: 0.5 |
| d1 - 30 | 0.3 | d5 - 3 | 0.60 | d4: 0.077 | d7: 0.451 | d8: 0.4 |
| d2 - 25 | 0.246 | d10 - 2.7 | 0.54 | d5: 0.7 | d8: 0.42 | d7: 0.3250 |
|  |  | d2 - 2.5 | 0.50 | d6: 0.2 | d2: 0.3732 | d9: 0.2875 |
|  |  | d6 - 2.0 | 0.40 | d7: 0.451 | d9: 0.33 | d2: 0.1875 |
|  |  | d1 - 1.5 | 0.274 | d8: 0.42 | d1: 0.287 | d6: 0.125 |
|  |  | d7 - 1.2 | 0.202 | d9: 0.33 | d6: 0.2 | d1: 0.0975 |
|  |  | d4 - 1.0 | 0.154 | d10: 0.77 | d4: 0.077 | d4: 0.0 |
| standard deviation = 32.68, min/max 30/100 |  | standard deviation = 1.31, min/max 2.0/5 |  |  |  |  |
Decay rate based on interquartile range (IQR)

Coordinator Node

| Global BM-25 | Global Norm BM-25 | Global KNN | Global Norm KNN | Global combined results | Global combined sorted results | No lower bound (reference) |
|---|---|---|---|---|---|---|
| d10 - 100 | 1.0 | d3 - 5 | 1.00 | d1: 0.2890 | d10: 0.77 | d10: 0.7125 |
| d5 - 80 | 0.8 | d8 - 4.2 | 0.84 | d2: 0.3738 | d5: 0.7 | d5: 0.6150 |
| d7 - 70 | 0.7 | d9 - 3.3 | 0.66 | d3: 0.5 | d3: 0.5 | d3: 0.5 |
| d1 - 30 | 0.3 | d5 - 3 | 0.6 | d4: 0.083 | d7: 0.455 | d8: 0.4 |
| d2 - 25 | 0.2476 | d10 - 2.7 | 0.54 | d5: 0.7 | d8: 0.42 | d7: 0.3250 |
|  |  | d2 - 2.5 | 0.5 | d6: 0.2 | d2: 0.3738 | d9: 0.2875 |
|  |  | d6 - 2.0 | 0.4 | d7: 0.455 | d9: 0.33 | d2: 0.1875 |
|  |  | d1 - 1.5 | 0.278 | d8: 0.42 | d1: 0.289 | d6: 0.125 |
|  |  | d7 - 1.2 | 0.21 | d9: 0.33 | d6: 0.2 | d1: 0.0975 |
|  |  | d4 - 1.0 | 0.166 | d10: 0.77 | d4: 0.083 | d4: 0.0 |
| standard deviation = 32.68, min/max 30/100 |  | standard deviation = 1.31, min/max 2.0/5 |  |  |  |  |
Low Level Design
How to setup
We need minor adjustments in the factory class ScoreNormalizationFactory to read and parse the new parameters:
if lower_bounds is not present, skip the lower bound logic completely (today's behavior)
if lower_bounds is present, then read mode, applying the default if needed; read min_score, and if it's not there use the 0.0 default value
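A sketch of that parsing logic, with hypothetical names (the actual wiring inside ScoreNormalizationFactory will differ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class LowerBoundsParser {
    record LowerBound(String mode, float minScore) {}

    // Parses the optional lower_bounds parameter; an absent parameter keeps
    // today's behavior, and missing fields fall back to mode=apply, min_score=0.0.
    static List<LowerBound> parse(Map<String, Object> params) {
        Object raw = params.get("lower_bounds");
        if (raw == null) {
            return List.of(); // feature disabled, today's min score logic applies
        }
        List<LowerBound> result = new ArrayList<>();
        for (Object entry : (List<?>) raw) {
            Map<?, ?> m = (Map<?, ?>) entry;
            String mode = m.containsKey("mode") ? (String) m.get("mode") : "apply";
            float minScore = m.containsKey("min_score")
                ? ((Number) m.get("min_score")).floatValue()
                : 0.0f;
            result.add(new LowerBound(mode, minScore));
        }
        return result;
    }
}
```

Validation (e.g. rejecting unknown modes or a size mismatch with the number of sub-queries) would live in the same place.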
How to compute normalized scores
All logic-related changes will be done in the MinMaxScoreNormalizationTechnique class.
First we compute the minimum score depending on the mode flag and the min_score limit value:
private float[] getMinScores(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) {
    float[] minScores = new float[numOfScores];
    Arrays.fill(minScores, Float.MAX_VALUE);
    for (CompoundTopDocs compoundQueryTopDocs : queryTopDocs) {
        if (Objects.isNull(compoundQueryTopDocs)) {
            continue;
        }
        List<TopDocs> topDocsPerSubQuery = compoundQueryTopDocs.getTopDocs();
        for (int j = 0; j < topDocsPerSubQuery.size(); j++) {
            if (applyLowerBounds) { // added logic: take the min score from the processor definition
                minScores[j] = lowerBoundMinScores.get(j);
            } else { // logic we have today
                minScores[j] = Math.min(
                    minScores[j],
                    Arrays.stream(topDocsPerSubQuery.get(j).scoreDocs)
                        .map(scoreDoc -> scoreDoc.score)
                        .min(Float::compare)
                        .orElse(Float.MAX_VALUE)
                );
            }
        }
    }
    return minScores;
}
Changes for single-score normalization should be done in the normalizeSingleScore method:
private float normalizeSingleScore(final float score, final float minScore, final float maxScore, LowerBound lowerBoundDTO) {
    if (Floats.compare(maxScore, minScore) == 0 && Floats.compare(maxScore, score) == 0) {
        return SINGLE_RESULT_SCORE;
    }
    if (!lowerBoundDTO.applyLowerBounds || lowerBoundDTO.mode == IGNORE) {
        // this is the logic we have today, no changes there
        float normalizedScore = (score - minScore) / (maxScore - minScore);
        return normalizedScore == 0.0f ? MIN_SCORE : normalizedScore;
    }
    float normalizedScore;
    if (lowerBoundDTO.mode == APPLY && score < minScore) {
        // if mode is apply then we return the actual document score;
        // with lower bounds it can be less than the min_score
        normalizedScore = score;
    } else if (lowerBoundDTO.mode == CLIP && score < minScore) {
        // alternative approach where we clip the score so it becomes the min score
        normalizedScore = minScore;
    } else {
        // this applies to most of the cases, when the score is greater than the min score
        normalizedScore = (score - minScore) / (maxScore - minScore);
    }
    return normalizedScore;
}
Potential Issues
Knowing the lower bound that gives the most relevant results can be challenging for a user. The existing logic provides decent results in general, so this parameter should be an expert-level setting rather than a default recommendation. We should think of some sort of heuristic to derive the most effective lower bound from the indexed data.
Metrics
Adding a specific metric is not possible at the moment; we should add one once the stats API for neural is ready. It's in the design phase, see #1104 and #1146. As per early reviews of the stats API (draft design), adding a new metric will be straightforward, as simple as making one call to a static method.
Backward Compatibility
The new solution is backward compatible with today's approach: if no details are specified for lower bounds, then the actual shard-level min score will be used.
Testability
New functionality should be covered by unit tests and integration tests. Unit tests will take care of the computation logic and edge cases in input data. Integration tests will cover the end-to-end flow; one test should be enough for a sanity check.
We need full-scale benchmarking to measure how this feature affects relevancy and resource utilization. Some benchmarks were done as part of the POC using 4 datasets; the average improvement of NDCG is 3.5%.
Feedback Required
We greatly value feedback from the community to ensure that this proposal addresses real-world use cases effectively. Here are a few specific points where your input would be particularly helpful:
Defaults for lower bounds
We plan to use defaults for the lower_bound feature: applying the lower bound score without a penalty and setting the default min_score to 0.0.
Are these defaults suitable for all query types?
Do you have any suggestions for alternative defaults?
Need for extra features
Should we consider adding extra features such as an upper_bound score?
What other features do you think would be beneficial?
Benefit of other techniques
Currently, we are adding the lower_bound feature to min-max normalization but not to L2 normalization.
Do you think it would be beneficial to add the lower_bound feature to L2 normalization as well?
Your insights will help us refine the proposal to better meet the needs of our users. Thank you for your valuable feedback!
martin-gaievski changed the title from [Draft][RFC] Lower bound for min-max normalization technique in Hybrid query to [RFC] Lower bound for min-max normalization technique in Hybrid query on Feb 18, 2025.