[FEATURE] Enhance Normalization Processor with Sequential Normalization Techniques #1179

Open
owaiskazi19 opened this issue Feb 7, 2025 · 0 comments

Is your feature request related to a problem?

In our hybrid search system, the normalization processor currently applies a single normalization technique to the scores it receives. The existing implementation lets users specify exactly one normalization technique and one combination technique, as shown in the following example:

{
  "description": "Post-processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}
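To make the current behavior concrete, here is a minimal Python sketch of what a `min_max` normalization followed by an `arithmetic_mean` combination computes. It is a simplified illustration, not the plugin's implementation: the function names, the sample scores, and the handling of a constant score list are all assumptions.

```python
def min_max(scores):
    """Rescale scores to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumed convention for this sketch: a constant list maps to all 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def arithmetic_mean(score_lists):
    """Average each document's scores across the sub-queries."""
    return [sum(doc_scores) / len(doc_scores) for doc_scores in zip(*score_lists)]


# Hypothetical scores for the same three documents from two sub-queries.
lexical_scores = [12.0, 7.0, 2.0]
vector_scores = [1.0, 0.5, 0.75]

normalized = [min_max(lexical_scores), min_max(vector_scores)]
combined = arithmetic_mean(normalized)
print(combined)  # [1.0, 0.25, 0.25]
```

The sketch assumes every document appears in both sub-query results; a real processor also has to handle documents that are missing from one of them.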

While this approach has served us well, it may not be sufficient for more complex normalization requirements in certain use cases, such as the one described below.

Use Case:

E-commerce Product Search with Diverse Attributes
Scenario: Consider an e-commerce platform that sells a wide variety of products, from electronics to clothing to home goods. The search system needs to handle diverse product attributes and provide relevant results across different categories.
Problem: Different product attributes have vastly different scales and distributions:

* Price: Ranges from a few dollars to thousands of dollars
* User Ratings: Typically on a scale of 1 to 5 stars
* Number of Reviews: Can range from 0 to millions
* Product Age: Measured in days since the product was listed
* Sales Rank: A number indicating popularity, lower is better

Challenge: Using a single normalization technique doesn't adequately address the diverse nature of these attributes, leading to suboptimal search results.

Solution using Sequential Multi-Technique Normalization:

{
  "normalization-processor": [
    {
      "normalization": {
        "technique": "min_max"
      },
      "combination": {
        "technique": "arithmetic_mean"
      }
    },
    {
      "normalization": {
        "technique": "z_score"
      },
      "combination": {
        "technique": "geometric_mean"
      }
    }
  ]
}
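One way a parser could accept both the existing single-object form and the proposed array form is to treat a lone object as a one-stage list. The sketch below is only an illustration of that idea: `stage_techniques` is a hypothetical helper, and it assumes every stage spells out both its normalization and its combination technique.

```python
def stage_techniques(processor):
    """Return (normalization, combination) technique pairs, one per stage."""
    value = processor["normalization-processor"]
    # Assumption: a single object is treated as a list with one stage.
    stages = value if isinstance(value, list) else [value]
    return [
        (stage["normalization"]["technique"], stage["combination"]["technique"])
        for stage in stages
    ]


current_form = {
    "normalization-processor": {
        "normalization": {"technique": "min_max"},
        "combination": {"technique": "arithmetic_mean"},
    }
}

proposed_form = {
    "normalization-processor": [
        {
            "normalization": {"technique": "min_max"},
            "combination": {"technique": "arithmetic_mean"},
        },
        {
            "normalization": {"technique": "z_score"},
            "combination": {"technique": "geometric_mean"},
        },
    ]
}

print(stage_techniques(current_form))
print(stage_techniques(proposed_form))
```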

Step-by-step process:

* Min-Max Normalization:
    Brings all attributes to a 0-1 scale
    Helps in initial comparison across different scales

* Z-Score Normalization:
    Applied to the result of the previous step
    Accounts for the distribution of scores across products
    Helps identify how exceptional a product is compared to others
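The two normalization steps above can be sketched in Python for a single attribute. This is a simplified illustration under stated assumptions: the prices are made up, the helper names are hypothetical, and the z-score uses the population standard deviation.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention for this sketch: a constant list maps to all 1.0.
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical prices with one very expensive item.
prices = [20.0, 40.0, 60.0, 80.0, 1000.0]

scaled = min_max(prices)        # step 1: bring values to a 0-1 scale
standardized = z_score(scaled)  # step 2: apply z-score to the step 1 result

print(scaled)
print(standardized)
```

Note that a z-score is unaffected by a linear rescaling of its input, so applying it directly to the min-max output of the same list yields the same values as a z-score on the raw list. In the proposed configuration the two normalizations are separated by a combination step, which is where the sequence can produce a different result.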

Benefits of this approach:

Handling Outliers: The initial min-max normalization bounds every attribute to the same 0-1 range, so large raw values (like very high prices) cannot dominate through magnitude alone, while the subsequent z-score normalization accounts for the distribution of scores.

Balancing Different Scales: It effectively handles attributes with vastly different scales (e.g., price vs. star rating).

Improved Relevance: By applying different normalization and combination techniques sequentially, the system can provide more nuanced and relevant search results.

Flexibility: This approach allows for fine-tuning the search algorithm without changing the underlying data or search implementation.

Example Outcome: A user searching for "high-quality camera" might get results that balance high user ratings, a large number of reviews, competitive pricing, and recent release dates, even though these attributes are on very different scales originally.

What solution would you like?

To provide more sophisticated and flexible normalization, I propose supporting sequential multi-technique normalization in the normalization processor. This enhancement would allow users to specify multiple normalization and combination techniques, applied in the order listed.

Here's a proposed structure for this enhanced normalization processor:

{
  "description": "Post-processor for hybrid search with sequential normalization",
  "phase_results_processors": [
    {
      "normalization-processor": [
        {
          "normalization": {
            "technique": "min_max"
          },
          "combination": {
            "technique": "arithmetic_mean"
          }
        },
        {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "geometric_mean"
          }
        }
      ]
    }
  ]
}

In this example, the data would first undergo min-max normalization followed by arithmetic mean combination, and then the results would be further normalized using L2 normalization followed by geometric mean combination.
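The sequence described above can be sketched in Python as a loop over stages. This is one possible reading of the proposal, not a definitive design: `run_stages`, the lookup tables, and the sample scores are hypothetical, and the sketch assumes each stage after the first receives the single combined score list produced by the stage before it.

```python
import math


def min_max(scores):
    """Rescale scores to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # assumed convention for a constant list
    return [(s - lo) / (hi - lo) for s in scores]


def l2(scores):
    """Divide each score by the Euclidean norm of the list."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm else list(scores)


def arithmetic_mean(score_lists):
    return [sum(col) / len(col) for col in zip(*score_lists)]


def geometric_mean(score_lists):
    return [math.prod(col) ** (1 / len(col)) for col in zip(*score_lists)]


NORMALIZERS = {"min_max": min_max, "l2": l2}
COMBINERS = {"arithmetic_mean": arithmetic_mean, "geometric_mean": geometric_mean}


def run_stages(score_lists, stages):
    """Apply each stage's normalization and combination in order."""
    for stage in stages:
        normalize = NORMALIZERS[stage["normalization"]["technique"]]
        combine = COMBINERS[stage["combination"]["technique"]]
        combined = combine([normalize(scores) for scores in score_lists])
        # Assumption: the next stage sees only this one combined list.
        score_lists = [combined]
    return score_lists[0]


stages = [
    {"normalization": {"technique": "min_max"},
     "combination": {"technique": "arithmetic_mean"}},
    {"normalization": {"technique": "l2"},
     "combination": {"technique": "geometric_mean"}},
]

# Hypothetical scores for the same three documents from two sub-queries.
final_scores = run_stages([[12.0, 7.0, 2.0], [1.0, 0.5, 0.75]], stages)
print(final_scores)
```

Under this reading the second stage combines a single list, so its geometric mean leaves the L2-normalized scores unchanged. Whether later stages should instead receive per-sub-query scores is a design question the sketch leaves open.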

