
Reduce ranking for short (Slack) chunks w/ little information content #4098

Open
wants to merge 7 commits into main

Conversation

joachim-danswer
Contributor

Description

This PR addresses the issue of short chunks, particularly Slack messages, ranking very high in keyword-centric searches despite carrying little information.

The approach consists of:

  1. Remove the 'title=""' override in the Slack Document definition, so the title defaults to the semantic identifier.
  2. Introduce an Information Content Model that evaluates the information content of chunks shorter than 10 words.

For 2) specifically, we introduce a new Vespa attribute 'aggregated_boost_factor', which defaults to 1 and initially represents only the impact of the information content classification. (Over time, more factors will feed into this boost factor.)

  • During indexing, if the new parameter 'USE_CONTENT_CLASSIFICATION' is true (the default):

    • run the model on all chunks shorter than 10 words. The model is a SetFit classification model with temperature (a new env parameter).
    • map the resulting score through the new env parameters INDEXING_CONTENT_CLASSIFICATION_MIN/INDEXING_CONTENT_CLASSIFICATION_MAX to compute the aggregated_boost_factor.
    • store the result in Vespa.
  • During querying:

    • 'aggregated_boost_factor' is used as a multiplier in the ranking score.
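The indexing-time mapping above can be sketched as follows. This is a hedged illustration, not the PR's code: the linear interpolation, the example MIN/MAX values (0.5 and 1.0), and the function name `compute_aggregated_boost_factor` are all assumptions.

```python
# Assumed example values for the new env parameters; the real defaults live in the PR.
INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN = 0.5
INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MAX = 1.0


def compute_aggregated_boost_factor(model_score: float) -> float:
    """Map a [0, 1] information-content score into the [MIN, MAX] boost range."""
    lo = INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN
    hi = INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MAX
    return lo + model_score * (hi - lo)


# An uninformative short chunk is down-weighted; an informative one keeps full weight.
assert compute_aggregated_boost_factor(0.0) == 0.5
assert compute_aggregated_boost_factor(1.0) == 1.0
```

Since the factor is a query-time multiplier, values below 1 push low-information chunks down the ranking without filtering them out entirely.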

How Has This Been Tested?

Not fully tested yet, as the model has not been uploaded; tests so far were conducted with dummy outputs.
REQUIRES FULL TESTING

Backporting (check the box to trigger backport action)

Note: check that the backport action passes; otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented Feb 23, 2025

internal-search: ✅ Ready. Updated Feb 23, 2025 10:05pm (UTC)

@evan-danswer (Contributor) left a comment

Will wait on approval until E2E tests are done, hope the comments aren't too onerous in the meantime! :)

_INFORMATION_CONTENT_MODEL: SetFitModel | None = None

_INFORMATION_CONTENT_MODEL_PROMPT_PREFIX: str = (
    "Does this sentence have very specific information: "  # spec to model version!
)  # TODO: add version once we have proper model

Contributor: What does the comment mean?

information_content_model = get_local_information_content_model()
information_content_model.device

Contributor: remove this line?

if prob < 0.25:
    raw_score = 0.0
elif prob < 0.75:
    raw_score = min(1.0, (prob - 0.25) / 0.5)
else:
    raw_score = 1.0
return (
    INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN

Contributor: Is the min() necessary? Is python floating arithmetic that bad :'(

Contributor: I wonder if just INFORMATION_CONTENT_MIN is descriptive enough? If you agree, might make it easier to read
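On the min() question: a quick spot-check (a sketch, with the piecewise mapping copied from the diff above) suggests `(prob - 0.25) / 0.5` stays strictly below 1.0 for every float `prob < 0.75`, so the min() appears to be only a defensive clamp:

```python
import math


def raw_score(prob: float) -> float:
    # Piecewise mapping from the PR: clamp low-confidence probs to 0,
    # rescale the [0.25, 0.75) band linearly, saturate at and above 0.75.
    if prob < 0.25:
        return 0.0
    elif prob < 0.75:
        return min(1.0, (prob - 0.25) / 0.5)
    return 1.0


# Check the boundaries, including the largest float strictly below 0.75.
just_below = math.nextafter(0.75, 0.0)
assert raw_score(0.25) == 0.0
assert raw_score(0.5) == 0.5
assert raw_score(just_below) < 1.0  # the min() never triggers here
assert raw_score(0.75) == 1.0
```

Division by 0.5 is exact in binary floating point (it is a power of two), which is why the rescale cannot round up to 1.0 for inputs below 0.75.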

]

# output_classes = [1] * len(text_inputs)
# output_scores = [0.9] * len(text_inputs)
Contributor: remove or comment why they're useful to have

failures: list[ConnectorFailure] = []

for chunk in chunks:
    if len(chunk.content.split()) <= 10:

Contributor: use the constant here to replace the 10

Contributor: I prefer having the short part of the if-else come first with a continue at the end, which lets you unindent the rest of the body. i.e. in this case

# SHORT_CHUNK_THRESH defined as 10 in the constants file
if len(chunk.content.split()) > SHORT_CHUNK_THRESH:
    chunk_content_scores.append(1.0)
    chunks_with_scores.append(chunk)
    continue

try:
    chunk_content_scores.append(
...

        )[0][1]
    )
    chunks_with_scores.append(chunk)
except Exception as e:

Contributor: Would prefer excepting more constrained error types, but if it's really justified...

    logger.exception(
        f"Error predicting content classification for chunk: {e}. Adding to missed content classifications."
    )
    # chunk_content_scores.append(1.0)

Contributor: delete this

def predict(
    self,
    queries: list[str],
) -> list[tuple[int, float]]:

Contributor: If possible it would be nice to use a BaseModel here instead of tuple[int, float].
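A sketch of what the suggested model type could look like. The reviewer asks for a Pydantic BaseModel; a stdlib dataclass stands in here so the example is self-contained, and all names (`ContentClassificationPrediction`, its fields, the dummy `predict`) are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ContentClassificationPrediction:
    # Hypothetical field names; the current code returns (class, probability) tuples.
    predicted_class: int
    probability: float


def predict(queries: list[str]) -> list[ContentClassificationPrediction]:
    # Dummy stand-in for the SetFit model call, mirroring the PR's dummy outputs.
    return [
        ContentClassificationPrediction(predicted_class=1, probability=0.9)
        for _ in queries
    ]


preds = predict(["hello team", "shipping v2 on Friday"])
```

Named fields make call sites like `pred.probability` self-documenting, versus remembering which tuple slot holds the score.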

title_embedding=list(model.encode(f"search_document: {overview_title}")),
content_embedding=list(
    model.encode(f"search_document: {overview_title}\n{overview}")
),

Contributor: Why do you have to modify document seeding?
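For context on the seeding lines above: the change prepends the title to the content before embedding, with both inputs carrying the "search_document: " prefix. A minimal sketch with a stand-in encoder (the `encode` function and sample strings are hypothetical; the real `model.encode` returns a dense vector):

```python
def encode(text: str) -> list[float]:
    # Stand-in for model.encode; a real embedding model returns a dense vector.
    return [float(len(text))]


overview_title = "Use Cases"
overview = "Search across all of your team's knowledge."

title_embedding = list(encode(f"search_document: {overview_title}"))
content_embedding = list(encode(f"search_document: {overview_title}\n{overview}"))
```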
