
Reduce ranking for short (Slack) chunks w/ little information content #4098

Open
wants to merge 7 commits into main

Conversation

joachim-danswer
Contributor

Description

This PR addresses the issue of short chunks, particularly Slack messages, ranking very high in keyword-centric searches despite carrying little information.

The approach consists of:

  1. Remove the 'title=""' override in the Slack Document definition, so the title defaults to the semantic identifier.
  2. Introduce an Information Content Model that evaluates the information content of chunks shorter than 10 words.

For 2) specifically, we introduce a new Vespa attribute 'aggregated_boost_factor', which defaults to 1 and initially represents only the impact of the information content classification. (Over time, more factors will feed into this boost factor.)

  • During indexing, if the new parameter 'USE_CONTENT_CLASSIFICATION' is true (the default):

    • run the model on all chunks shorter than 10 words. The model is a SetFit classification model with temperature (a new env parameter).
    • map the resulting score through the new env parameters INDEXING_CONTENT_CLASSIFICATION_MIN/INDEXING_CONTENT_CLASSIFICATION_MAX to compute the aggregated_boost_factor.
    • store the result in Vespa.
  • During querying:

    • 'aggregated_boost_factor' is used as a multiplier in the ranking score.
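The indexing-time mapping above can be sketched as follows. This is a hedged illustration, not the PR's code: the linear interpolation, the example MIN/MAX values (0.5 and 1.0), and the function name `compute_aggregated_boost_factor` are all assumptions.

```python
# Assumed example values for the new env parameters; the real defaults live in the PR.
INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN = 0.5
INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MAX = 1.0


def compute_aggregated_boost_factor(model_score: float) -> float:
    """Map a [0, 1] information-content score into the [MIN, MAX] boost range."""
    lo = INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN
    hi = INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MAX
    return lo + model_score * (hi - lo)


# An uninformative short chunk is down-weighted; an informative one keeps full weight.
assert compute_aggregated_boost_factor(0.0) == 0.5
assert compute_aggregated_boost_factor(1.0) == 1.0
```

Since the factor is a query-time multiplier, values below 1 push low-information chunks down the ranking without filtering them out entirely.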

How Has This Been Tested?

Not fully tested yet, as the model has not been uploaded; tests so far were conducted with dummy outputs.
REQUIRES FULL TESTING

Backporting (check the box to trigger backport action)

Note: check that the backport action passes; otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

vercel bot commented Feb 23, 2025

internal-search: ✅ Ready. Updated Feb 23, 2025 10:05pm (UTC)

@evan-danswer (Contributor) left a comment

Will wait on approval until E2E tests are done, hope the comments aren't too onerous in the meantime! :)

_INFORMATION_CONTENT_MODEL: SetFitModel | None = None

_INFORMATION_CONTENT_MODEL_PROMPT_PREFIX: str = (
    "Does this sentence have very specific information: "  # spec to model version!
)  # TODO: add version once we have proper model

Contributor: What does the comment mean?

information_content_model = get_local_information_content_model()
information_content_model.device

Contributor: remove this line?

if prob < 0.25:
    raw_score = 0.0
elif prob < 0.75:
    raw_score = min(1.0, (prob - 0.25) / 0.5)
else:
    raw_score = 1.0
return (
    INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN

Contributor: Is the min() necessary? Is python floating arithmetic that bad :'(

Contributor: I wonder if just INFORMATION_CONTENT_MIN is descriptive enough? If you agree, might make it easier to read
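On the min() question: a quick spot-check (a sketch, with the piecewise mapping copied from the diff above) suggests `(prob - 0.25) / 0.5` stays strictly below 1.0 for every float `prob < 0.75`, so the min() appears to be only a defensive clamp:

```python
import math


def raw_score(prob: float) -> float:
    # Piecewise mapping from the PR: clamp low-confidence probs to 0,
    # rescale the [0.25, 0.75) band linearly, saturate at and above 0.75.
    if prob < 0.25:
        return 0.0
    elif prob < 0.75:
        return min(1.0, (prob - 0.25) / 0.5)
    return 1.0


# Check the boundaries, including the largest float strictly below 0.75.
just_below = math.nextafter(0.75, 0.0)
assert raw_score(0.25) == 0.0
assert raw_score(0.5) == 0.5
assert raw_score(just_below) < 1.0  # the min() never triggers here
assert raw_score(0.75) == 1.0
```

Division by 0.5 is exact in binary floating point (it is a power of two), which is why the rescale cannot round up to 1.0 for inputs below 0.75.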

]

# output_classes = [1] * len(text_inputs)
# output_scores = [0.9] * len(text_inputs)
Contributor: remove or comment why they're useful to have

failures: list[ConnectorFailure] = []

for chunk in chunks:
    if len(chunk.content.split()) <= 10:

Contributor: use the constant here to replace the 10

Contributor: I prefer having the short part of the if-else come first with a continue at the end, which lets you unindent the rest of the body. i.e. in this case

# SHORT_CHUNK_THRESH defined as 10 in the constants file
if len(chunk.content.split()) > SHORT_CHUNK_THRESH:
    chunk_content_scores.append(1.0)
    chunks_with_scores.append(chunk)
    continue

try:
    chunk_content_scores.append(
...

        )[0][1]
    )
    chunks_with_scores.append(chunk)
except Exception as e:

Contributor: Would prefer excepting more constrained error types, but if it's really justified...

    logger.exception(
        f"Error predicting content classification for chunk: {e}. Adding to missed content classifications."
    )
    # chunk_content_scores.append(1.0)

Contributor: delete this

def predict(
    self,
    queries: list[str],
) -> list[tuple[int, float]]:

Contributor: If possible it would be nice to use a BaseModel here instead of tuple[int, float].
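A sketch of what the suggested model type could look like. The reviewer asks for a Pydantic BaseModel; a stdlib dataclass stands in here so the example is self-contained, and all names (`ContentClassificationPrediction`, its fields, the dummy `predict`) are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ContentClassificationPrediction:
    # Hypothetical field names; the current code returns (class, probability) tuples.
    predicted_class: int
    probability: float


def predict(queries: list[str]) -> list[ContentClassificationPrediction]:
    # Dummy stand-in for the SetFit model call, mirroring the PR's dummy outputs.
    return [
        ContentClassificationPrediction(predicted_class=1, probability=0.9)
        for _ in queries
    ]


preds = predict(["hello team", "shipping v2 on Friday"])
```

Named fields make call sites like `pred.probability` self-documenting, versus remembering which tuple slot holds the score.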

title_embedding=list(model.encode(f"search_document: {overview_title}")),
content_embedding=list(
    model.encode(f"search_document: {overview_title}\n{overview}")
),

Contributor: Why do you have to modify document seeding?
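For context on the seeding lines above: the change prepends the title to the content before embedding, with both inputs carrying the "search_document: " prefix. A minimal sketch with a stand-in encoder (the `encode` function and sample strings are hypothetical; the real `model.encode` returns a dense vector):

```python
def encode(text: str) -> list[float]:
    # Stand-in for model.encode; a real embedding model returns a dense vector.
    return [float(len(text))]


overview_title = "Use Cases"
overview = "Search across all of your team's knowledge."

title_embedding = list(encode(f"search_document: {overview_title}"))
content_embedding = list(encode(f"search_document: {overview_title}\n{overview}"))
```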
