Reduce ranking for short (Slack) chunks w/ little information content #4098
base: main
Conversation
Will wait on approval until E2E tests are done, hope the comments aren't too onerous in the meantime! :)
_INFORMATION_CONTENT_MODEL: SetFitModel | None = None

_INFORMATION_CONTENT_MODEL_PROMPT_PREFIX: str = (
    "Does this sentence have very specific information: "  # spec to model version!
What does the comment mean?
)  # TODO: add version once we have proper model
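Presumably the note means the prefix string is tied to the version of the model it was trained with, so the two must change together. A minimal sketch of how such a prefix is typically prepended to each input before classification (helper name hypothetical):

```python
_PROMPT_PREFIX = "Does this sentence have very specific information: "

def with_prompt_prefix(texts: list[str]) -> list[str]:
    # The prefix is part of how the classifier was trained, so it must
    # match the deployed model version exactly (hence the code comment).
    return [f"{_PROMPT_PREFIX}{text}" for text in texts]
```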
information_content_model = get_local_information_content_model()
information_content_model.device
remove this line?
if prob < 0.25:
    raw_score = 0.0
elif prob < 0.75:
    raw_score = min(1.0, (prob - 0.25) / 0.5)
Is the min() necessary? Is python floating arithmetic that bad :'(
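For reference, the ramp under discussion can be sketched standalone (function name hypothetical). Since the elif branch only runs for prob < 0.75, and dividing by 0.5 is exact in IEEE-754 doubles, the expression (prob - 0.25) / 0.5 appears to stay strictly below 1.0, which would make the min() defensive rather than required:

```python
def raw_information_score(prob: float) -> float:
    # Piecewise-linear map from classifier probability to [0, 1]:
    # 0 below 0.25, a linear ramp on [0.25, 0.75), 1.0 from 0.75 up.
    if prob < 0.25:
        return 0.0
    elif prob < 0.75:
        # prob < 0.75 plus exact division by 0.5 keeps this strictly
        # below 1.0, so the min() looks defensive rather than needed.
        return min(1.0, (prob - 0.25) / 0.5)
    else:
        return 1.0
```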
else:
    raw_score = 1.0
return (
    INDEXING_INFORMATION_CONTENT_CLASSIFICATION_MIN
I wonder if just INFORMATION_CONTENT_MIN is descriptive enough? If you agree, it might make this easier to read.
]
# output_classes = [1] * len(text_inputs)
# output_scores = [0.9] * len(text_inputs)
remove or comment why they're useful to have
failures: list[ConnectorFailure] = []

for chunk in chunks:
    if len(chunk.content.split()) <= 10:
use the constant here to replace the 10
I prefer having the short part of the if-else come first, with a continue at the end; this lets you unindent the rest of the body. I.e. in this case:

# SHORT_CHUNK_THRESH defined as 10 in the constants file
if len(chunk.content.split()) > SHORT_CHUNK_THRESH:
    chunk_content_scores.append(1.0)
    chunks_with_scores.append(chunk)
    continue
try:
    chunk_content_scores.append(
        ...
        )[0][1]
    )
    chunks_with_scores.append(chunk)
except Exception as e:
Would prefer excepting more constrained error types, but if it's really justified...
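As a sketch of what narrower handling could look like (the exception types and helper name are assumptions; the real model call may raise different errors):

```python
# Hypothetical narrower handling: catch only errors the model call is
# actually expected to raise, and let programming errors propagate.
def classify_or_default(model, text: str) -> float:
    try:
        return model.predict([text])[0][1]
    except (RuntimeError, ValueError, OSError) as e:
        # e.g. device/OOM issues, malformed input, model-file problems
        print(f"content classification failed, defaulting: {e}")
        return 1.0
```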
    logger.exception(
        f"Error predicting content classification for chunk: {e}. Adding to missed content classifications."
    )
    # chunk_content_scores.append(1.0)
delete this
def predict(
    self,
    queries: list[str],
) -> list[tuple[int, float]]:
If possible it would be nice to use a BaseModel here instead of tuple[int, float].
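A named result type along those lines might look like the following. It is sketched with a stdlib dataclass so it runs without pydantic; a pydantic BaseModel version would be analogous. The class and field names are hypothetical, and the dummy scores stand in for real model output:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentClassificationPrediction:
    """Named fields instead of an anonymous tuple[int, float]."""
    predicted_label: int
    content_boost_factor: float

def predict(queries: list[str]) -> list[ContentClassificationPrediction]:
    # Dummy output standing in for the real model call (assumption).
    return [
        ContentClassificationPrediction(predicted_label=1, content_boost_factor=0.9)
        for _ in queries
    ]
```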
title_embedding=list(model.encode(f"search_document: {overview_title}")),
content_embedding=list(
    model.encode(f"search_document: {overview_title}\n{overview}")
),
Why do you have to modify document seeding?
Description
We are addressing the issue of short chunks - particularly Slack messages - ranking very high in various keyword-centric searches.
The approach consists of:
For 2) specifically, we introduce a new Vespa attribute 'aggregated_boost_factor', which defaults to 1 and initially represents the impact of the content classification. (Over time, more factors will go into this boost factor.)
during indexing:
if new parameter 'USE_CONTENT_CLASSIFICATION' is true (default)
during querying:
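A minimal sketch of how a multiplicative boost factor of this kind could fold into a query-time score (function and parameter names are hypothetical; the actual Vespa ranking expression may differ):

```python
def apply_aggregated_boost(base_score: float, aggregated_boost_factor: float = 1.0) -> float:
    # The default of 1.0 leaves ranking unchanged; chunks classified as
    # low-information get a factor below 1.0 at indexing time and are
    # demoted here at query time.
    return base_score * aggregated_boost_factor
```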
How Has This Been Tested?
So far not fully tested, as the model has not been uploaded yet. Tests were conducted using dummy outputs.
REQUIRES FULL TESTING
Backporting (check the box to trigger backport action)
Note: you have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.