
[feat] Integrate NanoBeIR datasets #2966

Open · wants to merge 18 commits into master

Conversation

ArthurCamara
Contributor

As discussed in #2848 (comment), this PR adds a new evaluator based on the NanoBEIR collection of datasets.

It creates one InformationRetrievalEvaluator per dataset and aggregates the results across datasets.

Example:

import logging

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

logger = logging.getLogger(__name__)

# Load a model
model = SentenceTransformer("all-mpnet-base-v2")

datasets = ["QuoraRetrieval", "MSMARCO"]
query_prompts = {
    "QuoraRetrieval": "Instruct: Given a question, retrieve questions that are semantically equivalent to the given question\nQuery: ",
    "MSMARCO": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ",
}

evaluator = NanoBEIREvaluator(
    dataset_names=datasets,
    name="NanoBEIR",
    query_prompts=query_prompts,
)

results = evaluator(model)
'''
NanoBeIR Evaluation of the model on ['QuoraRetrieval', 'MSMARCO'] datasets:
Evaluating NanoBeIRNanoQuoraRetrieval
Evaluating NanoBeIRNanoMSMARCO

Average Queries: 50.0
Average Corpus: 5044.5

Aggregated for Score Function: cosine
Accuracy@1: 39.00%
Accuracy@3: 57.00%
Accuracy@5: 66.00%
Accuracy@10: 77.00%
Precision@1: 39.00%
Recall@1: 34.03%
Precision@3: 20.67%
Recall@3: 54.07%
Precision@5: 15.00%
Recall@5: 64.27%
Precision@10: 8.90%
Recall@10: 75.97%
MRR@10: 0.5004
NDCG@10: 0.5513
Aggregated for Score Function: dot
Accuracy@1: 39.00%
Accuracy@3: 57.00%
Accuracy@5: 66.00%
Accuracy@10: 77.00%
Precision@1: 39.00%
Recall@1: 34.03%
Precision@3: 20.67%
Recall@3: 54.07%
Precision@5: 15.00%
Recall@5: 64.27%
Precision@10: 8.90%
Recall@10: 75.97%
MRR@10: 0.5004
NDCG@10: 0.5513
'''
logger.info(evaluator.primary_metric)
# => "cosine_ndcg@10"
logger.info(results["mean"][evaluator.primary_metric])
# => 0.5512516989358924

(Note that this depends on #2951)
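
For reference, a minimal sketch of how this evaluator plugs into training like any other evaluator, assuming the standard SentenceTransformerTrainer API (the tiny train dataset and the loss below are placeholders, not part of this PR):

from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainer, losses

# Placeholder training data: a single (anchor, positive) pair.
train_dataset = Dataset.from_dict({
    "anchor": ["What is the capital of France?"],
    "positive": ["Paris is the capital of France."],
})
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,  # NanoBEIR metrics are reported alongside the eval loop
)
trainer.train()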

@tomaarsen
Copy link
Collaborator

Although the "Be" portion obviously stands for Benchmark, the abbreviation "BEIR" is usually fully capitalized, so I'd like to propagate that capitalization in this PR as well.

@tomaarsen
Collaborator

tomaarsen commented Oct 17, 2024

I'm experimenting with having all outputs in the final dict, rather than a nested dict. This way, people can use any value from the evaluator to guide, e.g., their early stopping. It should also match the SequentialEvaluator behavior, even though the NanoBEIR results become a bit hectic (i.e., one massive flat dict). See the sketch below.

I hope it's okay if I push into this PR!
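
To illustrate the difference, a short sketch of the two layouts (the exact key names here are hypothetical, assuming dataset-prefixed metrics):

results = evaluator(model)

# Nested layout: one sub-dict per dataset plus an aggregate, e.g.:
#   results["QuoraRetrieval"]["cosine_ndcg@10"]            (hypothetical keys)

# Flattened layout: every metric is a top-level key, so a callback can
# monitor any single value directly, e.g. for early stopping:
#   results["NanoBEIR_NanoQuoraRetrieval_cosine_ndcg@10"]  (hypothetical key)
#   results["NanoBEIR_mean_cosine_ndcg@10"]                (hypothetical key)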
