Update results for Russian models #19

Merged

Conversation

@artemsnegirev (Contributor) commented Aug 15, 2024

This PR updates results for several models to refresh the Russian leaderboard (embeddings-benchmark/leaderboard#26).

Updated models:

  • BAAI__bge-m3
  • ai-forever__sbert_large_nlu_ru
  • intfloat__e5-mistral-7b-instruct
  • intfloat__multilingual-e5-large
  • intfloat__multilingual-e5-small
  • ai-forever__sbert_large_mt_nlu_ru
  • cointegrated__rubert-tiny2
  • intfloat__multilingual-e5-base
  • intfloat__multilingual-e5-large-instruct

Most of the update brings only minor changes, but all results now come from a single mteb version (1.14.12) and kg_co2_emissions is now computed (it wasn't before). Results for the multilingual MassiveIntent, MassiveScenario, and STS22 tasks are unchanged. Instruct models now use the detailed instructions from embeddings-benchmark/mteb#1163.
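
For context, a minimal sketch of the instruction-prompt format these instruct models use. The format below follows the intfloat/e5-mistral-7b-instruct model card; the concrete task-specific instruction strings are the ones added in embeddings-benchmark/mteb#1163, and the usage example is hypothetical:

# Prompt format from the intfloat/e5-mistral-7b-instruct model card;
# the real per-task instruction strings live in embeddings-benchmark/mteb#1163.
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

# Hypothetical usage, not one of the actual ruMTEB instructions:
print(get_detailed_instruct(
    "Given a web search query, retrieve relevant passages that answer the query",
    "how does mteb seed its tasks",
))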

@artemsnegirev changed the title from "Add MIRACL results for ruMTEB" to "Update results for Russian models" on Sep 2, 2024

@Samoed (Contributor) commented Sep 2, 2024

@KennethEnevoldsen @Muennighoff Can you merge this?

@KennethEnevoldsen (Contributor) left a comment

Thanks for the ping! I looked at a few samples; these are probably worth discussing before merging.

Contributor

quite a large change in accuracy here, why is that?

Contributor Author

I was able to reproduce the results for 1.12.75, so I guess it's due to some changes between these versions (1.12.75 -> 1.14.12). And it's not only for me5-small on this task (Georeview) but for almost all models on Classification/Clustering tasks.

Do you have any hypothesis about which changes could affect the results?

Contributor Author

Found that running a MultiLabelClassification task first causes the problem. Code to reproduce:

# mteb==1.14.12

MODEL_NAME="cointegrated/rubert-tiny2"

mteb run \
    -m $MODEL_NAME -l rus --output_folder results \
    --co2_tracker true --verbosity 2 --batch_size 16 \
    -t \
       "SensitiveTopicsClassification" \
       "GeoreviewClassification"

GeoreviewClassification scores 0.408935546875 in this run, while a single run gives 0.39638671875. I also checked Retrieval and Reranking tasks and they are OK.

Contributor

So running the MultiLabelClassification task first changes the result? It seems to be an issue with a seed being manipulated.
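
A minimal sketch of that suspected mechanism (editor's illustration, not mteb's actual code): if every task draws from one process-global RNG that is seeded once, whatever runs first shifts the random stream every later task sees.

import random

def run_task(n_draws: int) -> list[int]:
    # stand-in for a task that subsamples its data via the process-global RNG
    return [random.randint(0, 9) for _ in range(n_draws)]

random.seed(42)            # suite seeds the global RNG once at startup
_ = run_task(5)            # a MultiLabelClassification task runs first...
shifted = run_task(3)      # ...so the next task sees a shifted stream

random.seed(42)
alone = run_task(3)        # the same task run alone sees a different sample

print(shifted == alone)    # False: different subsample -> different score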

Contributor Author

Yes, I will run these tasks separately and update the results then.
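
One way to get that isolation (a sketch reusing the flags from the command above; the per-process loop is the editor's assumption, not the actual rerun script): launch a fresh mteb process per task, so no task can disturb the RNG state seen by the next one.

# Sketch: one mteb process per task, with the same flags as the
# reproduction command above, so tasks cannot share global RNG state.
import subprocess

MODEL_NAME = "cointegrated/rubert-tiny2"
for task in ["SensitiveTopicsClassification", "GeoreviewClassification"]:
    subprocess.run(
        ["mteb", "run", "-m", MODEL_NAME, "-l", "rus",
         "--output_folder", "results", "--co2_tracker", "true",
         "--verbosity", "2", "--batch_size", "16", "-t", task],
        check=True,
    )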

Contributor

I will look into this in a PR as well - see if we can get it fixed.

Contributor

More generally, we should probably try to fix this issue here.

Contributor

My results for bge-m3 for comparison
new_results.tar.gz

@artemsnegirev (Contributor Author)

@KennethEnevoldsen I've updated the affected tasks. It seems Clustering also has to be run separately. I also found that clustering for the sbert/rubert models gives better results in v1.12.25, and I reproduced this difference.

@KennethEnevoldsen (Contributor)

Right, so order also matters here. Well, we knew it was a problem given embeddings-benchmark/mteb#942, but we should def. get that patched up (though it might be a major version bump)

@artemsnegirev (Contributor Author)

@KennethEnevoldsen

> Well, we knew it was a problem given embeddings-benchmark/mteb#942, but we should def. get that patched up (though it might be a major version bump)

We ran several experiments (with @Samoed) on the bge-m3 model and could not reproduce the same results. As mentioned in embeddings-benchmark/mteb#942, this is not a big problem. For now we would like to go with these results and will update them in newer versions. Waiting for #25 to be merged.

@KennethEnevoldsen (Contributor)

That works for me as well

@Samoed (Contributor) commented Sep 6, 2024

@KennethEnevoldsen I think this PR can be merged. After that, I will update the paths in my branch (or whatever else we decide)

@KennethEnevoldsen merged commit a3326d8 into embeddings-benchmark:main on Sep 6, 2024 (1 of 2 checks passed)