fix: sorting benchmark tasks by MTEB, then alphabetical #1271

Open
wants to merge 3 commits into
base: main

Conversation

sathviknallamalli
Collaborator

@sathviknallamalli sathviknallamalli commented Oct 3, 2024

Addresses #1257

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
Example printout
$ mteb available_benchmarks
──────────────────────────────────────────────────────────────────────────── MTEB(code)  ─────────────────────────────────────────────────────────────────────────────
Retrieval
    - AppsRetrieval, p2p
    - CodeEditSearchRetrieval, p2p, multilingual 13 / 13 Subsets
    - CodeFeedbackMT, p2p
    - CodeFeedbackST, p2p
    - CodeSearchNetCCRetrieval, p2p, multilingual 6 / 6 Subsets
    - CodeSearchNetRetrieval, p2p, multilingual 6 / 6 Subsets
    - CodeTransOceanContest, p2p
    - CodeTransOceanDL, p2p
    - COIRCodeSearchNetRetrieval, p2p, multilingual 6 / 6 Subsets
    - CosQA, p2p
    - StackOverflowQA, p2p
    - SyntheticText2SQL, p2p


───────────────────────────────────────────────────────────────────────────── MTEB(deu)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 1 / 4 Subsets
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Clustering
    - BlurbsClusteringP2P, p2p
    - BlurbsClusteringS2S, s2s
    - TenKGnadClusteringP2P, p2p
    - TenKGnadClusteringS2S, s2s


PairClassification
    - FalseFriendsGermanEnglish, s2s
    - PawsXPairClassification, s2s, multilingual 1 / 7 Subsets


Reranking
    - MIRACLReranking, s2s, multilingual 1 / 18 Subsets


Retrieval
    - GerDaLIR, s2p
    - GermanDPR, s2p
    - GermanQuAD-Retrieval, s2p
    - XMarket, s2p, multilingual 1 / 3 Subsets


STS
    - GermanSTSBenchmark, s2s
    - STS22, p2p, multilingual 4 / 18 Subsets


───────────────────────────────────────────────────────────────────────────── MTEB(eng)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 2 / 4 Subsets
    - AmazonPolarityClassification, p2p
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets
    - Banking77Classification, s2s
    - EmotionClassification, s2s
    - ImdbClassification, p2p
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets
    - ToxicConversationsClassification, s2s
    - TweetSentimentExtractionClassification, s2s


Clustering
    - ArxivClusteringP2P, p2p
    - ArxivClusteringS2S, s2s
    - BiorxivClusteringP2P, p2p
    - BiorxivClusteringS2S, s2s
    - MedrxivClusteringP2P, p2p
    - MedrxivClusteringS2S, s2s
    - RedditClustering, s2s
    - RedditClusteringP2P, p2p
    - StackExchangeClustering, s2s
    - StackExchangeClusteringP2P, p2p
    - TwentyNewsgroupsClustering, s2s


PairClassification
    - SprintDuplicateQuestions, s2s
    - TwitterSemEval2015, s2s
    - TwitterURLCorpus, s2s


Reranking
    - AskUbuntuDupQuestions, s2s
    - MindSmallReranking, s2s
    - SciDocsRR, s2s
    - StackOverflowDupQuestions, s2s


Retrieval
    - ArguAna, s2p
    - ClimateFEVER, s2p
    - CQADupstackAndroidRetrieval, s2p
    - CQADupstackEnglishRetrieval, s2p
    - CQADupstackGamingRetrieval, s2p
    - CQADupstackGisRetrieval, s2p
    - CQADupstackMathematicaRetrieval, s2p
    - CQADupstackPhysicsRetrieval, s2p
    - CQADupstackProgrammersRetrieval, s2p
    - CQADupstackStatsRetrieval, s2p
    - CQADupstackTexRetrieval, s2p
    - CQADupstackUnixRetrieval, s2p
    - CQADupstackWebmastersRetrieval, s2p
    - CQADupstackWordpressRetrieval, s2p
    - DBPedia, s2p
    - FEVER, s2p
    - FiQA2018, s2p
    - HotpotQA, s2p
    - MSMARCO, s2p
    - NFCorpus, s2p
    - NQ, s2p
    - QuoraRetrieval, s2s
    - SCIDOCS, s2p
    - SciFact, s2p
    - Touche2020, s2p
    - TRECCOVID, s2p


STS
    - BIOSSES, s2s
    - SICK-R, s2s
    - STS12, s2s
    - STS13, s2s
    - STS14, s2s
    - STS15, s2s
    - STS16, s2s
    - STS17, s2s, multilingual 8 / 11 Subsets
    - STS22, p2p, multilingual 5 / 18 Subsets
    - STSBenchmark, s2s


Summarization
    - SummEval, p2p


───────────────────────────────────────────────────────────────────────────── MTEB(fra)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets
    - MasakhaNEWSClassification, s2s, multilingual 1 / 16 Subsets
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Clustering
    - AlloProfClusteringP2P, p2p
    - AlloProfClusteringS2S, s2s
    - HALClusteringS2S, s2s
    - MasakhaNEWSClusteringP2P, p2p, multilingual 1 / 16 Subsets
    - MasakhaNEWSClusteringS2S, s2s, multilingual 1 / 16 Subsets
    - MLSUMClusteringP2P, p2p, multilingual 1 / 4 Subsets
    - MLSUMClusteringS2S, s2s, multilingual 1 / 4 Subsets


PairClassification
    - OpusparcusPC, s2s, multilingual 1 / 6 Subsets
    - PawsXPairClassification, s2s, multilingual 1 / 7 Subsets


Reranking
    - AlloprofReranking, s2p
    - SyntecReranking, s2p


Retrieval
    - AlloprofRetrieval, s2p
    - BSARDRetrieval, s2p
    - MintakaRetrieval, s2p, multilingual 1 / 8 Subsets
    - SyntecRetrieval, s2p
    - XPQARetrieval, s2p, multilingual 3 / 36 Subsets


STS
    - SICKFr, s2s
    - STS22, p2p, multilingual 3 / 18 Subsets
    - STSBenchmarkMultilingualSTS, s2s, multilingual 1 / 10 Subsets


Summarization
    - SummEvalFr, p2p


───────────────────────────────────────────────────────────────────────────── MTEB(kor)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - KLUE-TC, s2s


Reranking
    - MIRACLReranking, s2s, multilingual 1 / 18 Subsets


Retrieval
    - Ko-StrategyQA, s2p
    - MIRACLRetrieval, s2p, multilingual 1 / 18 Subsets


STS
    - KLUE-STS, s2s
    - KorSTS, s2s


───────────────────────────────────────────────────────────────────────────── MTEB(law)  ─────────────────────────────────────────────────────────────────────────────
Retrieval
    - AILACasedocs, p2p
    - AILAStatutes, p2p
    - GerDaLIRSmall, p2p
    - LeCaRDv2, p2p
    - LegalBenchConsumerContractsQA, s2p
    - LegalBenchCorporateLobbying, s2p
    - LegalQuAD, s2p
    - LegalSummarization, s2p


──────────────────────────────────────────────────────────────────────── MTEB(Multilingual)  ─────────────────────────────────────────────────────────────────────────
BitextMining
    - BibleNLPBitextMining, s2s, multilingual 1656 / 1656 Subsets
    - BornholmBitextMining, s2s
    - BUCC.v2, s2s, multilingual 4 / 4 Subsets
    - DiaBlaBitextMining, s2s, multilingual 2 / 2 Subsets
    - FloresBitextMining, s2s, multilingual 41412 / 41412 Subsets
    - IN22GenBitextMining, s2s, multilingual 506 / 506 Subsets
    - IndicGenBenchFloresBitextMining, s2s, multilingual 58 / 58 Subsets
    - NollySentiBitextMining, s2s, multilingual 4 / 4 Subsets
    - NorwegianCourtsBitextMining, s2s
    - NTREXBitextMining, s2s, multilingual 1916 / 1916 Subsets
    - NusaTranslationBitextMining, s2s, multilingual 11 / 11 Subsets
    - NusaXBitextMining, s2s, multilingual 11 / 11 Subsets
    - Tatoeba, s2s, multilingual 112 / 112 Subsets


Classification
    - AfriSentiClassification, s2s, multilingual 12 / 12 Subsets
    - AmazonCounterfactualClassification, s2s, multilingual 4 / 4 Subsets
    - BulgarianStoreReviewSentimentClassfication, s2s
    - CataloniaTweetClassification, s2s, multilingual 2 / 2 Subsets
    - CSFDSKMovieReviewSentimentClassification, s2s
    - CyrillicTurkicLangClassification, s2s
    - CzechProductReviewSentimentClassification, s2s
    - DalajClassification, s2s
    - DBpediaClassification, s2s
    - EstonianValenceClassification, s2s
    - FilipinoShopeeReviewsClassification, s2s
    - FinancialPhrasebankClassification, s2s
    - GreekLegalCodeClassification, s2s
    - GujaratiNewsClassification, s2s
    - IndicLangClassification, s2s
    - IndonesianIdClickbaitClassification, s2s
    - IsiZuluNewsClassification, s2s
    - ItaCaseholdClassification, s2s
    - KorSarcasmClassification, s2s
    - KurdishSentimentClassification, s2s
    - MacedonianTweetSentimentClassification, s2s
    - MasakhaNEWSClassification, s2s, multilingual 16 / 16 Subsets
    - MassiveIntentClassification, s2s, multilingual 51 / 51 Subsets
    - MultiHateClassification, s2s, multilingual 11 / 11 Subsets
    - NepaliNewsClassification, s2s
    - NordicLangClassification, s2s
    - NusaParagraphEmotionClassification, s2s, multilingual 10 / 10 Subsets
    - NusaX-senti, s2s, multilingual 12 / 12 Subsets
    - OdiaNewsClassification, s2s
    - PAC, p2p
    - PoemSentimentClassification, s2s
    - PolEmo2.0-OUT, s2s
    - PunjabiNewsClassification, s2s
    - ScalaClassification, s2s, multilingual 4 / 4 Subsets
    - SentimentAnalysisHindi, s2s
    - SinhalaNewsClassification, s2s
    - SiswatiNewsClassification, s2s
    - SlovakMovieReviewSentimentClassification, s2s
    - SwahiliNewsClassification, s2s
    - SwissJudgementClassification, s2s, multilingual 3 / 3 Subsets
    - ToxicConversationsClassification, s2s
    - TswanaNewsClassification, s2s
    - TweetTopicSingleClassification, s2s


Clustering
    - AlloProfClusteringS2S.v2, s2s
    - ArXivHierarchicalClusteringP2P, p2p
    - ArXivHierarchicalClusteringS2S, p2p
    - BigPatentClustering.v2, p2p
    - BiorxivClusteringP2P.v2, p2p
    - CLSClusteringP2P.v2, p2p
    - HALClusteringS2S.v2, s2s
    - MasakhaNEWSClusteringS2S, s2s, multilingual 16 / 16 Subsets
    - MedrxivClusteringP2P.v2, p2p
    - PlscClusteringP2P.v2, s2s
    - RomaniBibleClustering, p2p
    - SIB200ClusteringS2S, s2s, multilingual 197 / 197 Subsets
    - SNLHierarchicalClusteringP2P, p2p
    - StackExchangeClustering.v2, s2s
    - SwednClusteringP2P, p2p
    - WikiCitiesClustering, p2p
    - WikiClusteringP2P.v2, p2p, multilingual 14 / 14 Subsets


InstructionRetrieval
    - Core17InstructionRetrieval, s2p
    - News21InstructionRetrieval, s2p
    - Robust04InstructionRetrieval, s2p


MultilabelClassification
    - BrazilianToxicTweetsClassification, s2s
    - CEDRClassification, s2s
    - KorHateSpeechMLClassification, s2s
    - MalteseNewsClassification, s2s
    - MultiEURLEXMultilabelClassification, p2p, multilingual 23 / 23 Subsets


PairClassification
    - ArmenianParaphrasePC, s2s
    - CTKFactsNLI, s2s
    - indonli, s2s
    - OpusparcusPC, s2s, multilingual 6 / 6 Subsets
    - PawsXPairClassification, s2s, multilingual 7 / 7 Subsets
    - PpcPC, s2s
    - RTE3, s2s, multilingual 4 / 4 Subsets
    - SprintDuplicateQuestions, s2s
    - TERRa, s2s
    - TwitterURLCorpus, s2s
    - XNLI, s2s, multilingual 14 / 14 Subsets


Reranking
    - AlloprofReranking, s2p
    - RuBQReranking, s2p
    - T2Reranking, s2s
    - VoyageMMarcoReranking, s2s
    - WebLINXCandidatesReranking, p2p
    - WikipediaRerankingMultilingual, s2p, multilingual 16 / 16 Subsets


Retrieval
    - AILAStatutes, p2p
    - ArguAna, s2p
    - BelebeleRetrieval, s2p, multilingual 376 / 376 Subsets
    - CovidRetrieval, s2p
    - HagridRetrieval, s2p
    - LegalBenchCorporateLobbying, s2p
    - LEMBPasskeyRetrieval, s2p
    - MLQARetrieval, s2p, multilingual 49 / 49 Subsets
    - SCIDOCS, s2p
    - SpartQA, s2s
    - StackOverflowQA, p2p
    - StatcanDialogueDatasetRetrieval, s2p, multilingual 2 / 2 Subsets
    - TempReasonL1, s2s
    - TRECCOVID, s2p
    - TwitterHjerneRetrieval, p2p
    - WikipediaRetrievalMultilingual, s2p, multilingual 16 / 16 Subsets
    - WinoGrande, s2s


STS
    - FaroeseSTS, s2s
    - FinParaSTS, s2s
    - GermanSTSBenchmark, s2s
    - IndicCrosslingualSTS, s2s, multilingual 12 / 12 Subsets
    - JSICK, s2s
    - SemRel24STS, s2s, multilingual 12 / 12 Subsets
    - SICK-R, s2s
    - STS12, s2s
    - STS13, s2s
    - STS14, s2s
    - STS15, s2s
    - STS17, s2s, multilingual 11 / 11 Subsets
    - STS22.v2, p2p, multilingual 18 / 18 Subsets
    - STSB, s2s
    - STSBenchmark, s2s
    - STSES, s2s


───────────────────────────────────────────────────────────────────────────── MTEB(pol)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - AllegroReviews, s2s
    - CBD, s2s
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - PAC, p2p
    - PolEmo2.0-IN, s2s
    - PolEmo2.0-OUT, s2s


Clustering
    - EightTagsClustering, s2s
    - PlscClusteringP2P, s2s
    - PlscClusteringS2S, s2s


PairClassification
    - CDSC-E, s2s
    - PpcPC, s2s
    - PSC, s2s
    - SICK-E-PL, s2s


STS
    - CDSC-R, s2s
    - SICK-R-PL, s2s
    - STS22, p2p, multilingual 4 / 18 Subsets
    - STSBenchmarkMultilingualSTS, s2s, multilingual 1 / 10 Subsets


────────────────────────────────────────────────────────────────── MTEB(Retrieval w/Instructions)  ───────────────────────────────────────────────────────────────────
InstructionRetrieval
    - Core17InstructionRetrieval, s2p
    - News21InstructionRetrieval, s2p
    - Robust04InstructionRetrieval, s2p


───────────────────────────────────────────────────────────────────────────── MTEB(rus)  ─────────────────────────────────────────────────────────────────────────────
Classification
    - GeoreviewClassification, p2p
    - HeadlineClassification, s2s
    - InappropriatenessClassification, s2s
    - KinopoiskClassification, p2p
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets
    - RuReviewsClassification, p2p
    - RuSciBenchGRNTIClassification, p2p
    - RuSciBenchOECDClassification, p2p


Clustering
    - GeoreviewClusteringP2P, p2p
    - RuSciBenchGRNTIClusteringP2P, p2p
    - RuSciBenchOECDClusteringP2P, p2p


MultilabelClassification
    - CEDRClassification, s2s
    - SensitiveTopicsClassification, s2s


PairClassification
    - TERRa, s2s


Reranking
    - MIRACLReranking, s2s, multilingual 1 / 18 Subsets
    - RuBQReranking, s2p


Retrieval
    - MIRACLRetrieval, s2p, multilingual 1 / 18 Subsets
    - RiaNewsRetrieval, s2p
    - RuBQRetrieval, s2p


STS
    - RUParaPhraserSTS, s2s
    - RuSTSBenchmarkSTS, s2s
    - STS22, p2p, multilingual 1 / 18 Subsets


──────────────────────────────────────────────────────────────────────── MTEB(Scandinavian)  ─────────────────────────────────────────────────────────────────────────
BitextMining
    - BornholmBitextMining, s2s
    - NorwegianCourtsBitextMining, s2s


Classification
    - AngryTweetsClassification, s2s
    - DalajClassification, s2s
    - DanishPoliticalCommentsClassification, s2s
    - DKHateClassification, s2s
    - LccSentimentClassification, s2s
    - MassiveIntentClassification, s2s, multilingual 3 / 51 Subsets
    - MassiveScenarioClassification, s2s, multilingual 3 / 51 Subsets
    - NordicLangClassification, s2s
    - NoRecClassification, s2s
    - NorwegianParliamentClassification, s2s
    - ScalaClassification, s2s, multilingual 4 / 4 Subsets
    - SwedishSentimentClassification, s2s
    - SweRecClassification, s2s


Clustering
    - SNLHierarchicalClusteringP2P, p2p
    - SNLHierarchicalClusteringS2S, s2s
    - SwednClusteringP2P, p2p
    - SwednClusteringS2S, s2s
    - VGHierarchicalClusteringP2P, p2p
    - VGHierarchicalClusteringS2S, p2p


Retrieval
    - DanFEVER, p2p
    - NorQuadRetrieval, p2p
    - SNLRetrieval, p2p
    - SwednRetrieval, p2p
    - SweFaqRetrieval, s2s
    - TV2Nordretrieval, p2p
    - TwitterHjerneRetrieval, p2p


─────────────────────────────────────────────────────────────────────────────── CoIR  ────────────────────────────────────────────────────────────────────────────────
Retrieval
    - AppsRetrieval, p2p
    - CodeFeedbackMT, p2p
    - CodeFeedbackST, p2p
    - CodeSearchNetCCRetrieval, p2p, multilingual 6 / 6 Subsets
    - CodeTransOceanContest, p2p
    - CodeTransOceanDL, p2p
    - COIRCodeSearchNetRetrieval, p2p, multilingual 6 / 6 Subsets
    - CosQA, p2p
    - StackOverflowQA, p2p
    - SyntheticText2SQL, p2p


──────────────────────────────────────────────────────────────────────── MINERSBitextMining  ─────────────────────────────────────────────────────────────────────────
BitextMining
    - BUCC, s2s, multilingual 4 / 4 Subsets
    - LinceMTBitextMining, s2s, multilingual 1 / 1 Subsets
    - NollySentiBitextMining, s2s, multilingual 4 / 4 Subsets
    - NusaTranslationBitextMining, s2s, multilingual 11 / 11 Subsets
    - NusaXBitextMining, s2s, multilingual 11 / 11 Subsets
    - PhincBitextMining, s2s, multilingual 1 / 1 Subsets
    - Tatoeba, s2s, multilingual 112 / 112 Subsets
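
The ordering in the printout above (benchmarks whose names start with MTEB first, then the rest, each group alphabetical without regard to case, with tasks grouped by type and sorted the same way) can be reproduced with a small sort key. This is only a sketch of that ordering, not the code changed in mteb/evaluation/MTEB.py; the function names `benchmark_sort_key` and `group_tasks` are hypothetical, and the inputs are plain strings rather than mteb objects.

```python
from collections import defaultdict


def benchmark_sort_key(name):
    """Hypothetical sort key: MTEB-prefixed names first, then case-insensitive A-Z."""
    # False sorts before True, so names starting with "MTEB" come first.
    return (not name.startswith("MTEB"), name.casefold())


def group_tasks(tasks):
    """Group (task_type, task_name) pairs by type and sort both levels.

    Task types are sorted alphabetically; task names are sorted
    case-insensitively, so "COIR..." lands between "Code..." and "CosQA".
    """
    grouped = defaultdict(list)
    for task_type, task_name in tasks:
        grouped[task_type].append(task_name)
    return {
        task_type: sorted(grouped[task_type], key=str.casefold)
        for task_type in sorted(grouped)
    }


# Benchmark names taken from the printout, deliberately shuffled.
benchmarks = [
    "MINERSBitextMining",
    "MTEB(pol)",
    "CoIR",
    "MTEB(Multilingual)",
    "MTEB(law)",
    "MTEB(eng)",
]
ordered_benchmarks = sorted(benchmarks, key=benchmark_sort_key)

# A few tasks from the printout, deliberately shuffled.
tasks = [
    ("Retrieval", "CosQA"),
    ("STS", "KorSTS"),
    ("Retrieval", "COIRCodeSearchNetRetrieval"),
    ("Retrieval", "CodeTransOceanDL"),
    ("STS", "KLUE-STS"),
    ("Retrieval", "AppsRetrieval"),
]
ordered_tasks = group_tasks(tasks)

for name in ordered_benchmarks:
    print(name)
for task_type, names in ordered_tasks.items():
    print(task_type)
    for task_name in names:
        print(f"    - {task_name}")
```

A case-sensitive sort would place MTEB(Multilingual) before MTEB(code) and COIRCodeSearchNetRetrieval before CodeFeedbackMT, which is why the sketch compares on `casefold()` instead of the raw strings.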

@Muennighoff
Contributor

Nice; can you make sure tests pass?

@sathviknallamalli
Collaborator Author

sathviknallamalli commented Oct 3, 2024

@Muennighoff Yes, just made sure tests pass by running make tests and got:
[Screenshot of the test run output, 2024-10-02]

@Muennighoff
Contributor

Linting seems to fail, though (https://github.com/embeddings-benchmark/mteb/actions/runs/11153462560/job/31001009542?pr=1271). Maybe run `ruff format .`

@sathviknallamalli
Collaborator Author

@Muennighoff should be fixed now

Contributor

@Muennighoff Muennighoff left a comment

Nice! Just some quick comments. cc @isaac-chung, it would be great if you took a look!

mteb/evaluation/MTEB.py: 5 review comments (outdated, resolved)
@sathviknallamalli
Collaborator Author

@Muennighoff changes pushed for code efficiency

@isaac-chung changed the title from "sorting benchmark tasks by MTEB, then alphabetical" to "fix: sorting benchmark tasks by MTEB, then alphabetical" on Oct 3, 2024
Collaborator

@isaac-chung isaac-chung left a comment

Looks good. I think it's ready to merge after resolving the merge conflicts with main.

3 participants