Anserini: BM25 Baselines for MS MARCO Document Ranking

This page contains instructions for running BM25 baselines on the MS MARCO document ranking task. Note that there is a separate MS MARCO passage ranking task.

This exercise requires a machine with more than 8 GB of RAM and at least 40 GB of free disk space.

If you're a Waterloo undergraduate going through this guide as the screening exercise for joining my research group, make sure you do the passage ranking exercise first. Similarly, try to understand what you're actually doing, instead of simply cargo culting (i.e., blindly copying and pasting commands into a shell).

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:

mkdir collections/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc

# Alternative mirror:
# wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-docs.trec.gz -P collections/msmarco-doc

To confirm that the download is intact, check that msmarco-docs.trec.gz has an MD5 checksum of d4863e4f342982b51b9a8fc668b2d0c0.
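
One way to run this check is with a short script. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative and not part of Anserini, and the path in the usage note assumes the download location used above.

```python
import hashlib

# Expected checksum, as stated in this guide.
EXPECTED_MD5 = "d4863e4f342982b51b9a8fc668b2d0c0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_expected("collections/msmarco-doc/msmarco-docs.trec.gz") should return True for an intact download. On most Linux systems, running md5sum on the file prints the same digest.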

Indexing

There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:

target/appassembler/bin/IndexCollection \
  -collection CleanTrecCollection \
  -input collections/msmarco-doc \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -generator DefaultLuceneDocumentGenerator \
  -threads 1 \
  -storePositions -storeDocvectors -storeRaw

On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed.

Retrieval

After indexing finishes, we can do a retrieval run. The dev queries are already stored in our repo:

target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.dev.bm25.txt \
  -parallelism 4 \
  -bm25 -hits 1000

Retrieval speed will vary by machine. On a reasonably modern desktop with an SSD and four threads (as specified above), the run takes less than five minutes. Adjust the number of threads by changing the -parallelism argument.

After the run completes, we can evaluate with trec_eval:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
    tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2310
recall_1000           	all	0.8856
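
As a rough guide to what these two numbers measure, here is a minimal sketch of average precision and recall at a cutoff for binary relevance judgments. It is an illustration only, not trec_eval's implementation: the function names and the in-memory run and qrels dictionaries are assumptions, and details such as tie-breaking are ignored. The averaging function scores queries that are missing from the run as zero, which is the effect of the -c option used above.

```python
def average_precision(ranked_docids, relevant):
    """AP for one query: sum of precision at each relevant hit, divided by the number of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall_at_k(ranked_docids, relevant, k=1000):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    found = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return found / len(relevant)


def mean_over_judged_queries(metric, run, qrels):
    """Average a per-query metric over every query in qrels; queries absent from the run score 0."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        total += metric(run.get(qid, []), relevant)
    return total / len(qrels)
```

Here run maps a query id to its ranked list of document ids, and qrels maps a query id to the set of relevant document ids. MAP is then mean_over_judged_queries(average_precision, run, qrels).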

Let's compare to the baselines provided by Microsoft. First, download:

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz

Then, run trec_eval to compare. Note that to be fair, we restrict evaluation to the top 100 hits per topic (which is what Microsoft provides):

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
    tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map                   	all	0.2219

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
    tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2302

We see that "out of the box" Anserini is already better!

This dataset is part of the MS MARCO Document Ranking Leaderboard. Let's try to reproduce runs on there!

A few minor details to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, and the submission files to the leaderboard have a slightly different format.
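
To make the format difference concrete, the sketch below converts TREC-style run lines into the tab-separated layout used for MS MARCO submissions. It assumes that a TREC run line has six whitespace-separated fields (query id, the literal Q0, document id, rank, score, run tag) and that the MS MARCO layout is query id, document id, and rank separated by tabs; the function name is hypothetical.

```python
def trec_to_msmarco(trec_lines, max_hits=100):
    """Convert TREC-format run lines to 'qid<TAB>docid<TAB>rank' lines, keeping ranks up to max_hits."""
    converted = []
    for line in trec_lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rank, _score, _tag = parts
        # Assumes the rank column is 1-based and already reflects the ranking.
        if int(rank) <= max_hits:
            converted.append(f"{qid}\t{docid}\t{rank}")
    return converted
```

In practice no conversion step is needed here, since the -format msmarco option in the command that follows asks SearchCollection to write this layout directly.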

target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 100

The command above uses the default BM25 parameters (k1=0.9, b=0.4) and sets -hits 100 to match the MRR@100 cutoff. Command for evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
#####################

The above run corresponds to "Anserini's BM25, default parameters (k1=0.9, b=0.4)" on the leaderboard.
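
For readers unfamiliar with the metric, here is a minimal sketch of MRR@100 under simple assumptions: relevance is binary, the run and qrels are held in dictionaries, and the mean is taken over all judged queries. The function names are illustrative, and the official eval script may handle edge cases differently.

```python
def reciprocal_rank(ranked_docids, relevant, k=100):
    """1/rank of the first relevant doc within the top k, or 0.0 if there is none."""
    for rank, docid in enumerate(ranked_docids[:k], start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank over all judged queries; queries with no relevant hit contribute 0."""
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(run.get(qid, []), rel, k) for qid, rel in qrels.items())
    return total / len(qrels)
```

Because only the first relevant document counts, hits beyond rank 100 can never change MRR@100, which is why retrieving 100 hits per query is enough for this metric.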

Here's the invocation for BM25 with parameters optimized for recall@100 (k1=4.46, b=0.82):

target/appassembler/bin/SearchCollection \
  -index indexes/msmarco-doc/lucene-index-msmarco \
  -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
  -topicreader TsvInt \
  -output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -format msmarco \
  -parallelism 4 \
  -bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100

Command for evaluation:

$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

More details on tuning BM25 parameters below...

BM25 Tuning

It is well known that BM25 parameter tuning is important. The setting of k1=0.9, b=0.4 is often used as a default.
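
To see what the two parameters control, here is a sketch of the term-frequency component of BM25 in its textbook form. This is an illustration, not Anserini's or Lucene's code (their implementation may differ in constant factors and in how document length is stored), and the function name is hypothetical. k1 governs how quickly repeated occurrences of a term stop adding to the score, and b governs how strongly long documents are penalized.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Textbook BM25 term-frequency factor, before multiplying by the term's IDF."""
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: as tf grows, the factor approaches k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

For a document of average length, a term that occurs 10 times gets a factor of about 1.74 with k1=0.9 and about 3.48 with k1=3.8, so a larger k1 lets repeated terms keep contributing for longer.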

Let's try to do better! We tuned BM25 using the queries found here: these are five different sets of 10k samples from the training queries (using the shuf command). The basic approach is grid search of parameter values in tenth increments. We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization). In separate trials, we optimized for:

recall@100, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).

It turns out that optimizing for MRR@10 and MAP yields the same settings.
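
The tuning procedure described above can be sketched as follows. This is a schematic under stated assumptions, not the script we actually used: evaluate stands for a hypothetical callback that runs retrieval with the given k1 and b on one query set and returns the metric being optimized, and the function names and grid ranges are illustrative.

```python
from itertools import product
from statistics import mean


def grid(start, stop, step=0.1):
    """Values from start to stop inclusive in fixed increments, rounded to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def best_params(evaluate, k1_values, b_values):
    """Exhaustive grid search: return the (k1, b) pair with the highest metric value."""
    return max(product(k1_values, b_values), key=lambda pair: evaluate(*pair))


def tune_and_average(evaluators, k1_values, b_values):
    """Tune on each query set independently, then average the winning parameters."""
    winners = [best_params(ev, k1_values, b_values) for ev in evaluators]
    avg_k1 = mean(k1 for k1, _ in winners)
    avg_b = mean(b for _, b in winners)
    return avg_k1, avg_b
```

With five query sets, evaluators would hold five callbacks, one per set. Averaging the per-set winners can produce values that are not themselves on the tenth-spaced grid.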

Here's the comparison between different parameter settings:

| Setting | MRR@100 | MAP | Recall@1000 |
|:--------|--------:|----:|------------:|
| Default (`k1=0.9`, `b=0.4`) | 0.2301 | 0.2310 | 0.8856 |
| Optimized for MRR@100/MAP (`k1=3.8`, `b=0.87`) | 0.2784 | 0.2789 | 0.9326 |
| Optimized for recall@100 (`k1=4.46`, `b=0.82`) | 0.2770 | 0.2775 | 0.9357 |

As expected, BM25 tuning makes a big difference!

Note that MRR@100 is computed with the leaderboard eval script (with 100 hits per query), while the other two metrics are computed with trec_eval (with 1000 hits per query). So, we need separate retrieval runs with different -hits and -format settings, for example:

$ target/appassembler/bin/SearchCollection \
    -index indexes/msmarco-doc/lucene-index-msmarco \
    -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
    -topicreader TsvInt \
    -output runs/run.msmarco-doc.dev.opt-mrr.txt \
    -parallelism 4 \
    -bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 1000

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
    tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
map                   	all	0.2789
recall_1000           	all	0.9326

$ target/appassembler/bin/SearchCollection \
    -index indexes/msmarco-doc/lucene-index-msmarco \
    -topics tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
    -topicreader TsvInt \
    -output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -format msmarco \
    -parallelism 4 \
    -bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 100

$ python tools/scripts/msmarco/msmarco_doc_eval.py \
    --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
    --run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
#####################
MRR @100: 0.27836767424339787
QueriesRanked: 5193
#####################

That's it!

Reproduction Log*

Results reproduced by @edwinzhng on 2020-01-14 (commit 3964169)
Results reproduced by @nikhilro on 2020-01-21 (commit 631589e)
Results reproduced by @yuki617 on 2020-03-29 (commit 074723c)
Results reproduced by @HangCui0510 on 2020-04-23 (commit 0ae567d)
Results reproduced by @x65han on 2020-04-25 (commit f5496b9)
Results reproduced by @y276lin on 2020-04-26 (commit 8f48f8e)
Results reproduced by @stephaniewhoo on 2020-04-26 (commit 8f48f8e)
Results reproduced by @YimingDou on 2020-05-14 (commit 3b0a642)
Results reproduced by @richard3983 on 2020-05-14 (commit a65646f)
Results reproduced by @MXueguang on 2020-05-20 (commit 3b2751e)
Results reproduced by @shaneding on 2020-05-23 (commit b6e0367)
Results reproduced by @kelvin-jiang on 2020-05-24 (commit b6e0367)
Results reproduced by @adamyy on 2020-05-28 (commit a1ecfa4)
Results reproduced by @TianchengY on 2020-05-28 (commit 2947a16)
Results reproduced by @stariqmi on 2020-05-28 (commit 4914305)
Results reproduced by @justinborromeo on 2020-06-11 (commit 7954eab)
Results reproduced by @yxzhu16 on 2020-07-03 (commit 68ace26)
Results reproduced by @LizzyZhang-tutu on 2020-07-13 (commit 8c98d5b)
Results reproduced by @estella98 on 2020-08-05 (commit 99092a8)
Results reproduced by @tangsaidi on 2020-08-19 (commit aba846)
Results reproduced by @qguo96 on 2020-09-07 (commit e16b3c1)
Results reproduced by @yuxuan-ji on 2020-09-08 (commit 0f9a8ec)
Results reproduced by @wiltan-uw on 2020-09-09 (commit 93d913f)
Results reproduced by @JeffreyCA on 2020-09-13 (commit bc2628b)
Results reproduced by @jhuang265 on 2020-10-15 (commit 66711b9)
Results reproduced by @rayyang29 on 2020-10-27 (commit ad8cc5a)
Results reproduced by @Dahlia-Chehata on 2020-11-12 (commit 22c0ad3)
Results reproduced by @rakeeb123 on 2020-12-07 (commit f50dcce)
Results reproduced by @jrzhang12 on 2021-01-02 (commit be4e44d)
Results reproduced by @HEC2018 on 2021-01-04 (commit 4de21ec)
Results reproduced by @KaiSun314 on 2021-01-08 (commit 113f1c7)
Results reproduced by @yemiliey on 2021-01-18 (commit 179c242)
Results reproduced by @larryli1999 on 2021-01-22 (commit 179c242)
Results reproduced by @ArthurChen189 on 2021-04-08 (commit 45a5a21)
Results reproduced by @printfCalvin on 2021-04-11 (commit d808d4a)
Results reproduced by @saileshnankani on 2021-04-26 (commit 5781c87)
Results reproduced by @andrewyguo on 2021-04-29 (commit 71f3ca6)
Results reproduced by @mayankanand007 on 2021-05-04 (commit 906ca50)
Results reproduced by @Albert-Ma on 2021-05-07 (commit 5bcbccd)
Results reproduced by @rootofallevii on 2021-05-14 (commit 626da95)
Results reproduced by @jpark621 on 2021-06-01 (commit 2591e06)
Results reproduced by @nimasadri11 on 2021-06-27 (commit 6f9352f)
Results reproduced by @mzzchy on 2021-07-05 (commit 589928b)
Results reproduced by @d1shs0ap on 2021-07-16 (commit 43ad899)
Results reproduced by @apokali on 2021-08-19 (commit ad4caeb)
Results reproduced by @leungjch on 2021-09-12 (commit f79fb67)
Results reproduced by @AlexWang000 on 2021-10-10 (commit fc2ddb0)
Results reproduced by @ToluClassics on 2021-10-20 (commit fcc2aff)
Results reproduced by @manveertamber on 2021-12-05 (commit aee51ad)
Results reproduced by @lingwei-gu on 2021-12-15 (commit 30605f5)
Results reproduced by @tyao-t on 2021-12-18 (commit 6500560)
Results reproduced by @kevin-wangg on 2022-01-04 (commit c3e14dc)
Results reproduced by @vivianliu0 on 2022-01-06 (commit c3e14dc)
Results reproduced by @mikhail-tsir on 2022-01-07 (commit 806ac89)
Results reproduced by @AceZhan on 2022-01-14 (commit 7ff99e0)
Results reproduced by @jh8liang on 2022-02-06 (commit 5cdf9ec)
Results reproduced by @mayankanand007 on 2022-02-22 (commit 6a70804)
Results reproduced by @jasper-xian on 2022-03-27 (commit 2e8e9fd)
Results reproduced by @jx3yang on 2022-04-25 (commit b429218)
Results reproduced by @AreelKhan on 2022-04-27 (commit 7adee1d)
Results reproduced by @alvind1 on 2022-05-05 (commit 9b2dd5f5)
Results reproduced by @Pie31415 on 2022-06-22 (commit 6aef2eb)
Results reproduced by @aivan6842 on 2022-07-11 (commit 8010d5c)
Results reproduced by @Jasonwu-0803 on 2022-09-27 (commit b5ecc5a)
Results reproduced by @limelody on 2022-09-27 (commit 252b5e2)
Results reproduced by @minconszhang on 2022-11-25 (commit 6556550)
Results reproduced by @jingliu on 2022-12-08 (commit 6872c87)
Results reproduced by @farazkh80 on 2022-12-18 (commit 4527a5d)
Results reproduced by @Cath on 2023-01-14 (commit 732cba4)
Results reproduced by @dlrudwo1269 on 2023-03-07 (commit 4b7662c7)
Results reproduced by @aryamancodes on 2023-04-11 (commit ed89401)
Results reproduced by @Jocn2020 on 2023-04-30 (commit 30269d6)
Results reproduced by @zoehahaha on 2023-05-12 (commit b429218)
Results reproduced by @billcui57 on 2023-05-14 (commit d82b6f7)
Results reproduced by @Richard5678 on 2023-06-13 (commit 4aeb3ef)
Results reproduced by @pratyushpal on 2023-07-14 (commit 17d5fc7)
