OpenMatch v2 is an all-in-one toolkit for information retrieval (IR) currently under active development. It supports training and evaluation of various dense retrievers and re-rankers with deep integration of Huggingface Transformers and Datasets.
- Human-friendly interface for dense retriever and re-ranker training and testing
- Various PLMs supported (BERT, RoBERTa, T5...)
- Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
- Deep integration with Huggingface Transformers and Datasets
- Efficient training and inference via stream-style data loading
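To illustrate what stream-style data loading means in general, here is a minimal sketch that reads a JSONL file lazily and groups records into batches. It is not OpenMatch's loader: the function names `stream_jsonl` and `batched` are hypothetical, and the only assumption taken from this README is that training data is stored as JSONL.

```python
import json


def stream_jsonl(path):
    """Yield one parsed record per non-empty line without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(records, batch_size):
    """Group an iterable of records into lists of at most `batch_size` items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Because both functions are generators, memory use is bounded by the batch size rather than by the size of the file.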
To install OpenMatch v2, follow these steps:
git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .
Note: -e means editable, i.e. you can change the code directly in your directory.
We do not include all the requirements in the package. You may need to manually install some dependencies based on your environment:
- torch and tensorboard for model training and visualization. Install with:
pip install torch tensorboard
- faiss for dense retrieval. Choose between faiss-cpu and faiss-gpu depending on your system, and make sure that the faiss-gpu version matches your CUDA environment. Install faiss with:
conda install faiss-cpu -c pytorch
# or
conda install faiss-gpu -c pytorch
Note: If you encounter GPU search errors (especially with CUDA >= 11.0), you may need to install faiss-gpu manually via conda instead of pip.
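Since these dependencies are installed separately, it can help to check that they are importable before starting. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical and not part of OpenMatch.

```python
import importlib.util

# Module names as they are imported in Python. Both faiss-cpu and faiss-gpu
# provide a module named "faiss".
OPTIONAL_MODULES = ["torch", "tensorboard", "faiss"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(OPTIONAL_MODULES)
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
    else:
        print("All optional dependencies found.")
```

This only confirms that the modules can be found. It does not verify that faiss-gpu matches your CUDA environment.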
This section demonstrates how to set up and run a simple retrieval task using OpenMatch v2.
First, select a supported dataset for training and evaluation, such as MS MARCO:
wget --no-check-certificate https://rocketqa.bj.bcebos.com/corpus/marco.tar.gz
tar -zxf marco.tar.gz
rm marco.tar.gz
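Next, train a dense retriever. The commands below use environment variables such as $CHECKPOINT_DIR, $PROCESSED_DIR, and $LOG_DIR as placeholders; set them to directories of your choice:
python -m openmatch.driver.train_dr \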
--output_dir $CHECKPOINT_DIR/msmarco/t5 \
--model_name_or_path bert-base-uncased \
--do_train \
--save_steps 20000 \
--eval_steps 20000 \
--train_path $PROCESSED_DIR/msmarco/t5/train.new.jsonl \
--eval_path $PROCESSED_DIR/msmarco/t5/val.jsonl \
--fp16 \
--per_device_train_batch_size 8 \
--num_train_epochs 3 \
--learning_rate 5e-6 \
--logging_dir $LOG_DIR/msmarco/t5 \
--evaluation_strategy steps
Then encode the corpus into embeddings with the trained checkpoint:
python -m openmatch.driver.build_index \
--output_dir $EMBEDDING_DIR/msmarco/t5 \
--model_name_or_path $CHECKPOINT_DIR/msmarco/t5 \
--per_device_eval_batch_size 256 \
--corpus_path $COLLECTION_DIR/marco/corpus.tsv \
--q_max_len 32 \
--p_max_len 128 \
--fp16
Retrieve passages for the dev queries and save the results in TREC format:
python -m openmatch.driver.retrieve \
--output_dir $RESULT_DIR/msmarco/t5 \
--model_name_or_path $CHECKPOINT_DIR/msmarco/t5 \
--query_path $COLLECTION_DIR/marco/dev.query.txt \
--trec_save_path $RESULT_DIR/msmarco/t5/dev.trec \
--fp16
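Conceptually, dense retrieval scores every passage embedding against the query embedding and keeps the highest-scoring passages. The sketch below shows this with exhaustive inner-product search in pure Python. It is an illustration only: OpenMatch relies on faiss for the actual search, and the function names, the toy vectors, and the choice of inner product as the similarity are assumptions made for this example.

```python
import heapq


def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(query_vec, passage_vecs, k=10):
    """Return the k best (passage_id, score) pairs, highest score first.

    passage_vecs maps a passage id to its embedding vector.
    """
    scored = ((pid, inner_product(query_vec, vec)) for pid, vec in passage_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, made up for illustration.
    passages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.5, 0.5]}
    query = [1.0, 0.2]
    for pid, score in retrieve_top_k(query, passages, k=2):
        print(pid, score)
```

Exhaustive search is linear in the corpus size, which is why a dedicated library is used for real collections.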
Finally, evaluate the retrieval results against the relevance judgments:
# Specify your evaluation metric with -m (e.g., MRR@10)
python scripts/evaluate.py \
-m mrr.10 \
$COLLECTION_DIR/marco/qrels.dev.tsv \
$RESULT_DIR/msmarco/t5/dev.trec
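For readers unfamiliar with the metric, MRR@10 is the mean over queries of the reciprocal rank of the first relevant document within the top 10 results, counting 0 when none appears. The sketch below computes it from in-memory data. It is not the code in scripts/evaluate.py: the function name `mrr_at_k`, the input layout, and the choice to average over all queries that have relevance judgments are assumptions.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank at cutoff k.

    qrels maps a query id to the set of relevant document ids.
    run maps a query id to its ranked list of document ids, best first.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(qrels)


if __name__ == "__main__":
    # Toy judgments and rankings, made up for illustration.
    qrels = {"q1": {"d1"}, "q2": {"d5"}}
    run = {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d6"]}
    print(mrr_at_k(qrels, run))  # q1 contributes 1/2, q2 contributes 0 -> 0.25
```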
Note: This Quick Start Guide provides a streamlined process for setting up and training a dense retrieval model with OpenMatch v2. For more detailed instructions or advanced configurations, refer to the documentation.
We are actively working on the docs.
- Zhiyuan Liu, Tsinghua University
- Zhenghao Liu, Northeastern University
- Chenyan Xiong, Microsoft Research AI
- Maosong Sun, Tsinghua University
Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.
We welcome contributions! To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Open a pull request, ensuring that your code passes all tests and follows the project’s style guidelines.
For any inquiries, please contact [email protected].