This repo provides the code for reproducing the experiments in the paper COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning (EMNLP 2022 Main Conference).
COCO-DR is a domain adaptation method for training zero-shot dense retrievers. It combines continuous contrastive learning (COCO) with implicit distributionally robust optimization (iDRO), and achieves significant improvements over other zero-shot models without relying on billion-scale models, seq2seq models, or cross-encoder distillation.
- BEIR Performance
- Model Checkpoints
- Using COCO-DR with Huggingface
- Train COCO-DR
- Evaluating on BEIR
- Bugs or Questions?
- Citation
Model | BM25 | DocT5query | GTR | CPT-text | GPL | COCO-DR Base | COCO-DR Large |
---|---|---|---|---|---|---|---|
# of Parameters | --- | --- | 4.8B | 178B | 66M*18 | 110M | 335M |
Avg. on BEIR CPT sub | 0.484 | 0.495 | 0.516 | 0.528 | 0.516 | 0.520 | 0.540 |
Avg. on BEIR | 0.428 | 0.440 | 0.458 | --- | 0.459 | 0.461 | 0.484 |
Note:
- GPL trains a separate model for each task and uses cross-encoders for distillation.
- CPT-text evaluates only on 11 selected subsets of the BEIR benchmark (the "Avg. on BEIR CPT sub" row above).
- We use the docker image `mmdog/pytorch:pytorch1.9.0-nccl2.9.9-cuda11.3` for all our experiments.
- For additional packages, please run the installation commands provided in the corresponding folders.
We use the BEIR corpora for the COCO step and the MS MARCO dataset for the iDRO step. The procedures for obtaining these datasets are described below.
- We use the dataset from the same source as the ANCE paper. The commands for downloading the MS MARCO dataset can be found in `commands/data_download.sh`.
- We use the dataset released by the original BEIR repo. It can be downloaded at this link.
- Note that due to copyright restrictions, some datasets are not available.
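For convenience, the public BEIR datasets can also be fetched programmatically with the `beir` package. The snippet below is a minimal sketch rather than part of this repo's scripts; the dataset name and output directory are placeholders, and the download URL is the one used in the BEIR examples.

```python
# Minimal sketch (not part of this repo): fetching a public BEIR dataset with the `beir` package.
# The dataset name and output directory below are placeholders.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"  # any public BEIR dataset identifier
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: {doc_id: {"title": ..., "text": ...}}, queries: {qid: text}, qrels: relevance judgments
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
```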
To reproduce the experiments, follow the steps below:
The code for reproducing COCO pretraining is in the `COCO` folder. Please check out `COCO/README.md` for detailed instructions. Note that we start COCO pretraining from the Condenser checkpoint; we release a Condenser checkpoint with BERT Large as the backbone at this link.
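As a quick sanity check (not a replacement for the instructions in `COCO/README.md`), the released Condenser Large checkpoint should be loadable with `transformers` as the initialization for COCO pretraining:

```python
# Sketch: loading the released Condenser (BERT Large backbone) checkpoint that
# COCO pretraining starts from; the actual pretraining loop lives in the COCO folder.
from transformers import AutoModel

condenser = AutoModel.from_pretrained("OpenMatch/condenser-large")
```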
- BM25 Warmup
  - The code for BM25 warmup is in the `warmup` folder.
- Training with global hard negatives (ANCE)
  - The code for ANCE fine-tuning is in the `ANCE` folder.
The code for evaluation on BEIR is in the `evaluate` folder.
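The scripts in `evaluate` are the authoritative path. For a quick end-to-end sanity check, the sketch below shows one possible way to plug a COCO-DR checkpoint into the BEIR toolkit's exact dense search; the `CocoDrEncoder` wrapper, its batch size, sequence length, and the dataset path are our illustrative assumptions, with [CLS] pooling matching the Huggingface usage example later in this README.

```python
# Illustrative sketch (not this repo's evaluation script): scoring a COCO-DR checkpoint
# on a BEIR dataset with the beir toolkit's exact dense search.
import torch
from transformers import AutoModel, AutoTokenizer
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES


class CocoDrEncoder:
    """Exposes encode_queries / encode_corpus as expected by DenseRetrievalExactSearch."""

    def __init__(self, model_name="OpenMatch/cocodr-base-msmarco", max_length=128):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device).eval()
        self.max_length = max_length

    @torch.no_grad()
    def _encode(self, texts, batch_size):
        vectors = []
        for start in range(0, len(texts), batch_size):
            batch = self.tokenizer(
                texts[start:start + batch_size], padding=True, truncation=True,
                max_length=self.max_length, return_tensors="pt",
            ).to(self.device)
            # [CLS] embedding from the final layer, as in the usage example below
            vectors.append(self.model(**batch).last_hidden_state[:, 0].cpu())
        return torch.cat(vectors).numpy()

    def encode_queries(self, queries, batch_size=64, **kwargs):
        return self._encode(queries, batch_size)

    def encode_corpus(self, corpus, batch_size=64, **kwargs):
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self._encode(texts, batch_size)


# "datasets/scifact" is a placeholder path to a BEIR dataset obtained as in the data section above.
corpus, queries, qrels = GenericDataLoader(data_folder="datasets/scifact").load(split="test")
retriever = EvaluateRetrieval(DRES(CocoDrEncoder(), batch_size=64), score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```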
We release the following checkpoints for both COCO-DR Base and COCO-DR Large to facilitate future studies:
- Pretrained model after COCO step w/o finetuning on MS MARCO.
- Pretrained model after iDRO step.
- Pretrained model after iDRO step (but w/o COCO). Note: this model is trained without any BEIR task information.
Model Name | Avg. on BEIR | Link |
---|---|---|
COCO-DR Base | 0.461 | OpenMatch/cocodr-base-msmarco |
COCO-DR Base (w/o COCO) | 0.447 | OpenMatch/cocodr-base-msmarco-idro-only |
COCO-DR Base (w/ BM25 Warmup) | 0.435 | OpenMatch/cocodr-base-msmarco-warmup |
COCO-DR Base (w/o Finetuning on MS MARCO) | 0.288 | OpenMatch/cocodr-base |
COCO-DR Large | 0.484 | OpenMatch/cocodr-large-msmarco |
COCO-DR Large (w/o COCO) | 0.462 | OpenMatch/cocodr-large-msmarco-idro-only |
COCO-DR Large (w/ BM25 Warmup) | 0.456 | OpenMatch/cocodr-large-msmarco-warmup |
COCO-DR Large (w/o Finetuning on MS MARCO) | 0.316 | OpenMatch/cocodr-large |
Note: We found a mismatch between the version of the HotpotQA dataset we used and the version used in BEIR. We have rerun the evaluation and updated the HotpotQA numbers using the latest version in BEIR.
In addition, to ensure reproducibility (especially for BERT Large), we also provide checkpoints for several important baselines that we re-implemented.
Model Name | Link |
---|---|
Condenser Large (w/o Finetuning on MS MARCO) | OpenMatch/condenser-large |
coCondenser Large (w/o Finetuning on MS MARCO) | OpenMatch/co-condenser-large |
coCondenser Large (Fine-tuned on MS MARCO) | OpenMatch/co-condenser-large-msmarco |
Pre-trained models can be loaded through the HuggingFace transformers library:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco")
```
Then embeddings for different sentences can be obtained by doing the following:
```python
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, :1].squeeze(1)  # the embedding of the [CLS] token after the final layer
```
Then similarity scores between the different sentences are obtained with a dot product between the embeddings:
```python
score01 = embeddings[0] @ embeddings[1]  # 216.9792
score02 = embeddings[0] @ embeddings[2]  # 216.6684
```
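For more than a couple of passages, the same dot-product scoring extends to a simple ranking; the snippet below is a small illustrative extension of the example above (the variable names follow that example).

```python
import torch

# Rank the two passages against the query (embeddings[0]) by dot-product score.
query_emb = embeddings[0]
passage_embs = embeddings[1:]
scores = passage_embs @ query_emb           # one score per passage
ranking = torch.argsort(scores, descending=True)
print([(int(i), float(scores[i])) for i in ranking])
```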
If you have any questions about the code or the paper, feel free to email Yue Yu (yueyu at gatech dot edu) or open an issue. Please describe the problem in detail so we can help you better and more quickly!
If you find this repository helpful, feel free to cite our publication COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning.
```bibtex
@inproceedings{yu2022cocodr,
  title={COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning},
  author={Yue Yu and Chenyan Xiong and Si Sun and Chao Zhang and Arnold Overwijk},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  pages={1462--1479},
  year={2022}
}
```
We would like to thank the authors from ANCE and Condenser for their open-source efforts.