This is the official repo for our WSDM'22 paper, Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval (Best Paper Award).
In this work, we propose RepCONC, which models the quantization process as CONstrained Clustering and trains the dual encoders and the quantization method end to end. Constrained clustering involves a clustering loss and a uniform clustering constraint. The clustering loss requires the embeddings to lie close to the quantization centroids, which supports end-to-end optimization, and the constraint forces the embeddings to be assigned uniformly across all centroids, which maximizes distinguishability. The training process and the clustering constraint are visualized as follows:
[Figures: Training process; Constrained Clustering]
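The two ingredients described above can be illustrated with a small NumPy sketch. It is not the code from this repo: the function names, the `temperature` parameter, and the use of a Sinkhorn-style alternating normalization to approximate the uniform constraint are all assumptions made for illustration. The clustering loss is the mean squared distance between each embedding and its assigned centroid, and the balanced assignment pushes every centroid toward receiving the same share of embeddings.

```python
import numpy as np


def squared_distances(embeddings, centroids):
    """Pairwise squared Euclidean distances, shape (n_embeddings, n_centroids)."""
    diff = embeddings[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)


def clustering_loss(embeddings, centroids, assignment):
    """Mean squared distance between each embedding and its assigned centroid."""
    diff = embeddings - centroids[assignment]
    return float((diff ** 2).sum(axis=1).mean())


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis, keeping dimensions."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def balanced_soft_assignment(embeddings, centroids, temperature=1.0, n_iters=50):
    """Soft assignment whose columns are pushed toward equal mass (Sinkhorn-style).

    Illustrative stand-in for the uniform clustering constraint; hypothetical,
    not the repo's implementation.
    """
    n, k = embeddings.shape[0], centroids.shape[0]
    log_plan = -squared_distances(embeddings, centroids) / temperature
    for _ in range(n_iters):
        # Column step: give every centroid the same total mass, n / k.
        log_plan = log_plan - _logsumexp(log_plan, axis=0) + np.log(n / k)
        # Row step: every embedding distributes a total mass of 1.
        log_plan = log_plan - _logsumexp(log_plan, axis=1)
    return np.exp(log_plan)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(64, 8))
    centroids = rng.normal(size=(4, 8))
    nearest = squared_distances(embeddings, centroids).argmin(axis=1)
    balanced = balanced_soft_assignment(embeddings, centroids).argmax(axis=1)
    print("nearest counts: ", np.bincount(nearest, minlength=4))
    print("balanced counts:", np.bincount(balanced, minlength=4))
    print("loss (nearest): ", clustering_loss(embeddings, centroids, nearest))
    print("loss (balanced):", clustering_loss(embeddings, centroids, balanced))
```

Nearest-centroid assignment always gives the lowest clustering loss for fixed centroids, so the balanced assignment trades a little loss for a more even use of the centroids.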
RepCONC achieves huge compression ratios ranging from 64x to 768x. It supports fast embedding search thanks to the adoption of IVF (inverted file system). With these designs, it outperforms a wide range of first-stage retrieval methods in terms of effectiveness, memory efficiency, and time efficiency. RepCONC also substantially boosts the second-stage ranking performance, as shown below:
Install RepCONC from our code:
```bash
git clone https://github.com/jingtaozhan/RepCONC
cd RepCONC
pip install . --use-feature=in-tree-build # built in-place without first copying to a temporary directory.
```
Two dependencies must be installed manually: PyTorch and Faiss. Both require platform-specific configuration, so they are not listed in the requirements and their installation is left to you.
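Since these two packages are not installed automatically, a quick check of the environment can save a failed run later. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_dependencies` is hypothetical, and the import names `torch` and `faiss` are the usual module names for PyTorch and Faiss.

```python
import importlib.util


def missing_dependencies(modules=("torch", "faiss")):
    """Return the names of modules that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Please install manually:", ", ".join(missing))
    else:
        print("PyTorch and Faiss are both importable.")
```

The check only looks the modules up without importing them, so it does not verify that the installed builds match your hardware, for example CPU versus GPU.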
RepCONC is an easy-to-use toolbox for compressing the index of any dense retrieval model. It jointly optimizes the dense encoders and the index so that high retrieval effectiveness is obtained even with a very compact index. The code separates the design of dense retrieval models from the joint optimization process, so it supports any dense retrieval model, whether or not it is built in!
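To give a sense of how compact such an index can be, the 64x to 768x compression ratios quoted earlier can be reproduced with simple arithmetic. The sketch below assumes 768-dimensional float32 embeddings and a product-quantization code of one byte per sub-vector, with 4 to 48 sub-vectors. None of these settings are stated in this README, so treat them as assumptions, and note that `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(dim=768, bytes_per_float=4, num_subvectors=48, bytes_per_code=1):
    """Ratio between uncompressed and quantized storage for one embedding.

    All default values are assumptions for illustration.
    """
    uncompressed_bytes = dim * bytes_per_float          # e.g. 768 * 4 = 3072 bytes
    compressed_bytes = num_subvectors * bytes_per_code  # one code per sub-vector
    return uncompressed_bytes / compressed_bytes


if __name__ == "__main__":
    for m in (48, 4):
        print(f"{m} sub-vectors -> {compression_ratio(num_subvectors=m):.0f}x")
```

Under these assumptions, fewer sub-vectors give a smaller index and a higher compression ratio, at the cost of a coarser approximation of each embedding.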
Here are several examples of how to use RepCONC to compress the index of different dense retrieval models. These examples are helpful if you want to use RepCONC with your own dense retrieval models. Since RepCONC has several built-in dense retrieval models, it can directly compress the index of many dense models without writing any code. For example:
- Compressing index of Sentence BERT on MS MARCO Passage Ranking
- Compressing index of coCondenser on MS MARCO Passage Ranking
- Compressing index of TAS-Balanced on MS MARCO Passage Ranking
Even if a dense retrieval model is not built in, applying RepCONC to it is easy: make the API of the model class and tokenizer consistent with the built-in ones and you are good to go. For example, ANCE and TCT-ColBERT-v2 have customized model definitions and tokenization. Here is how RepCONC compresses their indexes:
- Compressing index of ANCE on MS MARCO Passage Ranking
- Compressing index of TCT-ColBERT-v2 on MS MARCO Passage Ranking
If you find this repo useful, please consider citing our work:
```bibtex
@inproceedings{zhan2022learning,
  author = {Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
  title = {Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval},
  year = {2022},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3488560.3498443},
  doi = {10.1145/3488560.3498443},
  booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
  pages = {1328--1336},
  numpages = {9},
  location = {Virtual Event, AZ, USA},
  series = {WSDM '22}
}
```