Save the ColBERT encodings to disk. #237

Open · Diegi97 opened this issue Aug 9, 2024 · 2 comments

@Diegi97 · Contributor

Diegi97 commented Aug 9, 2024

I have a use case where I run ColBERT on CPU over a couple thousand documents. For this I don't use PLAID; instead I use the `encode` and `search_encoded_docs` methods, and the search is fast enough. The problem is that encoding all of these documents on CPU takes time, and I don't want to re-encode everything every time I deploy the model, so I developed a way to save and load these encodings:

https://github.com/ChatFAQ/ChatFAQ/blob/cc19e4b85198062888d6320e59276db31461f4e9/chat_rag/chat_rag/retrievers/colbert_retriever.py#L163

If there's interest, I could improve this, integrate it into the `RAGPretrainedModel` or `ColBERT` classes, and open a PR.
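
For illustration, here is a minimal sketch of the save/load idea using `torch.save`/`torch.load`. The internal attribute names used below (`in_memory_embed_docs`, `doc_masks`, `in_memory_collection`) are assumptions about RAGatouille's internals and may differ between versions; treat this as a sketch of the approach, not the linked implementation.

```python
# Sketch: cache ColBERT encodings on disk so CPU deployments don't re-encode.
# NOTE: the attributes accessed on model.model below are assumptions about
# RAGatouille internals and may be named differently in your version.
from pathlib import Path

import torch
from ragatouille import RAGPretrainedModel

ENCODINGS_PATH = Path("colbert_encodings.pt")

model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
documents = ["First document text...", "Second document text..."]

if ENCODINGS_PATH.exists():
    # Restore previously computed encodings instead of re-encoding on CPU.
    state = torch.load(ENCODINGS_PATH, map_location="cpu")
    model.model.in_memory_embed_docs = state["embeddings"]   # hypothetical attr
    model.model.doc_masks = state["masks"]                   # hypothetical attr
    model.model.in_memory_collection = state["collection"]   # hypothetical attr
else:
    model.encode(documents)
    torch.save(
        {
            "embeddings": model.model.in_memory_embed_docs,  # hypothetical attr
            "masks": model.model.doc_masks,                  # hypothetical attr
            "collection": model.model.in_memory_collection,  # hypothetical attr
        },
        ENCODINGS_PATH,
    )

results = model.search_encoded_docs("what is ColBERT?", k=5)
```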

@faezs

faezs commented Aug 9, 2024

I'd like this. I have the same workflow as you and a similar solution, but having it built in would be great. Maybe making it compatible with `overwrite_index` for cache invalidation would also be a good idea?

@bclavie · Collaborator

bclavie commented Aug 12, 2024

This is coming as part of the overhaul I semi-announced on Twitter (just on Twitter, to stay lowkey...).

I have no exact ETA, but these features will be available on the overhaul branch (which isn't installable right now, as it'll crash, but will be very soon) within the next couple of weeks.

If you have just ~2k documents and want to improve latency, the best way forward will most likely be to use the HNSW index that'll ship as the native indexing mechanism for any collection under ~5k documents. It gets performance more or less matching exact search while being quite a bit quicker (a rough sketch of the idea follows below). Otherwise, something pretty similar to your mechanism will be added for loading/saving in-memory encodings.

Thanks for your interest!
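
As a rough illustration of the two-stage approach described above: index every document token embedding in an HNSW graph for fast candidate generation, then score the candidate documents exactly with MaxSim. This is a sketch of the general technique using `hnswlib`, not the actual overhaul-branch implementation; the embedding dimension, `ef_construction`/`M` values, and per-token `k` are illustrative assumptions.

```python
# Sketch: HNSW candidate generation over ColBERT token embeddings, followed by
# exact MaxSim scoring of the candidates. Illustrative only; not the actual
# implementation shipping on the overhaul branch.
import hnswlib
import numpy as np

DIM = 128  # typical ColBERT embedding dimension (assumption)

def build_token_index(doc_embeddings):
    """doc_embeddings: list of (num_tokens_i, DIM) L2-normalized float arrays."""
    flat = np.vstack(doc_embeddings).astype(np.float32)
    # Map every token row back to the document it came from.
    doc_ids = np.concatenate(
        [np.full(len(e), i) for i, e in enumerate(doc_embeddings)]
    )
    index = hnswlib.Index(space="ip", dim=DIM)  # inner product == cosine here
    index.init_index(max_elements=len(flat), ef_construction=200, M=16)
    index.add_items(flat, np.arange(len(flat)))
    return index, doc_ids

def search(index, doc_ids, doc_embeddings, query_embedding, k=5):
    """query_embedding: (num_query_tokens, DIM) L2-normalized array."""
    # Stage 1: approximate NN -- nearest document tokens for each query token.
    labels, _ = index.knn_query(query_embedding.astype(np.float32), k=32)
    candidates = np.unique(doc_ids[labels.ravel()])
    # Stage 2: exact MaxSim over the small candidate set only.
    scores = {
        int(d): float((query_embedding @ doc_embeddings[d].T).max(axis=1).sum())
        for d in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```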
