Is it possible to compress 1 billion sentence embeddings (d=384) in an index under 4.5 GB? What type to use? #3541
Replies: 3 comments
-
IVF65536_HNSW32,PQ32 uses at least 40 GB for 1B vectors (32 bytes for the PQ code + 8 bytes for the vector ID, per vector), so it is unclear where the 14 GB number comes from.
-
Hi @mdouze, thank you so much for getting back to me on this! Is there anything else you can think of that we might try? Thank you again for the help so far!
-
Thanks!
…On Mon, Jun 24, 2024 at 8:14 AM Matthijs Douze ***@***.***> wrote:
IVF65536_HNSW32,PQ32 uses at least 40 GB for 1B vectors (32 bytes for the PQ code + 8 bytes for the vector ID, per vector), so it is unclear where the 14 GB number comes from.
Fitting 1B vectors in 4.5 GB means you allocate 4.5 bytes per sentence, which is very little. This may be possible only if you can group the vectors in a meaningful way (e.g. if there are many small variations of the same vector).
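The arithmetic in the reply above can be checked with a short sketch. The helper name `ivfpq_min_bytes` is hypothetical, not a Faiss function. It assumes 8-bit PQ sub-quantizers, 8-byte vector IDs, and decimal gigabytes (1 GB = 1e9 bytes), and it ignores the coarse quantizer, the HNSW graph, and the PQ codebooks, so the result is only a lower bound.

```python
def ivfpq_min_bytes(n_vectors, pq_m, pq_nbits=8, id_bytes=8):
    """Lower bound on IVF-PQ storage: one PQ code plus one ID per vector.

    Hypothetical helper for illustration only; it ignores centroids,
    codebooks, and any graph structure on the coarse quantizer.
    """
    code_bytes = pq_m * pq_nbits / 8  # PQ32 with 8-bit codes -> 32 bytes
    return n_vectors * (code_bytes + id_bytes)


N = 1_000_000_000  # 1B vectors, as in the question
GB = 1e9           # decimal gigabytes (assumption)

# Lower bound for the PQ32 configuration discussed in the thread.
pq32_gb = ivfpq_min_bytes(N, pq_m=32) / GB

# Bytes available per vector under the 4.5 GB constraint.
budget_bytes_per_vector = 4.5 * GB / N

# Storage taken by the 8-byte IDs alone, before any vector code.
ids_only_gb = N * 8 / GB

print(f"PQ32 lower bound: {pq32_gb:.1f} GB")
print(f"Budget per vector: {budget_bytes_per_vector:.1f} bytes")
print(f"IDs alone: {ids_only_gb:.1f} GB")
```

Under these assumptions the 8-byte IDs alone already exceed the 4.5 GB budget, which is consistent with the remark that 4.5 bytes per vector is very little.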
-
Summary
Hi, thanks to the Faiss team for making this library available!
We have a use case with nearly 1 billion sentence embeddings, each of dimension 384. We need to build an index over all of them, with a hard memory constraint of 4.5 GB on the index size (ideally a little smaller than that, as the dataset grows daily).
From my understanding, an index built with the factory string IVF65536_HNSW32,PQ32 would give us the smallest memory footprint [ref: https://towardsdatascience.com/ivfpq-hnsw-for-billion-scale-similarity-search-89ff2f89d90e], but when I build it, the index size is still ~14 GB.
Is there any other combination we should try? Or is a 4.5 GB index not possible given the size of our vectors and dataset?
Thank you!
Faiss version: faiss-cpu 1.7.3
Installed from: PyPI [https://pypi.org/project/faiss-cpu/]
Running on:
Interface:
Reproduction instructions
NA
*Please also let me know if I should include more details about my Faiss version, environment, etc. Since this is just a general question, I have omitted them for now for brevity.