Indexing a large .jsonl that cannot be loaded in RAM #29
@xhluca Hi, thanks for the nice project. I have a large .jsonl file where each line has a […]. But your implementation seems to require all ids for […]
Replies: 2 comments 3 replies
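The corpus you are passing to BM25 is optional; it is only used during retrieval and when you are saving/loading the model. You can just initialize without the corpus.
That said, if you still want to pass the jsonl file to the retriever's initialization call but don't have enough memory to load it, you can take a look at the bm25s.utils.corpus.JsonlCorpus class (bm25s/bm25s/utils/corpus.py, lines 101 to 184 in 73c7dea), which lets you read a jsonl dynamically through memory mapping. Note, however, that before it does that, it needs to index the jsonl and create a […]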
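To illustrate the general idea of indexing a jsonl once and then reading single lines on demand through memory mapping, here is a minimal standard-library sketch. It is not the bm25s implementation: build_line_offsets and read_record are hypothetical helper names, and the three sample records with an "id" field are made up for the demo.

```python
import json
import mmap
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and record the byte offset where each non-empty line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:  # streams one line at a time, so memory use stays small
            if line.strip():
                offsets.append(position)
            position += len(line)
    return offsets


def read_record(mm, offsets, i):
    """Jump to the i-th line in the memory-mapped file and parse only that line."""
    mm.seek(offsets[i])
    return json.loads(mm.readline())


# Demo data (hypothetical): a tiny jsonl file with a custom "id" per line.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "corpus.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for doc_id, text in [("doc-a", "a cat"), ("doc-b", "a dog"), ("doc-c", "a bird")]:
        f.write(json.dumps({"id": doc_id, "text": text}) + "\n")

offsets = build_line_offsets(path)

# Map the file instead of loading it; only the requested line is parsed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    record = read_record(mm, offsets, 2)
    mm.close()

print(record["id"])  # prints "doc-c"
```

With an offset list like this, an integer position returned by a retriever can be turned back into the original record, including its custom id, without ever holding the whole file in RAM.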
@xhluca This is really cool, and thanks for your prompt response! One quick question: I have tried to create a new […] I mean, it will use default ids assigned by the […]