Indexing a large .jsonl that cannot be loaded in RAM #29

Answered by xhluca
liyucheng09 asked this question in Q&A
The corpus you pass to BM25 is optional; it is only used during retrieval and when saving or loading the model. You can initialize the retriever without it:

import bm25s

retriever = bm25s.BM25()

That said, if you still want to pass the jsonl file to the retriever's initialization call but don't have enough RAM to hold it, take a look at the bm25s.utils.corpus.JsonlCorpus class, which reads a jsonl file on demand through memory mapping:

bm25s/bm25s/utils/corpus.py

Lines 101 to 184 in 73c7dea

class JsonlCorpus:
    """
    A class to read a jsonl file line by line using mmap, allowing extremely fast
    access to any line in the file. For example, y…
