total_terms_in_collection != sum of doclengths in robust04 queries only #21
Comments
I've confirmed this. Could this be due to Lucene's doclength approximation? I need to dig deeper into this, though.
I wrote a simple program to probe into this, and this is indeed the case:
So, not a bug; it just requires better documentation. I will add documentation as appropriate.
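To illustrate how a lossy length encoding alone can produce a gap of this kind, here is a minimal sketch. The truncation to 4 significant bits is an assumed toy scheme chosen for illustration, not Lucene's actual encoder, and the document lengths are made up.

```python
def truncate_length(n: int, keep_bits: int = 4) -> int:
    """Toy lossy encoding: keep only the top `keep_bits` significant bits of n.

    This is an illustrative assumption, not the encoding Lucene uses.
    """
    if n < 0:
        raise ValueError("length must be non-negative")
    shift = max(n.bit_length() - keep_bits, 0)
    return (n >> shift) << shift


# Hypothetical document lengths, for illustration only.
exact_lengths = [12, 57, 230, 481, 1033]
approx_lengths = [truncate_length(n) for n in exact_lengths]

exact_total = sum(exact_lengths)
approx_total = sum(approx_lengths)

print("exact: ", exact_lengths, "sum =", exact_total)
print("approx:", approx_lengths, "sum =", approx_total)
print("shortfall:", exact_total - approx_total)
```

Because this toy scheme always rounds down, the sum of the approximated lengths can never exceed the exact total, which matches the direction of the discrepancy reported in this issue.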
If we are asking people to use a total number of tokens, shouldn't it be accurate? The doclengths in the posting lists are accurate.
Well, I mean, the export is an accurate snapshot of the index? Actually, the doclengths in the postings are the lossy approximations...
Sorry, let me rephrase: are the doc lengths in the DocRecord part of the ciff file lossy approximations, or are they accurate?
The doclengths recorded in the
Ok, I got it; so the doclengths in the
There is a question of explicability: are we trying to produce an index of record, or just trying to reproduce Lucene/Anserini's index in our own systems? Only in the latter case does it make sense to keep approximations in the CIFF. The explanations in the CIFF standard shouldn't refer to any approximate values; de facto, it's an implementation choice of the Lucene/Anserini CIFF exporter to expose approximate values, i.e. something for a readme for the generated CIFF files.
Agreed on both counts.
For Robust04 queries only, the sum of the doclens is 167686911, while total_terms_in_collection=174540872 in the ciff file. Why is it larger? This affects the avgdoclength, and hence the BM25 scores.
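As a rough sketch of how much this matters, the snippet below compares the two average document lengths and the resulting BM25 term-frequency component. The two totals are the ones reported above; the document count, the `k1` and `b` values, and the example `tf` and `doc_len` are assumptions for illustration and do not come from this thread.

```python
SUM_OF_DOCLENS = 167_686_911             # reported above
TOTAL_TERMS_IN_COLLECTION = 174_540_872  # reported above
NUM_DOCS = 528_155  # assumption: document count, not stated in this thread


def bm25_tf_component(tf, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Length-normalised tf part of BM25 (IDF omitted); k1 and b are assumed values."""
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (tf + k1 * norm)


avgdl_from_sum = SUM_OF_DOCLENS / NUM_DOCS
avgdl_from_total = TOTAL_TERMS_IN_COLLECTION / NUM_DOCS
ratio = TOTAL_TERMS_IN_COLLECTION / SUM_OF_DOCLENS

# Hypothetical term occurrence: tf = 3 in a document of 300 tokens.
score_sum = bm25_tf_component(3, 300, avgdl_from_sum)
score_total = bm25_tf_component(3, 300, avgdl_from_total)

print(f"avgdl from sum of doclens:  {avgdl_from_sum:.2f}")
print(f"avgdl from collection total: {avgdl_from_total:.2f}")
print(f"ratio of totals: {ratio:.4f}")
print(f"tf component: {score_sum:.4f} vs {score_total:.4f}")
```

The ratio of the two totals does not depend on the assumed document count, so the relative difference in avgdoclength (about 4%) holds regardless of that assumption.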