LoCo Benchmark - BM25 & Insights #23
Comments
Interesting, this is a really great analysis! We also noticed this and have been working on an update to the benchmark (LoCoV1). We haven't put it out yet but will do soon (and add this as a great baseline). |
Thank you for sharing @calpt! If you have an evaluation script for BM25 available, I'd love to take a look and try it out on our new evaluation datasets. |
+1, would love to see the script @calpt! The scores are a good bit higher than when we ran BM25 internally, so we would love to see if we did something wrong! |
Sure, I basically just took your loco_eval.py script, removed everything but the data loading, plugged in the BM25 implementation & eval of BEIR (roughly like this: https://gist.github.com/calpt/56d0d47724a061c4a7bd4a9a8fd990d2) and spun up a local ES docker container. Looking forward to LoCo v1! |
Great, we’ll take a look! CC @jonsaadfalcon
|
Hi Elliott, thanks for the interest!
We have an updated LoCoV1 described in the arXiv paper (https://arxiv.org/abs/2402.07440v2). We will have it on HF with updated checkpoints soon (we ran into ICLR rebuttals before we got a chance to clean it up for upload).
If you DM/email me and Jon, we can try to share access to the private HF dataset?
On Thu, Mar 28, 2024 at 9:57 AM Elliott Choi wrote:
Hey @DanFu09, would love to know if you have an update on this!
Our team at Cohere will likely report on an adjusted version of QMSum (what @calpt described above).
|
Hello @DanFu09! I found this benchmark quite exciting and was wondering if you got the chance to upload the newer version to HuggingFace. |
@iNeil77 here you go, Jon's tweet and blog have links: https://x.com/JonSaadFalcon/status/1792623213698232808 |
Hey, thanks for sharing this very interesting work!
I was interested in the recent LoCo benchmark for long-context retrieval and wanted results for a simple lexical baseline first, to put the scores in the blog post into context. Since the blog post did not include one, I ran BM25 (via ElasticSearch) on all benchmark tasks, based on your eval script. Full results, compared with the best-performing M2-BERT-32768 (80M), are below (NDCG@10 for all).
BM25
BM25 seems to be very competitive on LoCo, coming close to the best model tested in the post's evaluation and outperforming all other tested embedding models. Thus, lexical overlap between queries and correct documents seems to be very high on the benchmark tasks.
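To make the lexical-overlap point concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the ElasticSearch implementation used for the numbers above: the whitespace tokenizer, the `k1=1.2` and `b=0.75` defaults, the function names, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real analyzer does more.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue  # only terms shared by query and document contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up): only the first document shares words with the query.
docs = [
    "the team discussed the remote control design",
    "budget planning for the next quarter",
    "the marketing expert presented user survey results",
]
scores = bm25_scores("remote control design", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because only terms that occur in both the query and the document add to the score, a benchmark where queries reuse the wording of their target documents will favour this kind of scorer.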
QMSum Analysis
Looking a bit closer at the results, we can see that for 4 of 5 tasks, NDCG is well above 90, meaning that BM25 is nearly perfectly able to retrieve the correct documents. The only exception is QMSum, so I looked into its data a bit closer:
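For reference, the metric can be sketched as follows. This is a minimal NDCG@k with binary relevance, not the BEIR evaluation code used for the reported numbers; the function name and toy document IDs are hypothetical. It assumes each query has a small set of relevant documents (in the setup discussed here, one per query), in which case the score depends only on the rank of that document.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance for a single query."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant document per query: the score depends only on its rank.
top = ndcg_at_k(["d1", "d2", "d3"], {"d1"})     # relevant doc at rank 1
second = ndcg_at_k(["d2", "d1", "d3"], {"d1"})  # relevant doc at rank 2
missing = ndcg_at_k(["d2", "d3"], {"d1"})       # relevant doc not retrieved
```

With a single relevant document, rank 1 gives 1.0 and rank 2 already drops to about 0.63, so an average above 90 (on the 0 to 100 scale) requires the correct document at rank 1 for most queries.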
Originally, QMSum is a summarization dataset consisting of three parts: a corpus of 232 long meeting transcripts, a set of 272 questions, and 272 query-based summaries of the transcripts. In the tau/scrolls format, questions and transcripts are joined together in the "input" field, whereas summaries are given in the "output" field. This gives 272 input-output pairs. LoCo now simply uses "output" as the query and "input" as the document, giving 272 queries and 272 documents.
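The mapping described above can be sketched roughly like this. It is an illustration, not the LoCo data-loading code: the `"id"` field, the blank-line separator between question and transcript, the `q_`/`d_` ID prefixes, and the record contents are assumptions.

```python
# Toy records in the tau/scrolls layout described above; all contents are made up.
scrolls_records = [
    {"id": "r0",
     "input": "What was decided about the budget?\n\nPM: Welcome everyone ...",
     "output": "The team agreed to cap the budget."},
    {"id": "r1",
     "input": "Summarize the discussion on design.\n\nPM: Welcome everyone ...",
     "output": "The group favoured a curved case."},
    {"id": "r2",
     "input": "What did the committee conclude?\n\nChair: Order, please ...",
     "output": "The committee postponed the vote."},
]


def build_retrieval_task(records):
    """Turn input/output pairs into queries, a corpus, and relevance labels."""
    queries, corpus, qrels = {}, {}, {}
    for record in records:
        query_id = "q_" + record["id"]
        doc_id = "d_" + record["id"]
        queries[query_id] = record["output"]  # the summary becomes the query
        corpus[doc_id] = record["input"]      # question + transcript becomes the document
        qrels[query_id] = {doc_id: 1}         # each query has exactly one relevant document
    return queries, corpus, qrels


queries, corpus, qrels = build_retrieval_task(scrolls_records)
```

In this toy data, `d_r0` and `d_r1` share the same transcript and differ only in the leading question.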
This means that in the LoCo document corpus for QMSum, multiple documents are based on the same long meeting transcript, each paired with a different question. E.g., the first 4 documents are:
The truncated part is identical in all four, meaning that the vast majority of each document (9748 words on average) is identical apart from the question stated in the first few words. For distinguishing between the documents within such a group, only the first few words are therefore relevant.
As an ablation, I removed the questions at the start of all documents and "merged" the resulting identical documents into one and then ran BM25 again. This improves NDCG@10 to 78.7.
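One way such an ablation could look in code is sketched below. This is my reading of the steps, not the script that produced the 78.7 figure: it assumes the question is separated from the transcript by a blank line, and the function names, the `t0`, `t1`, ... ID scheme, and the toy data are hypothetical.

```python
def strip_question(document, separator="\n\n"):
    # Assumption: the question precedes the transcript, separated by a blank line.
    _question, found, transcript = document.partition(separator)
    return transcript if found else document


def merge_duplicates(corpus, qrels):
    """Drop leading questions, merge identical transcripts, and remap qrels."""
    text_to_id = {}     # transcript text -> merged document ID
    old_to_new = {}     # original document ID -> merged document ID
    merged_corpus = {}
    for doc_id, text in corpus.items():
        transcript = strip_question(text)
        if transcript not in text_to_id:
            new_id = f"t{len(text_to_id)}"
            text_to_id[transcript] = new_id
            merged_corpus[new_id] = transcript
        old_to_new[doc_id] = text_to_id[transcript]
    merged_qrels = {
        query_id: {old_to_new[d]: rel for d, rel in rels.items()}
        for query_id, rels in qrels.items()
    }
    return merged_corpus, merged_qrels


# Toy data (made up): d0 and d1 share a transcript, d2 has its own.
corpus = {
    "d0": "Question one?\n\nshared transcript",
    "d1": "Question two?\n\nshared transcript",
    "d2": "Question three?\n\nother transcript",
}
qrels = {"q0": {"d0": 1}, "q1": {"d1": 1}, "q2": {"d2": 1}}
merged_corpus, merged_qrels = merge_duplicates(corpus, qrels)
```

After merging, several queries point at the same transcript, so retrieval has to match the summary against the transcript content rather than against the question at the start of the document.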
Just wanted to share these quick insights into the LoCo benchmark, maybe this is useful to someone!