[MIEB] performance of VOC2007 #1792
Comments
I'll take a look. Since I haven't found reported scores for these models on VOC2007, I'm going to check whether the dataset itself is correct, and first see whether the current VOC2007 matches CLIP benchmark numbers on the OpenAI CLIP models. There are also 2 different multilabel datasets:
Do you see any major discrepancies for VLM2Vec-lora or Voyage multi-modal on other tasks?
This could be related to #1420, where the …
It's not entirely 1:1 with the CLIP benchmark since it's not zero-shot. If VLM2Vec-full gets 70%+, does that mean `samples_per_label=64` or something?
Scores are here: The main results here were all run without changing `samples_per_label`, I think! I reran E5-v and Voyage locally and it's still 70% vs. 20%.
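For context, `samples_per_label` caps how many training examples feed the linear probe, so a different effective cap could plausibly explain a 70% vs. 20% swing. A minimal sketch of one greedy capping scheme (the function name and selection logic are assumptions for illustration, not MTEB's actual sampling code):

```python
import numpy as np

def subsample_per_label(X: np.ndarray, y: np.ndarray,
                        samples_per_label: int, seed: int = 42):
    """Greedily keep training rows until every label column of the
    multilabel indicator matrix `y` has `samples_per_label` positives.
    A sketch for debugging, not MTEB's actual implementation."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(y.shape[1], dtype=int)
    keep = []
    for i in rng.permutation(len(X)):
        labels = np.flatnonzero(y[i])
        # keep the row if any of its labels still needs more examples
        if np.any(counts[labels] < samples_per_label):
            keep.append(i)
            counts[labels] += 1
    return X[keep], y[keep]
```

Running the probe with a couple of different caps and comparing scores would show quickly whether the metric is sensitive to this setting.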
Didn't spot any major discrepancies on other tasks when putting together the per-task-type results!
Ah sorry. I meant I haven't found reported scores by other papers on VOC2007 multilabel for E5-v and Voyage 😅
Right, I should've got what you meant 😅. I guess there won't be any, as the models are too new. I think it might be beneficial to debug with per-class accuracy etc. (without using the abstask and the evaluator) to see which step might've gone wrong?
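A minimal sketch of that kind of per-class check, done directly on prediction arrays outside the abstask/evaluator path (the helper name and toy arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def per_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Accuracy for each label column of a multilabel problem.

    y_true, y_pred: binary indicator arrays of shape (n_samples, n_labels).
    """
    return np.array([
        accuracy_score(y_true[:, j], y_pred[:, j])
        for j in range(y_true.shape[1])
    ])

# toy example: 3 samples, 2 labels (made up, not VOC2007 data)
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [0, 0], [1, 1]])
print(per_class_accuracy(y_true, y_pred))  # label 0: 3/3, label 1: 2/3
```

If one model collapses on a handful of classes while the other doesn't, that would point at the predictions rather than the evaluator.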
Odd - I'm getting ~22% for both VLM2Vec-full and VLM2Vec-lora when …
In short, I was able to get 72% for both lora and full (full slightly higher than lora) when …
Am just finalizing MIEB results for the overall table in the paper and find the VOC2007 results a bit strange: https://github.com/embeddings-benchmark/tmp/tree/master (e.g., E5-v gets 70%+ `lrap` while Voyage multi-modal gets ~20%, even though these two have a very similar performance trend on all other 120+ tasks, with Voyage multi-modal typically slightly better; likewise, VLM2Vec-full gets 70%+ and VLM2Vec-lora gets ~20%). Since this is the only multi-label classification task, I am trying to merge it into the regular linear-probe task type in the leaderboard table. This is pretty much the last task that needs a look if anyone has bandwidth!
cc @isaac-chung
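For reference, the `lrap` above is label ranking average precision, which scores how highly a model ranks each true label among all labels. A quick sanity check on a few samples can be done with scikit-learn (the arrays below are made up, not VOC2007 data):

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

# toy multilabel ground truth and model scores (made up, not VOC2007)
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 1, 0]])
y_score = np.array([[0.75, 0.50, 1.00],
                    [1.00, 0.20, 0.10],
                    [0.40, 0.90, 0.20]])
print(label_ranking_average_precision_score(y_true, y_score))
```

Running this on a small slice of each model's raw scores would confirm whether the 70% vs. 20% gap is already present before any of the evaluator's aggregation steps.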