
[MIEB] performance of VOC2007 #1792

Open
gowitheflow-1998 opened this issue Jan 13, 2025 · 8 comments
Labels
mieb The image extension of MTEB


@gowitheflow-1998
Contributor

I am just finalizing the MIEB results for the overall table in the paper, and the VOC2007 results look a bit strange: https://github.com/embeddings-benchmark/tmp/tree/master. For example, E5-v gets 70%+ lrap while voyage-multimodal gets about 20%, even though these two show very similar performance trends on all 120+ other tasks (voyage-multimodal is typically slightly better). Likewise, VLM2Vec-full gets 70%+ while VLM2Vec-lora gets ~20%.

Since this is the only multi-label classification task, I am trying to merge it into the regular linear-probe task type in the leaderboard table. This is pretty much the last task that needs a look, if anyone has bandwidth!

cc @isaac-chung

@gowitheflow-1998 gowitheflow-1998 added the mieb The image extension of MTEB label Jan 13, 2025
@isaac-chung
Collaborator

I'll take a look. Since I haven't found scores for these models on VOC2007, I'll first check whether the dataset itself is correct, i.e. whether the current VOC2007 matches CLIP benchmark numbers on OpenAI CLIP models.

There are also 2 different multilabel datasets:

Do you see any major discrepancies for VLM2vec-lora or voyage multi-modal on other tasks?

@isaac-chung
Collaborator

isaac-chung commented Jan 14, 2025

This could be related to #1420, where the samples_per_label were too low to observe scaling laws. On openai/clip-vit-base-patch32:

| samples_per_label | 8 | 16 | 32 |
|---|---|---|---|
| map | 0.660 | 0.757 | 0.801 |
| lrap | 0.660 | 0.757 | 0.801 |

It's not entirely 1:1 with CLIP benchmark since it's not zero-shot.

If VLM2Vec-full gets 70%+, does that mean samples_per_label=64 or something?
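To make the scaling effect concrete, here is a toy sketch of a multilabel linear probe whose training set is capped at `samples_per_label` examples per class, scored with LRAP. This is not MIEB's actual evaluator; the synthetic embeddings and the `cap_per_label` helper are hypothetical stand-ins, assuming the probe is an ordinary one-vs-rest logistic regression on frozen features:

```python
# Toy sketch: cap the probe's training data at `samples_per_label` positives
# per class, then score with LRAP. Synthetic features stand in for frozen
# image embeddings; this is NOT the MIEB evaluator itself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import label_ranking_average_precision_score

rng = np.random.default_rng(0)

def cap_per_label(X, y, samples_per_label):
    """Keep at most `samples_per_label` positive examples per class."""
    keep = set()
    for c in range(y.shape[1]):
        positives = np.flatnonzero(y[:, c])
        keep.update(positives[:samples_per_label].tolist())
    idx = sorted(keep)
    return X[idx], y[idx]

# Synthetic, linearly separable multilabel data: 20 classes, 64-dim features.
n_train, n_test, n_classes, dim = 2000, 500, 20, 64
W = rng.normal(size=(n_classes, dim))
X_train = rng.normal(size=(n_train, dim))
X_test = rng.normal(size=(n_test, dim))
y_train = (X_train @ W.T > 8.0).astype(int)  # ~16% positive rate per class
y_test = (X_test @ W.T > 8.0).astype(int)

results = {}
for spl in (8, 64):
    Xs, ys = cap_per_label(X_train, y_train, spl)
    probe = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xs, ys)
    scores = probe.decision_function(X_test)
    results[spl] = label_ranking_average_precision_score(y_test, scores)
    print(f"samples_per_label={spl}: lrap={results[spl]:.3f}")
```

With the cap at 8 the probe sees only a fraction of the data and LRAP is correspondingly lower, which is consistent with the table above showing lrap rising as samples_per_label grows.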

@gowitheflow-1998
Contributor Author

> I'll take a look. Since I haven't found scores for these models on VOC2007, I'll first check whether the dataset itself is correct, i.e. whether the current VOC2007 matches CLIP benchmark numbers on OpenAI CLIP models.

Scores are here:
Voyage: https://github.com/embeddings-benchmark/tmp/blob/master/voyage-multimodal-3/1/VOC2007.json
E5-v: https://github.com/embeddings-benchmark/tmp/blob/master/royokong__e5-v/0c1f22679417b3ae925d779442221c40cd1861ab/VOC2007.json
and similarly for VLM2Vec.

The main results here were all run without changing samples_per_label, I think! I reran E5-v and Voyage locally and it's still 70% vs. 20%.

@gowitheflow-1998
Contributor Author

> Do you see any major discrepancies for VLM2vec-lora or voyage multi-modal on other tasks?

Didn't spot any major discrepancies on other tasks when putting together the per-task-type results!

@isaac-chung
Collaborator

> Since I haven't found scores for these models on VOC2007

Ah sorry. I meant I haven't found reported scores by other papers on VOC2007 multilabel for e5v and voyage 😅

@gowitheflow-1998
Contributor Author

> Ah sorry. I meant I haven't found reported scores by other papers on VOC2007 multilabel for e5v and voyage 😅

Right, I should've gotten what you meant 😅. I guess there won't be any, as the models are too new.

I think it might help to debug with per-class accuracy etc. (without going through the abstask and the evaluator) to see which step might've gone wrong?
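A minimal sketch of the kind of standalone per-class check suggested here, bypassing the abstask and evaluator entirely. It assumes you already have the gold labels as a binary indicator matrix and per-class scores from each model; the helper name, class names, and toy data are illustrative:

```python
# Standalone per-class breakdown for a multilabel task: given binary gold
# labels and per-class model scores, report support and average precision
# per class to spot where two models diverge. Toy data; names illustrative.
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_report(y_true, y_score, class_names):
    """Return {class_name: average precision}, printing one line per class."""
    report = {}
    for c, name in enumerate(class_names):
        ap = average_precision_score(y_true[:, c], y_score[:, c])
        report[name] = ap
        print(f"{name:10s} n_pos={int(y_true[:, c].sum()):4d} AP={ap:.3f}")
    return report

# Toy example: 3 classes, mildly informative scores with class overlap.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 3))
y_score = 0.3 * y_true + rng.random((200, 3))
report = per_class_report(y_true, y_score, ["person", "cat", "dog"])
```

Running the same report on the 70% model and the 20% model side by side would show whether the gap is uniform across classes (pointing at a pipeline-level issue like score scaling) or concentrated in a few classes (pointing at label or data issues).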

@isaac-chung
Collaborator

isaac-chung commented Jan 16, 2025

Odd - I'm getting ~22% for both VLM2Vec-full and VLM2Vec-lora when samples_per_label=8. See the tagged PR below for progress.

@isaac-chung
Collaborator

In short, I was able to get 72% for both lora and full (full slightly higher than lora) when samples_per_label=64.
