
Question About Frozen Text Encoder #7

Open

kimwongyuda opened this issue Oct 11, 2024 · 7 comments

@kimwongyuda

I am really inspired by your nice work, and thank you for it.

My question is: why is the text encoder frozen during training?

When I fine-tune the VISTA model on another dataset such as M-BEIR, the results without freezing are better than those with a frozen text encoder.

I just wonder about your intention in freezing the text encoder.

Thank you.

@JUNJIE99
Owner

JUNJIE99 commented Oct 11, 2024

Hi,

Thank you for your interest in our work!

The primary motivation behind VISTA is to enhance a pre-trained text encoder with visual capabilities while preserving its strong text retrieval performance. We believe that the quality of text embeddings is crucial for (multimodal) dense retrieval, particularly in tasks involving multimodal document retrieval with significant textual content, such as the WebQA and ReMuQ benchmarks.

In developing VISTA, our main concern has been its generalization capability, particularly its zero-shot performance. We have also observed that for specific tasks, not freezing the text encoder can yield better results. Consequently, in our paper, we have kept the text encoder unfrozen for downstream fine-tuning (Section 4.2). However, we think that fine-tuning the text encoder on specific datasets might compromise VISTA's inherent text retrieval capabilities derived from the BGE model.
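Concretely, the two regimes differ only in whether the text encoder's parameters receive gradient updates. Here is a minimal PyTorch sketch; `model.text_encoder` is a hypothetical attribute name for illustration, not necessarily VISTA's actual module layout:

```python
import torch

def set_text_encoder_frozen(model: torch.nn.Module, frozen: bool = True) -> None:
    """Toggle gradient updates for a (hypothetical) text encoder submodule."""
    for param in model.text_encoder.parameters():
        param.requires_grad = not frozen
    # Keep dropout / normalization behavior fixed while frozen.
    model.text_encoder.train(mode=not frozen)

# Pre-training with visual capabilities: freeze to preserve text retrieval quality.
# set_text_encoder_frozen(model, frozen=True)
# Downstream fine-tuning (Section 4.2): unfreeze for task-specific gains.
# set_text_encoder_frozen(model, frozen=False)
```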

Regarding your application scenario: although M-BEIR encompasses various retrieval contexts, it includes corresponding training data, which makes its downstream tasks in-domain tests. Therefore, I believe it is quite reasonable that performance improves when the text encoder is not frozen.

Lastly, I greatly appreciate your efforts in testing VISTA on M-BEIR. I would love to know if you could share your fine-tuning results with me. I am also very curious about VISTA's performance on downstream tasks within M-BEIR.

Thank you!

@kimwongyuda
Author

Thank you! Here are the results.
The metric is Recall@10 for Fashion200k and FashionIQ, and Recall@5 for the others (a sketch of how Recall@k is computed follows the table).

| Task | VISTA (not FT) | VISTA (FT, frozen) | VISTA (FT, not frozen) |
|---|---|---|---|
| VisualNews (T -> I) | 0.0013 | 0.019 | 0.0951 |
| MSCOCO (T -> I) | 0.0059 | 0.4731 | 0.601 |
| Fashion200k (T -> I) | 0.0006 | 0.1001 | 0.1518 |
| WebQA (T -> T) | 0.7572 | 0.758 | 0.8037 |
| EDIS (T -> TI) | 0.2049 | 0.3505 | 0.3993 |
| WebQA (T -> TI) | 0.5257 | 0.6233 | 0.7208 |
| VisualNews (I -> T) | 0.0001 | 0.0218 | 0.1126 |
| MSCOCO (I -> T) | 0.005 | 0.3852 | 0.8388 |
| Fashion200k (I -> T) | 0 | 0.0636 | 0.152 |
| NIGHTS (I -> I) | 0.2212 | 0.3104 | 0.3014 |
| OVEN (TI -> T) | 0.0024 | 0.0056 | 0.3298 |
| InfoSeek (TI -> T) | 0.0009 | 0.0184 | 0.217 |
| FashionIQ (TI -> I) | 0.1556 | 0.0596 | 0.1794 |
| CIRR (TI -> I) | 0.1612 | 0.1597 | 0.3153 |
| OVEN (TI -> TI) | 0.4204 | 0.4625 | 0.5169 |
| InfoSeek (TI -> TI) | 0.3065 | 0.3367 | 0.4259 |
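For reference, here is a minimal sketch of how Recall@k is typically computed for retrieval results; this is my own illustration, not the official M-BEIR evaluation code.

```python
from typing import Sequence, Set

def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                relevant_ids: Sequence[Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_ids)

# Example: Recall@5 over two queries, where only the first query has a hit.
# recall_at_k([["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]],
#             [{"c"}, {"q"}], k=5)  # -> 0.5
```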

@JUNJIE99
Owner

Many thanks for sharing these results; I really appreciate it!

I have another question: Did you use the instruction method mentioned in UniIR for your fine-tuning?

@kimwongyuda
Author

kimwongyuda commented Oct 11, 2024

Yes, I used instructions in both training and evaluation.
However, the instructions in UniIR are fixed and static for each task, while the instructions in VISTA are more varied, because they depend on each instance rather than each task (see the toy illustration below).

I think the definitions of "instruction" differ between UniIR and VISTA.

So I suspect this difference is the reason VISTA's zero-shot results are not good on the M-BEIR evaluation with instructions.
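To make the distinction concrete, here is a toy illustration; the strings and names below are invented for clarity, not the actual prompts from either UniIR or VISTA.

```python
# UniIR-style: one fixed instruction per task, shared by every instance.
uniir_task_instruction = {
    "mscoco_t2i": "Find an image that matches the given caption.",
    "webqa_t2t":  "Retrieve a passage that answers the question.",
}

# VISTA-style: the instruction varies per instance, e.g. composed from the
# query itself, so no two instances need to share the same string.
def vista_instance_instruction(query: str) -> str:
    return f"Retrieve the target relevant to: {query}"
```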

@JUNJIE99
Owner

Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

@kimwongyuda
Author

> Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

Yes they were!

@JUNJIE99
Owner

Thank you for your response. It appears that VISTA continues to show outstanding zero-shot performance on M-BEIR compared to UniIR.

Regarding the fine-tuning of VISTA with instructions, I believe it has the potential to achieve even better results due to its early fusion of image and text tokens. This is definitely worth exploring further.
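For intuition, here is a rough sketch of what early fusion means in this context: image patch embeddings are concatenated with text token embeddings before a shared transformer encoder, so self-attention mixes the two modalities from the first layer onward. The shapes and modules below are illustrative, not VISTA's actual implementation.

```python
import torch

D = 768  # shared embedding dimension (illustrative)
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)

image_tokens = torch.randn(1, 196, D)  # e.g. 14x14 ViT patch embeddings
text_tokens = torch.randn(1, 32, D)    # embedded query tokens
fused = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 228, D)
output = encoder(fused)  # attention jointly attends over both modalities
```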

I greatly appreciate the results you've shared and our ongoing discussions. These insights are incredibly valuable to me.

If you have any further questions, I am always open to more discussions.

Thank you for your time.
