Question About Frozen Text Encoder #7
I am really inspired by your work, and thank you for it.
My question is: why is the text encoder frozen during training?
When I fine-tune the VISTA model on another dataset, such as M-BEIR, the results without freezing the text encoder are better than those with it frozen.
I am just wondering about your intention behind freezing the text encoder.
Thank you.
Comments
Hi, thank you for your interest in our work!

The primary motivation behind VISTA is to enhance a pre-trained text encoder with visual capabilities while preserving its strong text retrieval performance. We believe that the quality of text embeddings is crucial for (multimodal) dense retrieval, particularly in tasks where the multimodal documents carry significant textual content, such as the WebQA and ReMuQ benchmarks.

In developing VISTA, our main concern has been its generalization capability, particularly its zero-shot performance. We have also observed that for specific tasks, not freezing the text encoder can yield better results. Consequently, in our paper we kept the text encoder unfrozen for downstream fine-tuning (Section 4.2). However, we think that fine-tuning the text encoder on specific datasets might compromise VISTA's inherent text retrieval capabilities derived from the BGE model.

Regarding your application scenario: while M-BEIR encompasses various retrieval contexts, it also provides training data that makes its downstream tasks in-domain tests. Therefore, I believe it is quite reasonable that performance improves when the text encoder is not frozen.

Lastly, I greatly appreciate your effort in testing VISTA on M-BEIR. I would love to know if you could share your fine-tuning results with me; I am also very curious about VISTA's performance on downstream tasks within M-BEIR. Thank you!
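For anyone reproducing the two settings discussed here, a minimal PyTorch sketch of the freeze/unfreeze toggle is shown below. The attribute name `model.text_encoder` is a placeholder rather than the actual VISTA/FlagEmbedding module path; adapt it to however your checkpoint exposes the BGE text encoder.

```python
import torch

def set_text_encoder_frozen(model: torch.nn.Module, frozen: bool = True) -> None:
    """Toggle gradient updates for the text encoder submodule.

    `model.text_encoder` is a placeholder attribute name; substitute whatever
    submodule wraps the BGE text encoder in your checkpoint.
    """
    for param in model.text_encoder.parameters():
        param.requires_grad = not frozen

# Setting 1: frozen text encoder -- only the visual side (and any projection
# layers) receives gradients, preserving BGE's original text retrieval ability.
# set_text_encoder_frozen(model, frozen=True)

# Setting 2: unfrozen text encoder -- all parameters are updated, which tends
# to help on in-domain data such as M-BEIR.
# set_text_encoder_frozen(model, frozen=False)

# Build the optimizer over trainable parameters only:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=2e-5
# )
```

Freezing keeps BGE's original text retrieval behaviour intact, while the unfrozen setting lets the whole model adapt to in-domain data such as M-BEIR, matching the trade-off described above.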
Thank you! I am sharing my results for three configurations: VISTA (not fine-tuned), VISTA (fine-tuned with the text encoder frozen), and VISTA (fine-tuned without freezing the text encoder).
Many thanks for sharing these results; I really appreciate it! I have another question: did you use the instruction method mentioned in UniIR for your fine-tuning?
Yes, I used instructions in both training and evaluation. I think the definition of an instruction differs between UniIR and VISTA, so I expect that difference is the reason VISTA's zero-shot results are not good on the M-BEIR evaluation with instructions.
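To make the instruction point concrete: UniIR-style training prepends a natural-language task instruction to each query before it is embedded. The sketch below illustrates that step; the instruction wording and the `encode_multimodal` helper are hypothetical placeholders, not the exact M-BEIR prompts or the VISTA API.

```python
def build_instructed_query(instruction: str, query_text: str) -> str:
    """Prepend a UniIR-style task instruction to the raw query text."""
    return f"{instruction} {query_text}".strip()

# Example-style instruction (illustrative wording, not the exact M-BEIR prompt):
instruction = "Retrieve a Wikipedia paragraph that answers this question."
raw_query = "Who designed the Eiffel Tower?"

instructed_query = build_instructed_query(instruction, raw_query)
print(instructed_query)

# At fine-tuning and evaluation time, the instructed query (optionally paired
# with a query image) is what gets embedded, e.g.:
# q_emb = encode_multimodal(text=instructed_query, image=query_image)  # hypothetical helper
```

A checkpoint that never saw such prefixed queries during training may embed them poorly, which is consistent with the gap in zero-shot performance noted above.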
Thank you. I would like to confirm once more: were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?
Yes, they were!
Thank you for your response. It appears that VISTA continues to show outstanding zero-shot performance on M-BEIR compared to UniIR. As for fine-tuning VISTA with instructions, I believe it has the potential to achieve even better results thanks to its early fusion of image and text tokens, which is definitely worth exploring further.

I greatly appreciate the results you've shared and our ongoing discussions; these insights are incredibly valuable to me. If you have any further questions, I am always open to more discussion. Thank you for your time.
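As a rough illustration of the early-fusion point: in a VISTA-style design, projected image patch tokens and text tokens are concatenated and attend to each other inside a single encoder, rather than being pooled separately and fused afterwards. The sketch below uses made-up dimensions and module names and is not VISTA's actual implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoderSketch(nn.Module):
    """Toy illustration of early fusion: image tokens and text tokens are
    concatenated and processed jointly by one transformer encoder."""

    def __init__(self, hidden: int = 768, num_layers: int = 2) -> None:
        super().__init__()
        self.image_proj = nn.Linear(1024, hidden)  # project ViT patch features to text width
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N_img, 1024), text_tokens: (B, N_txt, hidden)
        fused = torch.cat([self.image_proj(image_tokens), text_tokens], dim=1)
        hidden_states = self.encoder(fused)  # joint attention over both modalities
        return hidden_states[:, 0]           # pool (e.g. first token) as the embedding

# Example with random inputs:
model = EarlyFusionEncoderSketch()
emb = model(torch.randn(2, 16, 1024), torch.randn(2, 32, 768))
print(emb.shape)  # torch.Size([2, 768])
```

Because the instruction text, the query text, and the image tokens share one attention stack, instruction-style fine-tuning can condition the image tokens directly, which is presumably why further gains seem plausible.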