
Question About Frozen Text Encoder #7

Open

kimwongyuda opened this issue Oct 11, 2024 · 7 comments

@kimwongyuda

I am really inspired by your nice work, and thank you for it.

My question is: why is the text encoder frozen during training?

When I fine-tune the VISTA model on another dataset such as M-BEIR, the results without freezing are better than those with a frozen text encoder.

I just wonder about your intention in freezing the text encoder.

Thank you.

@JUNJIE99
Owner

JUNJIE99 commented Oct 11, 2024

Hi,

Thank you for your interest in our work!

The primary motivation behind VISTA is to enhance a pre-trained text encoder with visual capabilities while preserving its strong text retrieval performance. We believe that the quality of text embeddings is crucial for (multimodal) dense retrieval, particularly in tasks involving multimodal document retrieval with significant textual content, such as the WebQA and ReMuQ benchmarks.

In developing VISTA, our main concern has been its generalization capability, particularly its zero-shot performance. We have also observed that for specific tasks, not freezing the text encoder can yield better results. Consequently, in our paper, we have kept the text encoder unfrozen for downstream fine-tuning (Section 4.2). However, we think that fine-tuning the text encoder on specific datasets might compromise VISTA's inherent text retrieval capabilities derived from the BGE model.
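Concretely, the two regimes differ only in whether the text encoder's parameters receive gradient updates. Here is a minimal PyTorch sketch; `model.text_encoder` is a hypothetical attribute name for illustration, not necessarily VISTA's actual module layout:

```python
import torch

def set_text_encoder_frozen(model: torch.nn.Module, frozen: bool = True) -> None:
    """Toggle gradient updates for a (hypothetical) text encoder submodule."""
    for param in model.text_encoder.parameters():
        param.requires_grad = not frozen
    # Keep dropout / normalization behavior fixed while frozen.
    model.text_encoder.train(mode=not frozen)

# Pre-training with visual capabilities: freeze to preserve text retrieval quality.
# set_text_encoder_frozen(model, frozen=True)
# Downstream fine-tuning (Section 4.2): unfreeze for task-specific gains.
# set_text_encoder_frozen(model, frozen=False)
```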

Regarding your application scenario: although M-BEIR encompasses various retrieval contexts, it includes corresponding training data, which makes its downstream tasks in-domain tests. Therefore, I believe it is quite reasonable that performance improves when the text encoder is not frozen.

Lastly, I greatly appreciate your efforts in testing VISTA on M-BEIR. I would love to know if you could share your fine-tuning results with me. I am also very curious about VISTA's performance on downstream tasks within M-BEIR.

Thank you!

@kimwongyuda
Author

Thank you! Here are the results.
The metric is Recall@10 for Fashion200k and FashionIQ, and Recall@5 for the others (a sketch of how Recall@k is computed follows the table).

| Task | VISTA (not FT) | VISTA (FT, frozen) | VISTA (FT, not frozen) |
|---|---|---|---|
| VisualNews (T -> I) | 0.0013 | 0.019 | 0.0951 |
| MSCOCO (T -> I) | 0.0059 | 0.4731 | 0.601 |
| Fashion200k (T -> I) | 0.0006 | 0.1001 | 0.1518 |
| WebQA (T -> T) | 0.7572 | 0.758 | 0.8037 |
| EDIS (T -> TI) | 0.2049 | 0.3505 | 0.3993 |
| WebQA (T -> TI) | 0.5257 | 0.6233 | 0.7208 |
| VisualNews (I -> T) | 0.0001 | 0.0218 | 0.1126 |
| MSCOCO (I -> T) | 0.005 | 0.3852 | 0.8388 |
| Fashion200k (I -> T) | 0 | 0.0636 | 0.152 |
| NIGHTS (I -> I) | 0.2212 | 0.3104 | 0.3014 |
| OVEN (TI -> T) | 0.0024 | 0.0056 | 0.3298 |
| InfoSeek (TI -> T) | 0.0009 | 0.0184 | 0.217 |
| FashionIQ (TI -> I) | 0.1556 | 0.0596 | 0.1794 |
| CIRR (TI -> I) | 0.1612 | 0.1597 | 0.3153 |
| OVEN (TI -> TI) | 0.4204 | 0.4625 | 0.5169 |
| InfoSeek (TI -> TI) | 0.3065 | 0.3367 | 0.4259 |
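For reference, here is a minimal sketch of how Recall@k is typically computed for retrieval results; this is my own illustration, not the official M-BEIR evaluation code.

```python
from typing import Sequence, Set

def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                relevant_ids: Sequence[Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top-k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_ids)

# Example: Recall@5 over two queries, where only the first query has a hit.
# recall_at_k([["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]],
#             [{"c"}, {"q"}], k=5)  # -> 0.5
```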

@JUNJIE99
Owner

Many thanks for sharing these results; I really appreciate it!

I have another question: Did you use the instruction method mentioned in UniIR for your fine-tuning?

@kimwongyuda
Author

kimwongyuda commented Oct 11, 2024

Yes, I used instructions in both training and evaluation.
However, the instructions in UniIR are fixed and static for each task, while the instructions in VISTA are more varied, because they depend on each instance rather than each task (see the toy illustration below).

I think the definitions of "instruction" differ between UniIR and VISTA.

So I suspect this difference is the reason VISTA's zero-shot results are not good on the M-BEIR evaluation with instructions.
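To make the distinction concrete, here is a toy illustration; the strings and names below are invented for clarity, not the actual prompts from either UniIR or VISTA.

```python
# UniIR-style: one fixed instruction per task, shared by every instance.
uniir_task_instruction = {
    "mscoco_t2i": "Find an image that matches the given caption.",
    "webqa_t2t":  "Retrieve a passage that answers the question.",
}

# VISTA-style: the instruction varies per instance, e.g. composed from the
# query itself, so no two instances need to share the same string.
def vista_instance_instruction(query: str) -> str:
    return f"Retrieve the target relevant to: {query}"
```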

@JUNJIE99
Owner

Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

@kimwongyuda
Author

> Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

Yes they were!

@JUNJIE99
Owner

Thank you for your response. It appears that VISTA continues to show outstanding zero-shot performance on M-BEIR compared to UniIR.

Regarding the fine-tuning of VISTA with instructions, I believe it has the potential to achieve even better results due to its early fusion of image and text tokens. This is definitely worth exploring further.
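For intuition, here is a rough sketch of what early fusion means in this context: image patch embeddings are concatenated with text token embeddings before a shared transformer encoder, so self-attention mixes the two modalities from the first layer onward. The shapes and modules below are illustrative, not VISTA's actual implementation.

```python
import torch

D = 768  # shared embedding dimension (illustrative)
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)

image_tokens = torch.randn(1, 196, D)  # e.g. 14x14 ViT patch embeddings
text_tokens = torch.randn(1, 32, D)    # embedded query tokens
fused = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 228, D)
output = encoder(fused)  # attention jointly attends over both modalities
```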

I greatly appreciate the results you've shared and our ongoing discussions. These insights are incredibly valuable to me.

If you have any further questions, I am always open to more discussions.

Thank you for your time.
