
Ensuring Effective Handling of Visual and Text Mixed Inputs with Locked Pre-trained Text Tower in Stage 2 Training #6

Open
CarllllWang opened this issue Oct 9, 2024 · 7 comments

Comments

@CarllllWang

Hi JUNJIE. In "train.bash," I found that you lock the text tower and only train the vision tower. The text tower (BGE) is initialized from the pre-trained BAAI/bge-base-en-v1.5, and its weights remain completely frozen during training. The assumption is that it can already encode sequences of [cls] + image tokens + text tokens, so training essentially updates the vision tower to produce image embeddings that fit the text tower's embedding space. How can we ensure that the text-only pre-trained BAAI/bge-base-en-v1.5 model handles mixed visual and text inputs effectively if I only train stage 2?

full_options="
--output_dir $SAVE_PATH
--bge_model_name_or_path BAAI/bge-base-en-v1.5
--visual_model_name_or_path EVA02-CLIP-B-16
--dataloader_num_workers 1
--train_data $DATA_PATH
--train_data_image $IMAGE_PATH
--train_group_size $GROUP_SIZE
--learning_rate $LR
--fp16
--per_device_train_batch_size $BSZ_PERGPU
--dataloader_drop_last True
--normlized True
--temperature 0.02
--logging_steps 10
--num_train_epochs $EPOCH
--negatives_cross_device
--train_text_tower False
--train_vision_tower True
--resume_path $RESUME_PATH
--save_steps $SAVE_STEPS
--deepspeed $DeepSpeedConfig
--gradient_checkpointing
...
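
For concreteness, here is a minimal sketch of what `--train_text_tower False` / `--train_vision_tower True` amount to, as I understand them. The module names (`bge_encoder`, `visual_encoder`) are assumptions for illustration, not the repository's actual attribute names:

```python
import torch

def set_trainable(model, train_text_tower: bool = False, train_vision_tower: bool = True):
    # Lock the pre-trained BGE text tower so its embedding space stays fixed.
    for p in model.bge_encoder.parameters():       # hypothetical attribute name
        p.requires_grad = train_text_tower
    # Keep the EVA-CLIP vision tower trainable so its outputs adapt to the text space.
    for p in model.visual_encoder.parameters():    # hypothetical attribute name
        p.requires_grad = train_vision_tower

def build_optimizer(model, lr: float):
    # Only trainable parameters reach the optimizer, so gradients update the
    # vision tower while the text tower's weights never change.
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)
```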

@JUNJIE99 (Owner) commented Oct 9, 2024

Hi,

During the two-stage training process of VISTA, the Text Encoder remains locked. This is because our goal is to introduce visual capabilities to a powerful general-purpose text embedding model without affecting its original text embedding capabilities. We believe that text embedding capability remains a fundamental ability in multimodal retrieval.

Our experiments across various multimodal retrieval tasks demonstrate that using a pre-trained text encoder to handle mixed-modality token sequences is effective. Our model achieves state-of-the-art performance on zero-shot multimodal retrieval tasks, especially text-heavy tasks such as WebQA and ReMuQ.

@CarllllWang (Author)

Thank you very much for your response. Let's consider an alternative approach: could we first fine-tune a pure text BGE model (such as bge-large-zh-v1.5), and then use the fine-tuned BGE as the text tower in Stage 2 training to fine-tune the vision tower?
For my data, the query is single-modal (text only), while the items actually have both text and image modalities, similar to the T2IT task. My original approach was single-modal text retrieval, which means products with visual information lose their image features entirely. Therefore, I hope to achieve better performance than text-only BGE by using VISTA. Would this fine-tuned model perform better in multimodal retrieval (text as the query, items consisting of text + images) than the original pure-text BGE model?
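To make the setup concrete, this is a rough sketch of the T2IT-style scoring I have in mind: a text-only query embedding scored against fused text+image candidate embeddings, using the same normalization and temperature as in the training options above. The function is a placeholder, not VISTA's actual API:

```python
import torch
import torch.nn.functional as F

def score(query_emb: torch.Tensor, candidate_embs: torch.Tensor, temperature: float = 0.02):
    # query_emb: (d,) embedding of the text-only query
    # candidate_embs: (n, d) embeddings of text+image candidates
    q = F.normalize(query_emb, dim=-1)      # matches --normlized True
    c = F.normalize(candidate_embs, dim=-1)
    return (q @ c.T) / temperature          # higher score = better match
```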

@JUNJIE99 (Owner) commented Oct 9, 2024

If you fine-tune the BGE Text Encoder first, I believe you still need to perform image-text pre-training (i.e., the first stage training in our paper) to align the visual encoder with your newly trained text encoder. This is because the second stage training data (e.g., T2IT) is usually not large enough to align a visual encoder to a new text encoder space. Therefore, using large-scale image-text data for alignment is necessary.

Of course, you can directly load VISTA's visual encoder into your fine-tuned BGE Text Encoder, but I'm not sure if your data scale is sufficient to help the visual encoder align with the text embedding space.
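If you do try that, a rough sketch of loading only the visual-encoder weights from a VISTA checkpoint into a model whose text tower is your fine-tuned BGE could look like the following. The key prefix and checkpoint layout are assumptions; check the actual checkpoint keys first:

```python
import torch

def load_vista_visual_encoder(model, vista_ckpt_path: str, prefix: str = "visual_encoder."):
    # Keep only the entries that belong to the visual encoder (assumed key prefix).
    state = torch.load(vista_ckpt_path, map_location="cpu")
    visual_state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
    missing, unexpected = model.visual_encoder.load_state_dict(visual_state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```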

@CarllllWang (Author)

Thank you for your response. As you mentioned, the first stage of training seems necessary. However, I noticed in the "train.bash" of your "stage2_training_code" that the loaded BGE model appears to be the original "BAAI/bge-base-en-v1.5," which has not undergone the first stage of training. That suggests the model is a purely text-pre-trained BGE, yet it is frozen during the second stage of training. If we train this way, doesn't that mean the text encoder used for stage 2 is not aligned with visual features?

@JUNJIE99 (Owner) commented Oct 9, 2024

Sorry for any misunderstanding caused by the untidy training code. The AutoModel.from_pretrained() call in the model initialization function is used merely to initialize the model architecture; the weights are then overwritten. As you can see at line 116 of the run_stage2_fusion.py file, we reload the VISTA weights (i.e., the model weights obtained from the first stage).

For a more streamlined approach, you can refer to the model initialization method in the main repository here, which uses the config file to initialize the model architecture and avoids downloading the weight files repeatedly.
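Roughly, the idea is the following (a sketch assuming Hugging Face transformers; the checkpoint path is a placeholder): build the architecture from the config only, then overwrite the weights with the stage-1 checkpoint:

```python
import torch
from transformers import AutoConfig, AutoModel

# Architecture only: no pre-trained weights are needed at this point.
config = AutoConfig.from_pretrained("BAAI/bge-base-en-v1.5")
text_tower = AutoModel.from_config(config)

# The weights are then overwritten by the stage-1 (VISTA) checkpoint,
# so whatever from_pretrained/from_config initialized no longer matters.
vista_state = torch.load("path/to/stage1_vista_checkpoint.pth", map_location="cpu")
text_tower.load_state_dict(vista_state, strict=False)
```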

@CarllllWang (Author)

Thank you very much for your response. I hadn't noticed that the model parameters are loaded there. Should RESUME_PATH point to the model trained in the first stage? Is it possible to provide a trained text encoder that has already been aligned with visual features? Furthermore, does this mean the original BGE text model can no longer be fine-tuned? For my data, the pre-trained BGE also needs fine-tuning to achieve better results. Loading RESUME_PATH implies that a BGE model that has not been trained on my downstream data stays locked during the subsequent second-stage training.

@JUNJIE99 (Owner) commented Oct 9, 2024

Yes, the RESUME_PATH refers to the model weights from the first stage that need to be loaded during the second stage of training.

Regarding your question, "Is it possible to provide a trained text encoder that has already been aligned with visual features?" In our approach, we align the visual encoder to the pre-trained text encoder, not the other way around.

As for whether the original BGE text model can still be fine-tuned, I believe it can. However, the visual encoder weights we provide are aligned with the original BGE model. This means that our provided weights cannot be directly used with your fine-tuned model. If your fine-tuning of the BGE model is minimal, you might be able to achieve a good alignment again by re-aligning the visual encoder with your fine-tuned BGE model using relatively little data. However, the effectiveness of this approach depends on the extent of your prior fine-tuning and the amount of data available for re-alignment. So, while theoretically possible, I cannot guarantee it will work effectively in your specific case.

I hope this helps clarify things for you.
