Ensuring Effective Handling of Visual and Text Mixed Inputs with Locked Pre-trained Text Tower in Stage 2 Training #6
Hi, during the two-stage training process of VISTA, the Text Encoder remains locked. This is because our goal is to introduce visual capabilities to a powerful general-purpose text embedding model without affecting its original text embedding capabilities; we consider text embedding a fundamental ability in multimodal retrieval. Our experiments across various multimodal retrieval tasks demonstrate that using a pre-trained text encoder to handle mixed-modality token sequences is effective: the model achieves state-of-the-art performance in zero-shot multimodal retrieval, especially on text-heavy tasks such as WebQA and ReMuQ.
Thank you very much for your response. Let's consider an alternative approach: could we first fine-tune a pure-text BGE model (such as bge-large-zh-v1.5), and then use the fine-tuned BGE as the text tower in Stage 2 training to fine-tune the vision tower?
If you fine-tune the BGE Text Encoder first, I believe you still need to perform image-text pre-training (i.e., the first-stage training in our paper) to align the visual encoder with your newly trained text encoder. The second-stage training data (e.g., T2IT) is usually not large enough to align a visual encoder to a new text encoder space, so large-scale image-text data for alignment is necessary. Of course, you can directly load VISTA's visual encoder into your fine-tuned BGE Text Encoder, but I'm not sure whether your data scale is sufficient to help the visual encoder align with the new text embedding space.
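If you do want to try loading VISTA's visual encoder into a fine-tuned BGE, a minimal sketch of the partial weight loading could look like the following (PyTorch; the checkpoint path, the build_vista_model helper, and the "visual_encoder." key prefix are assumptions for illustration, not the repository's actual names):

import torch

# Hypothetical: a VISTA-style model built around your fine-tuned BGE text tower.
model = build_vista_model(bge_path="path/to/your_finetuned_bge",
                          visual_model_name="EVA02-CLIP-B-16")

# Copy only the visual-encoder weights from the released VISTA checkpoint.
vista_state = torch.load("path/to/vista_checkpoint.pth", map_location="cpu")
visual_state = {k: v for k, v in vista_state.items() if k.startswith("visual_encoder.")}
model.load_state_dict(visual_state, strict=False)

# The text tower keeps your fine-tuned weights; the vision tower would still
# need some re-alignment training on image-text pairs afterwards.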
Thank you for your response. As you mentioned, the first stage of training seems necessary. However, I noticed in the "train.bash" of your "stage2_training_code" that the loaded BGE model appears to be the original "BAAI/bge-base-en-v1.5," which has not undergone the first stage of training. This suggests the model is a pure text pre-trained BGE, yet it is kept fixed during the second stage of training. If we train it this way, doesn't that mean the text encoder used for stage 2 is not aligned with visual features?
Sorry for any misunderstanding caused by the training code not being as tidy as it could be. For a more streamlined approach, you can refer to the model initialization method in the main repository here, which uses the config file to initialize the model architecture and avoids downloading the weight files repeatedly.
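As a rough illustration of that config-based initialization (using Hugging Face transformers; the VistaModel wrapper and the checkpoint path are placeholders, not the repository's exact code):

import torch
from transformers import AutoConfig, AutoModel

# Build the BGE text tower from its config only, without downloading pretrained weights.
config = AutoConfig.from_pretrained("BAAI/bge-base-en-v1.5")
text_tower = AutoModel.from_config(config)

# Hypothetical wrapper that combines the text tower with the EVA-CLIP visual tower.
model = VistaModel(text_tower=text_tower, visual_model_name="EVA02-CLIP-B-16")

# The actual weights (including the stage-1 aligned visual encoder) then come
# from a saved checkpoint rather than from repeated downloads.
state = torch.load("path/to/stage1_checkpoint.pth", map_location="cpu")
model.load_state_dict(state, strict=False)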
Thank you very much for your response. I didn't see any model parameter loading here. Does the RESUME_PATH refer to the model trained in the first stage? Is it possible to provide a trained text encoder that has already been aligned with visual features? Furthermore, does this mean the original BGE text model can no longer be fine-tuned? For my data, the originally pre-trained BGE also needs fine-tuning to achieve better results. Loading the RESUME_PATH implies that a BGE model that has not been trained on my own data is locked during the subsequent second-stage training.
Yes, the RESUME_PATH refers to the model weights from the first stage, which need to be loaded during the second stage of training.

Regarding your question, "Is it possible to provide a trained text encoder that has already been aligned with visual features?": in our approach, we align the visual encoder to the pre-trained text encoder, not the other way around.

As for whether the original BGE text model can still be fine-tuned, I believe it can. However, the visual encoder weights we provide are aligned with the original BGE model, which means they cannot be directly used with your fine-tuned model. If your fine-tuning of the BGE model is minimal, you might be able to re-align the visual encoder with your fine-tuned BGE using relatively little data. The effectiveness of this depends on the extent of your prior fine-tuning and the amount of data available for re-alignment, so while it is theoretically possible, I cannot guarantee it will work in your specific case. I hope this helps clarify things.
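A minimal sketch of such a re-alignment step, assuming a model object with text_encoder / visual_encoder attributes and encode_image / encode_text helpers (all placeholder names), trained with an in-batch contrastive loss at the same 0.02 temperature used in train.bash:

import torch
import torch.nn.functional as F

# Freeze the fine-tuned BGE text tower; only the vision tower is updated.
for p in model.text_encoder.parameters():
    p.requires_grad = False
for p in model.visual_encoder.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)

for images, captions in dataloader:          # image-caption pairs for alignment
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(captions), dim=-1)
    logits = img_emb @ txt_emb.t() / 0.02    # in-batch negatives
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()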
Hi JUNJIE. In "train.bash," I found that you lock the text tower and only train the vision tower. The weights of the text tower (BGE) are already pre-trained (BAAI/bge-base-en-v1.5), so during training its weights are completely frozen. It is assumed to be able to encode [cls] + image tokens + text tokens, and the training essentially updates the vision tower to generate image embeddings that suit the text tower. How can we ensure that the text-only pre-trained BAAI/bge-base-en-v1.5 model can effectively handle mixed visual and text inputs if we only train stage 2?
full_options="
--output_dir $SAVE_PATH
--bge_model_name_or_path BAAI/bge-base-en-v1.5
--visual_model_name_or_path EVA02-CLIP-B-16
--dataloader_num_workers 1
--train_data $DATA_PATH
--train_data_image $IMAGE_PATH
--train_group_size $GROUP_SIZE
--learning_rate $LR
--fp16
--per_device_train_batch_size $BSZ_PERGPU
--dataloader_drop_last True
--normlized True
--temperature 0.02
--logging_steps 10
--num_train_epochs $EPOCH
--negatives_cross_device
--train_text_tower False
--train_vision_tower True
--resume_path $RESUME_PATH
--save_steps $SAVE_STEPS
--deepspeed $DeepSpeedConfig
--gradient_checkpointing
...
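For reference, a rough sketch of how a mixed [cls] + image-token + text-token sequence can be fed to a frozen BERT-style BGE encoder via input embeddings (only an illustration of the general idea, not the repository's exact code; vision_tower, project, tokenizer, bge, pixel_values, and texts are placeholders, and padding masks are omitted for brevity):

import torch
import torch.nn.functional as F

patch_feats = vision_tower(pixel_values)                 # (B, N_img, D_vis) patch features
img_tokens = project(patch_feats)                        # map to the BGE hidden size (B, N_img, D_txt)

text_ids = tokenizer(texts, return_tensors="pt", padding=True).input_ids
text_embs = bge.embeddings.word_embeddings(text_ids)     # (B, N_txt, D_txt)

# Keep [CLS] first, then image tokens, then the remaining text tokens.
mixed = torch.cat([text_embs[:, :1, :], img_tokens, text_embs[:, 1:, :]], dim=1)

out = bge(inputs_embeds=mixed)                           # frozen text tower encodes the mixed sequence
embedding = F.normalize(out.last_hidden_state[:, 0], dim=-1)   # [CLS] vector as the retrieval embedding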