
Ensuring Effective Handling of Visual and Text Mixed Inputs with Locked Pre-trained Text Tower in Stage 2 Training #6

Open
CarllllWang opened this issue Oct 9, 2024 · 7 comments

Comments

@CarllllWang

Hi JUNJIE. In "train.bash," I found that you lock the text tower and only train the vision tower. The text tower (BGE) is initialized from the pre-trained BAAI/bge-base-en-v1.5, and its weights remain completely frozen during training. The assumption is that it can already encode sequences of [cls] + image tokens + text tokens, so training essentially updates the vision tower to produce image embeddings that fit the text tower's embedding space. How can we ensure that the text-only pre-trained BAAI/bge-base-en-v1.5 model handles mixed visual and text inputs effectively if I only train stage 2?

full_options="
--output_dir $SAVE_PATH
--bge_model_name_or_path BAAI/bge-base-en-v1.5
--visual_model_name_or_path EVA02-CLIP-B-16
--dataloader_num_workers 1
--train_data $DATA_PATH
--train_data_image $IMAGE_PATH
--train_group_size $GROUP_SIZE
--learning_rate $LR
--fp16
--per_device_train_batch_size $BSZ_PERGPU
--dataloader_drop_last True
--normlized True
--temperature 0.02
--logging_steps 10
--num_train_epochs $EPOCH
--negatives_cross_device
--train_text_tower False
--train_vision_tower True
--resume_path $RESUME_PATH
--save_steps $SAVE_STEPS
--deepspeed $DeepSpeedConfig
--gradient_checkpointing
...
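
For concreteness, here is a minimal sketch of what `--train_text_tower False` / `--train_vision_tower True` amount to, as I understand them. The module names (`bge_encoder`, `visual_encoder`) are assumptions for illustration, not the repository's actual attribute names:

```python
import torch

def set_trainable(model, train_text_tower: bool = False, train_vision_tower: bool = True):
    # Lock the pre-trained BGE text tower so its embedding space stays fixed.
    for p in model.bge_encoder.parameters():       # hypothetical attribute name
        p.requires_grad = train_text_tower
    # Keep the EVA-CLIP vision tower trainable so its outputs adapt to the text space.
    for p in model.visual_encoder.parameters():    # hypothetical attribute name
        p.requires_grad = train_vision_tower

def build_optimizer(model, lr: float):
    # Only trainable parameters reach the optimizer, so gradients update the
    # vision tower while the text tower's weights never change.
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)
```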

@JUNJIE99 (Owner) commented Oct 9, 2024

Hi,

During the two-stage training process of VISTA, the Text Encoder remains locked. This is because our goal is to introduce visual capabilities to a powerful general-purpose text embedding model without affecting its original text embedding capabilities. We believe that text embedding capability remains a fundamental ability in multimodal retrieval.

Our experiments across various multimodal retrieval tasks demonstrate that using a pre-trained text encoder to handle mixed-modality token sequences is effective. Our model achieves state-of-the-art performance on zero-shot multimodal retrieval tasks, especially text-heavy tasks such as WebQA and ReMuQ.

@CarllllWang (Author)

Thank you very much for your response. Let's consider an alternative approach: could we first fine-tune a pure text BGE model (such as bge-large-zh-v1.5), and then use the fine-tuned BGE as the text tower in Stage 2 training to fine-tune the vision tower?
For my data, the query is single-modal (text only), while the items actually have both text and image modalities, similar to the T2IT task. My original approach was single-modal text retrieval, which means products with visual information lose their image features entirely. Therefore, I hope to achieve better performance than text-only BGE by using VISTA. Would this fine-tuned model perform better in multimodal retrieval (text as the query, items consisting of text + images) than the original pure-text BGE model?
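To make the setup concrete, this is a rough sketch of the T2IT-style scoring I have in mind: a text-only query embedding scored against fused text+image candidate embeddings, using the same normalization and temperature as in the training options above. The function is a placeholder, not VISTA's actual API:

```python
import torch
import torch.nn.functional as F

def score(query_emb: torch.Tensor, candidate_embs: torch.Tensor, temperature: float = 0.02):
    # query_emb: (d,) embedding of the text-only query
    # candidate_embs: (n, d) embeddings of text+image candidates
    q = F.normalize(query_emb, dim=-1)      # matches --normlized True
    c = F.normalize(candidate_embs, dim=-1)
    return (q @ c.T) / temperature          # higher score = better match
```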

@JUNJIE99 (Owner) commented Oct 9, 2024

If you fine-tune the BGE Text Encoder first, I believe you still need to perform image-text pre-training (i.e., the first stage training in our paper) to align the visual encoder with your newly trained text encoder. This is because the second stage training data (e.g., T2IT) is usually not large enough to align a visual encoder to a new text encoder space. Therefore, using large-scale image-text data for alignment is necessary.

Of course, you can directly load VISTA's visual encoder into your fine-tuned BGE Text Encoder, but I'm not sure if your data scale is sufficient to help the visual encoder align with the text embedding space.
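If you do try that, a rough sketch of loading only the visual-encoder weights from a VISTA checkpoint into a model whose text tower is your fine-tuned BGE could look like the following. The key prefix and checkpoint layout are assumptions; check the actual checkpoint keys first:

```python
import torch

def load_vista_visual_encoder(model, vista_ckpt_path: str, prefix: str = "visual_encoder."):
    # Keep only the entries that belong to the visual encoder (assumed key prefix).
    state = torch.load(vista_ckpt_path, map_location="cpu")
    visual_state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
    missing, unexpected = model.visual_encoder.load_state_dict(visual_state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```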

@CarllllWang (Author)

Thank you for your response. As you mentioned, the first stage of training seems necessary. However, I noticed in the "train.bash" of your "stage2_training_code" that the loaded BGE model appears to be the original "BAAI/bge-base-en-v1.5," which has not undergone the first stage of training. That suggests the model is a purely text-pre-trained BGE, yet it is frozen during the second stage of training. If we train this way, doesn't that mean the text encoder used for stage 2 is not aligned with visual features?

@JUNJIE99 (Owner) commented Oct 9, 2024

Sorry for any misunderstanding caused by the untidy training code. The AutoModel.from_pretrained() call in the model initialization function is used merely to initialize the model architecture; the weights are then overwritten. As you can see at line 116 of the run_stage2_fusion.py file, we reload the VISTA weights (i.e., the model weights obtained from the first stage).

For a more streamlined approach, you can refer to the model initialization method in the main repository here, which uses the config file to initialize the model architecture and avoids downloading the weight files repeatedly.
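Roughly, the idea is the following (a sketch assuming Hugging Face transformers; the checkpoint path is a placeholder): build the architecture from the config only, then overwrite the weights with the stage-1 checkpoint:

```python
import torch
from transformers import AutoConfig, AutoModel

# Architecture only: no pre-trained weights are needed at this point.
config = AutoConfig.from_pretrained("BAAI/bge-base-en-v1.5")
text_tower = AutoModel.from_config(config)

# The weights are then overwritten by the stage-1 (VISTA) checkpoint,
# so whatever from_pretrained/from_config initialized no longer matters.
vista_state = torch.load("path/to/stage1_vista_checkpoint.pth", map_location="cpu")
text_tower.load_state_dict(vista_state, strict=False)
```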

@CarllllWang (Author)

Thank you very much for your response. I hadn't noticed that the model parameters are loaded there. Should RESUME_PATH point to the model trained in the first stage? Is it possible to provide a trained text encoder that has already been aligned with visual features? Furthermore, does this mean the original BGE text model can no longer be fine-tuned? For my data, the pre-trained BGE also needs fine-tuning to achieve better results. Loading RESUME_PATH implies that a BGE model that has not been trained on my downstream data stays locked during the subsequent second-stage training.

@JUNJIE99 (Owner) commented Oct 9, 2024

Yes, the RESUME_PATH refers to the model weights from the first stage that need to be loaded during the second stage of training.

Regarding your question, "Is it possible to provide a trained text encoder that has already been aligned with visual features?" In our approach, we align the visual encoder to the pre-trained text encoder, not the other way around.

As for whether the original BGE text model can still be fine-tuned, I believe it can. However, the visual encoder weights we provide are aligned with the original BGE model. This means that our provided weights cannot be directly used with your fine-tuned model. If your fine-tuning of the BGE model is minimal, you might be able to achieve a good alignment again by re-aligning the visual encoder with your fine-tuned BGE model using relatively little data. However, the effectiveness of this approach depends on the extent of your prior fine-tuning and the amount of data available for re-alignment. So, while theoretically possible, I cannot guarantee it will work effectively in your specific case.

I hope this helps clarify things for you.
