Great work!
I am also very interested in your work. Recently, I tried to reproduce it for video modality alignment. I initialized from OpenAI's pre-trained ViT-B/32 and added temporal attention in the visual encoder to model temporal relationships. During training, the text encoder is frozen, and only the weights of the embedding layer and the temporal-attention blocks of the visual encoder are updated. With this setup, the loss only drops from 5.9 to 5.2. If both the visual encoder and the text encoder are fully fine-tuned, the loss drops to about 0.3. So when only part of the visual encoder is fine-tuned, the loss converges poorly. Did you encounter this during training? What should I pay attention to when using this partial fine-tuning scheme?
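For reference, here is a minimal sketch (PyTorch) of the freezing scheme described above. It assumes an OpenAI CLIP ViT-B/32 checkpoint; the parameter-name keywords, and `temporal_attn` in particular, are placeholders for whatever the custom temporal-attention modules are actually called in my code.

```python
# Minimal sketch, assuming OpenAI's CLIP package (https://github.com/openai/CLIP)
# and a visual encoder that has been extended with custom temporal-attention blocks.
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")

# Freeze everything first; the text encoder stays frozen throughout.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the visual embedding layer and the temporal-attention blocks.
# "temporal_attn" is a placeholder name for the added temporal modules.
trainable_keywords = ("visual.conv1", "visual.positional_embedding", "temporal_attn")
for name, p in model.named_parameters():
    if any(k in name for k in trainable_keywords):
        p.requires_grad = True

# The optimizer only sees the trainable subset.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5, weight_decay=0.2
)

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```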