Pretraining on video dataset without LoRA. #54

Open
shihuai opened this issue May 29, 2024 · 0 comments

Comments


shihuai commented May 29, 2024

Great work!

I am also very interested in this work and recently tried to reproduce the video modality alignment. I initialize from OpenAI's pre-trained ViT-B/32 and add temporal attention to the visual encoder to model temporal relationships. During training the text encoder is frozen, and only the weights of the embedding layer and the temporal attention blocks of the visual encoder are updated. With this setup the loss only drops from 5.9 to 5.2; if both the visual encoder and the text encoder are fully fine-tuned, the loss goes down to about 0.3. So when only part of the visual encoder is fine-tuned, the loss converges poorly. Did you encounter this during training? What should I pay attention to when using this fine-tuning method?
