Great work!
I am also very interested in your work. Recently, I tried to reproduce it for video modality alignment. I initialized from OpenAI's pre-trained ViT-B/32 and added temporal attention in the visual encoder to model temporal relationships. During training, the text encoder is frozen, and only the weights of the embedding layer and the temporal-attention blocks of the visual encoder are updated. With this setup, the loss only drops from 5.9 to 5.2. If both the visual encoder and the text encoder are fully fine-tuned, the loss drops to about 0.3. So when only part of the visual encoder is fine-tuned, the loss converges poorly. Did you encounter this during training? What should I pay attention to when using this partial fine-tuning scheme?
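For reference, here is a minimal sketch (PyTorch) of the freezing scheme described above. It assumes an OpenAI CLIP ViT-B/32 checkpoint; the parameter-name keywords, and `temporal_attn` in particular, are placeholders for whatever the custom temporal-attention modules are actually called in my code.

```python
# Minimal sketch, assuming OpenAI's CLIP package (https://github.com/openai/CLIP)
# and a visual encoder that has been extended with custom temporal-attention blocks.
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")

# Freeze everything first; the text encoder stays frozen throughout.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the visual embedding layer and the temporal-attention blocks.
# "temporal_attn" is a placeholder name for the added temporal modules.
trainable_keywords = ("visual.conv1", "visual.positional_embedding", "temporal_attn")
for name, p in model.named_parameters():
    if any(k in name for k in trainable_keywords):
        p.requires_grad = True

# The optimizer only sees the trainable subset.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5, weight_decay=0.2
)

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```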