Please fill out the VideoMAE V2 Download Request Form. The download link for the VideoMAE V2 model weights is shown after you submit it. The form asks about your organization and how you plan to use the model, which helps us understand our users' needs and improve our future work.
The weights of the distilled models can be downloaded directly from the Distillation section.
| Model | Config | Dataset | Encoder Masking | Decoder Masking | Epoch | #Frame |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-giant | vit_g_hybrid_pt_1200e | UnlabeledHybrid | tube (90%) | running cell (50%) | 1200 | 16 |

- We use different sampling intervals for videos from different sources in UnlabeledHybrid: 2 for SSv2 and 4 for the other datasets.
- We report fine-tuning accuracy with sparse sampling on SSv2 and with dense sampling on the other datasets.
- #Frame = #input_frame x #clip x #crop.
- The input resolution is $224^2$ for all models.
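
The `#Frame` notation and the sampling intervals in the notes above can be illustrated with a small sketch. It is not taken from the VideoMAE V2 codebase: the helper names (`parse_frame_spec`, `total_frames`, `clip_frame_indices`) are hypothetical, and it assumes that "sampling interval" means the stride between consecutive sampled frames within a clip.

```python
from typing import List, NamedTuple


class FrameSpec(NamedTuple):
    input_frames: int  # frames per clip fed to the model
    clips: int         # temporal clips sampled per video
    crops: int         # spatial crops taken per clip


def parse_frame_spec(spec: str) -> FrameSpec:
    """Parse a '#Frame' entry such as '16x5x3' into its three factors."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected '<frames>x<clips>x<crops>', got {spec!r}")
    return FrameSpec(*(int(p) for p in parts))


def total_frames(spec: str) -> int:
    """Total frames processed per video: #input_frame x #clip x #crop."""
    parsed = parse_frame_spec(spec)
    return parsed.input_frames * parsed.clips * parsed.crops


def clip_frame_indices(start: int, num_frames: int, interval: int) -> List[int]:
    """Source-frame indices of one clip (assumption: interval = stride)."""
    return [start + i * interval for i in range(num_frames)]


if __name__ == "__main__":
    print(total_frames("16x5x3"))
    print(clip_frame_indices(0, 16, 2))  # interval used for SSv2
    print(clip_frame_indices(0, 16, 4))  # interval used for the other datasets
```

Under this reading, a 16-frame clip with interval 2 covers source frames 0 through 30, while interval 4 covers frames 0 through 60, so the same clip length spans roughly twice as much video.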

| Model | Dataset | Teacher Model | #Frame | K710 Top-1 | K400 Top-1 | K600 Top-1 | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-small | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 77.6 | 83.7 | 83.1 | vit_s_k710_dl_from_giant.pth |
| | | fine-tuning accuracy | 16x7x3 | -- | 84.0 | 84.6 | -- |
| ViT-base | K710 | vit_g_hybrid_pt_1200e_k710_ft | 16x5x3 | 81.5 | 86.6 | 85.9 | vit_b_k710_dl_from_giant.pth |
| | | fine-tuning accuracy | 16x7x3 | -- | 87.1 | 87.4 | |

- We initialize the student model's parameters from the model obtained after the post-pre-training stage.
- Fine-tuning accuracy is the accuracy obtained by fine-tuning the distilled model for several more epochs on the specified dataset.