This is the official implementation of the multimodal downstream tasks in InternVideo, including zero-shot action recognition, zero-shot multiple choice, and video question answering.
We currently provide the B/16 model; please download it from aliyun. You will also need the original CLIP ViT-B-16 model. Please update the `/path/to/model` placeholder in the scripts accordingly.
The code is mostly based on All-In-One. Please follow its instructions for installation and data preparation.
For zero-shot action recognition, please follow `scripts/zs_classify.sh`. We provide results on the Kinetics-400 test set.
| | B/16 | L/14 |
|---|---|---|
| K400 | 56.65 | 64.25 |
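
For readers new to zero-shot recognition, the general CLIP-style idea can be sketched as follows. This is only an illustration under stated assumptions, not the code that `scripts/zs_classify.sh` runs: the function names (`zero_shot_classify`, `cosine_similarity`) and the toy 3-dimensional embeddings are hypothetical, and a real model would produce the video and class-prompt embeddings with its own encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(video_embedding, class_text_embeddings):
    """Return the label whose text embedding is closest to the video embedding.

    class_text_embeddings maps a class label to the embedding of its text
    prompt (hypothetical inputs; no training on the target classes is needed).
    """
    scores = {
        label: cosine_similarity(video_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings standing in for encoder outputs (assumed values).
video = [0.9, 0.1, 0.0]
class_texts = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding a bike": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}

predicted_label, scores = zero_shot_classify(video, class_texts)
print(predicted_label)  # prints "playing guitar"
```

The prediction is simply the class prompt with the highest similarity, which is why no fine-tuning on the target dataset is required in this setting.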
For zero-shot multiple choice, please follow `scripts/zs_choice_[dataset].sh`. We provide results on MSRVTT and LSMDC.
| | B/16 | L/14 |
|---|---|---|
| MSRVTT | 91.31 | 93.44 |
| LSMDC | 73.96 | 77.26 |
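
As a rough illustration of how a multiple-choice benchmark can be scored with video and text embeddings, the sketch below picks the candidate caption most similar to the video and reports the percentage of correct picks. It is an assumption-laden toy, not the logic in `scripts/zs_choice_[dataset].sh`: `choose_caption`, `multiple_choice_accuracy`, and the 2-dimensional embeddings are hypothetical, and treating the score as a percentage accuracy is our assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def choose_caption(video_embedding, candidate_embeddings):
    """Return the index of the candidate caption closest to the video."""
    sims = [cosine(video_embedding, c) for c in candidate_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)


def multiple_choice_accuracy(examples):
    """Percentage of examples where the chosen caption matches the answer index.

    Each example is (video_embedding, candidate_embeddings, answer_index).
    """
    correct = sum(
        1
        for video, candidates, answer in examples
        if choose_caption(video, candidates) == answer
    )
    return 100.0 * correct / len(examples)


# Toy data (assumed values); the last example is one the similarity rule gets wrong.
examples = [
    ([1.0, 0.2], [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]], 0),
    ([0.1, 1.0], [[1.0, 0.0], [0.2, 0.8], [0.5, -0.5]], 1),
    ([0.5, 0.5], [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]], 1),
    ([1.0, 0.0], [[0.6, 0.8], [0.8, 0.6]], 0),
]

print(multiple_choice_accuracy(examples))  # prints 75.0
```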
For video question answering, please follow `scripts/finetune_[dataset].sh`. We provide results on MSRVTT, MSVD, and TGIF-FrameQA.
| | B/16 | L/14 |
|---|---|---|
| MSRVTT | 44.58 | 47.14 |
| MSVD | 51.77 | 55.54 |
| TGIF-Frame | 67.83 | 72.22 |
The L/14 model is on its way.
This repo is built on All-In-One, CLIP, CoCa, and open_clip.