Multi-Modalities-Downstream

This is an official implementation of the multi-modality downstream tasks in InternVideo: zero-shot action recognition, zero-shot multiple choice, and video question answering.

Usage

Pre-trained model preparation

We currently provide only the B/16 model; please download it from aliyun. You will also need the original CLIP ViT-B-16 model. Update /path/to/model in the scripts to point to the downloaded files.
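The path update can also be scripted. The following is a minimal sketch, not part of this repo: it assumes the scripts contain the literal placeholder /path/to/model, and the helper name set_model_path is hypothetical. If the InternVideo and CLIP checkpoints share the same placeholder, a blanket replace would point both at one file, so inspect the scripts first.

```python
from pathlib import Path

# Placeholder string mentioned in this README; assumed to appear literally in the scripts.
PLACEHOLDER = "/path/to/model"


def set_model_path(scripts_dir, model_path, pattern="*.sh"):
    """Replace PLACEHOLDER with model_path in matching scripts.

    Returns the sorted names of the files that were modified.
    """
    changed = []
    for script in sorted(Path(scripts_dir).glob(pattern)):
        text = script.read_text(encoding="utf-8")
        if PLACEHOLDER in text:
            script.write_text(text.replace(PLACEHOLDER, model_path), encoding="utf-8")
            changed.append(script.name)
    return changed
```

For example, `set_model_path("scripts", "/data/checkpoints/b16.pth")` would rewrite every shell script under scripts/ that still contains the placeholder; the checkpoint path here is made up for illustration.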

Installation and data preparation

The code is mostly based on All-In-One. Please follow its instructions for installation and data preparation.

Downstream tasks

Zero-shot action recognition

Please follow scripts/zs_classify.sh. We provide results on the Kinetics-400 test set.

Dataset	B/16	L/14
K400	56.65	64.25
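For readers new to the task, here is a rough sketch of how CLIP-style zero-shot classification generally works: each class name is embedded as text, the video is embedded once, and the class with the highest cosine similarity wins. This is not the code in scripts/zs_classify.sh. The function names and the toy 3-dimensional embeddings are assumptions standing in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(video_embedding, class_embeddings):
    """Return (best_label, scores), scoring each class by cosine similarity."""
    v = l2_normalize(video_embedding)
    scores = {}
    for label, text_embedding in class_embeddings.items():
        t = l2_normalize(text_embedding)
        scores[label] = sum(a * b for a, b in zip(v, t))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings; a real run would use the video and text encoder outputs.
class_embeddings = {
    "playing guitar": [0.9, 0.1, 0.0],
    "riding a bike": [0.0, 1.0, 0.2],
    "swimming": [0.1, 0.0, 1.0],
}
video_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(video_embedding, class_embeddings)
print(label)  # prints "playing guitar"
```

No training on the target classes is involved; only the list of class names changes between datasets.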

Zero-shot multiple choice

Please follow scripts/zs_choice_[dataset].sh. We provide results on MSRVTT and LSMDC.

Dataset	B/16	L/14
MSRVTT	91.31	93.44
LSMDC	73.96	77.26
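As an illustration of the task, zero-shot multiple choice can be framed as picking, for each video, the candidate caption whose embedding is most similar to the video embedding. The sketch below is generic and not taken from the scripts in this repo. It assumes the reported numbers are accuracy percentages, which this README does not state, and all names and toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_caption(video_embedding, caption_embeddings):
    """Return the index of the caption most similar to the video."""
    sims = [cosine(video_embedding, c) for c in caption_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)


def multiple_choice_accuracy(examples):
    """Percentage of examples where the picked caption is the labeled answer."""
    correct = sum(
        pick_caption(video, captions) == answer
        for video, captions, answer in examples
    )
    return 100.0 * correct / len(examples)


# Each example: (video embedding, candidate caption embeddings, correct index).
examples = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]], 1),
    ([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0], [0.6, 0.7]], 0),  # index 2 is picked, so this one is wrong
]

accuracy = multiple_choice_accuracy(examples)
```

With these toy inputs two of the three examples are answered correctly.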

Video question answering

Please follow scripts/finetune_[dataset].sh. We provide results on MSRVTT, MSVD, and TGIF-FrameQA.

Dataset	B/16	L/14
MSRVTT	44.58	47.14
MSVD	51.77	55.54
TGIF-FrameQA	67.83	72.22

TODO

The L/14 model is on its way.

Acknowledgement

This repo is built on All-In-One, CLIP, CoCa, and open_clip.
