This is an official implementation of the multi-modality pre-training model in InternVideo. It is responsible for multi-modality tasks including zero-shot action recognition, zero-shot multiple choice, zero-shot retrieval, video question answering, and video-text retrieval, and it is also one of the components of the final InternVideo model.
We currently provide the B/16 model; please download it from Aliyun and place it under the `models` folder. The model uses UniformerV2 as its backbone and was trained for 12 days on 128 NVIDIA A100 GPUs.
To classify the demo video of an airplane taking off, run `python demo.py`. You should see results like the following (produced by the L/14 model); a minimal sketch of the underlying scoring step follows the sample output.
```
Label probs:
an airplane is taking off : 0.9562
an airplane is flying : 0.0438
a dog is chasing a ball : 0.0000
```
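For reference, this output comes from the standard CLIP-style zero-shot recipe: encode the video and the candidate captions, L2-normalize both, and take a softmax over the scaled cosine similarities. The sketch below is a minimal, hedged illustration of that scoring step only; random tensors stand in for the actual InternVideo encoder outputs, and the feature dimension and logit scale are assumed placeholder values, not the model's real configuration.

```python
import torch
import torch.nn.functional as F

labels = ["an airplane is taking off",
          "an airplane is flying",
          "a dog is chasing a ball"]

# Placeholders: in demo.py these would come from the video and text encoders.
video_feat = torch.randn(1, 512)             # [1, D] video embedding
text_feats = torch.randn(len(labels), 512)   # [num_labels, D] caption embeddings

# L2-normalize so the dot product below is a cosine similarity.
video_feat = F.normalize(video_feat, dim=-1)
text_feats = F.normalize(text_feats, dim=-1)

# Scaled similarities -> softmax gives the "Label probs" the demo prints.
logit_scale = 100.0                          # typical CLIP value; assumed here
probs = (logit_scale * video_feat @ text_feats.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label} : {p:.4f}")
```

With the real encoder features in place of the random tensors, the highest-probability caption is the zero-shot prediction for the clip.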
This folder provides a minimal inference implementation for easier usage. For training and for fine-tuning on downstream tasks, please refer to the other task-specific folders.
If you intend to use InternVideo for your own video-language tasks: for alignment tasks such as retrieval, use only the video encoder and text encoder; if your task involves modality fusion, such as video question answering, additionally use the features from the cross-modality decoder. A rough sketch of both usage patterns is given below.
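The sketch below illustrates that guidance under stated assumptions: `model` stands in for the loaded checkpoint, and the method names `encode_video`, `encode_text`, and `cross_decoder` are hypothetical placeholders, not the actual module API; check the code in this folder for the real entry points.

```python
import torch
import torch.nn.functional as F

def retrieval_scores(model, video, texts):
    # Alignment task (e.g. video-text retrieval): encoder features only.
    v = F.normalize(model.encode_video(video), dim=-1)   # [1, D]
    t = F.normalize(model.encode_text(texts), dim=-1)    # [N, D]
    return v @ t.T                                       # cosine similarities

def fusion_features(model, video, question):
    # Fusion task (e.g. video question answering): also pass both streams
    # through the cross-modality decoder and use its fused output.
    v = model.encode_video(video)
    q = model.encode_text(question)
    return model.cross_decoder(v, q)                     # fused representation
```

The design point is the split itself: alignment tasks only need the two unimodal embedding spaces, while fusion tasks need the joint representation the decoder produces.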
The training code and the L/14 model are on their way.