InternVideo achieves state-of-the-art video-text retrieval performance under two settings (zero-shot and full fine-tuning), across six datasets and ten metrics.
Zero-Shot Video Retrieval
| Dataset | Setting | R@1↑ | R@5↑ | R@10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| MSRVTT | video-to-text | 37.5 | 63.3 | 71.3 | 3.0 | 24.2 |
| MSRVTT | text-to-video | 40.0 | 65.3 | 74.1 | 2.0 | 23.9 |
| MSVD | video-to-text | 67.6 | 90.6 | 94.6 | 1.0 | 2.9 |
| MSVD | text-to-video | 43.4 | 69.9 | 79.1 | 2.0 | 17.0 |
| LSMDC | video-to-text | 13.2 | 27.8 | 34.9 | 33.0 | 113.6 |
| LSMDC | text-to-video | 17.6 | 32.4 | 40.2 | 23.0 | 101.7 |
| ActivityNet | video-to-text | 31.4 | 59.4 | 73.1 | 3.0 | 15.6 |
| ActivityNet | text-to-video | 30.7 | 57.4 | 70.2 | 4.0 | 23.0 |
| DiDeMo | video-to-text | 33.5 | 60.3 | 71.1 | 3.0 | 21.5 |
| DiDeMo | text-to-video | 31.5 | 57.6 | 68.2 | 3.0 | 35.7 |
| VATEX | video-to-text | 69.5 | 95.0 | 98.1 | 1.0 | 2.1 |
| VATEX | text-to-video | 49.5 | 79.7 | 87.0 | 2.0 | 9.7 |
Video Retrieval with Full Fine-tuning
| Dataset | Setting | R@1↑ | R@5↑ | R@10↑ | MedR↓ | MeanR↓ |
|---|---|---|---|---|---|---|
| MSRVTT | video-to-text | 57.9 | 79.2 | 86.4 | 1.0 | 7.5 |
| MSRVTT | text-to-video | 55.2 | 79.6 | 87.5 | 1.0 | 10.7 |
| MSVD | video-to-text | 76.3 | 96.8 | 98.7 | 1.0 | 2.1 |
| MSVD | text-to-video | 58.4 | 84.5 | 90.4 | 1.0 | 8.2 |
| LSMDC | video-to-text | 34.9 | 54.6 | 63.1 | 4.0 | 32.9 |
| LSMDC | text-to-video | 34.0 | 53.7 | 62.9 | 4.0 | 38.7 |
| ActivityNet | video-to-text | 62.8 | 86.2 | 93.3 | 1.0 | 3.5 |
| ActivityNet | text-to-video | 62.2 | 85.9 | 93.2 | 1.0 | 3.9 |
| DiDeMo | video-to-text | 59.1 | 81.8 | 89.0 | 1.0 | 7.2 |
| DiDeMo | text-to-video | 57.9 | 82.4 | 88.9 | 1.0 | 9.2 |
| VATEX | video-to-text | 86.0 | 99.2 | 99.6 | 1.0 | 1.3 |
| VATEX | text-to-video | 72.0 | 95.1 | 97.8 | 1.0 | 2.4 |
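For reference, all five metrics can be computed from a video-text similarity matrix. The sketch below is a minimal NumPy illustration (the function name and layout are ours, not taken from the codebase): R@K is the percentage of queries whose ground-truth match ranks in the top K (higher is better), while MedR and MeanR are the median and mean rank of the match (lower is better).

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10, MedR and MeanR from a similarity matrix.

    sim[i, j] is the similarity between query i and candidate j;
    the ground-truth match of query i is assumed to be candidate i.
    """
    order = np.argsort(-sim, axis=1)            # candidates sorted best-first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1  # rank of the true match (1 = best)
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "MedR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }
```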
Requirements:

- CUDA 11.1
- PyTorch 1.8.1
- torchvision 0.9.0
- Python 3.6.9
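A quick way to confirm your environment matches (a minimal sanity-check sketch, not part of the repository):

```python
# Print the installed versions; compare against the list above.
import sys
import torch
import torchvision

print("Python:", sys.version.split()[0])         # expect 3.6.9
print("PyTorch:", torch.__version__)             # expect 1.8.1
print("torchvision:", torchvision.__version__)   # expect 0.9.0
print("CUDA (build):", torch.version.cuda)       # expect 11.1
print("CUDA available:", torch.cuda.is_available())
```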
Download the original datasets: MSR-VTT, MSVD, LSMDC, DiDeMo, ActivityNet, and VATEX.
All annotation files can be downloaded from here. Unzip them and put them in the data/ folder.
python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]
This script compresses each video to 3 fps with a width of 224 (or a height of 224). Modify the variables in the script to customize the output.
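For reference, the per-video operation amounts to a re-encode like the one sketched below (a hypothetical illustration assuming ffmpeg is installed; the actual flags in preprocess/compress_video.py may differ):

```python
# Re-encode one video at 3 fps with its shorter side scaled to 224 px.
import subprocess
from pathlib import Path

def compress_one(src: Path, dst: Path) -> None:
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", str(src),
            "-r", "3",  # resample to 3 fps
            # Scale the shorter side to 224, keeping aspect ratio (even sizes).
            "-vf", "scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)'",
            str(dst),
        ],
        check=True,
    )
```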
The pre-trained ViT weights of CLIP can be found here: ViT-B/32, ViT-B/16, ViT-L/14.
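After downloading, a checkpoint can be sanity-checked before use (a minimal sketch; OpenAI's released CLIP weights are TorchScript archives, and the file name below is illustrative):

```python
# Load a downloaded CLIP checkpoint on CPU and inspect it briefly.
import torch

model = torch.jit.load("ViT-B-32.pt", map_location="cpu")  # path is illustrative
state = model.state_dict()
print(len(state), "parameter tensors")
print(state["visual.conv1.weight"].shape)  # ViT patch-embedding kernel
```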
Our fine-tuned retrieval checkpoints for each dataset will be released soon.
All zero-shot evaluation scripts are provided in the zeroshot_scripts/ folder. Feel free to try different hyper-parameters.
./zeroshot_scripts/eval_msrvtt.sh
All fine-tuning scripts are provided in the finetune_scripts/ folder; the train_${dataset_name}.sh files are used to fine-tune our InternVideo model. Feel free to try different hyper-parameters.
./finetune_scripts/train_msrvtt.sh
All test scripts for evaluating the fine-tuned checkpoints are provided in the eval_finetuned_scripts/ folder. These differ slightly from the zero-shot evaluation scripts.
./eval_finetuned_scripts/eval_finetuned_msrvtt.sh
This project is released under the MIT license.
Our codebase is based on CLIP4Clip.