Paper | Supplementary Material
Haopeng Li, Qiuhong Ke, Mingming Gong, Tom Drummond
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task.
Specifically, the self-supervised learning is conducted by exploring the semantic consistency between the videos and text in both coarse-grained and fine-grained fashions, as well as recovering masked frames in the videos.
The multimodal framework is trained on a newly-collected dataset that consists of video-text pairs.
Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
- python=3.8.13
- pytorch=1.12
- ortools=9.3.10497
- pytorch-lightning=1.6.5
- pytorch-transformers=1.2.0
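As a quick sanity check before training, the pinned versions above can be compared against the active environment. The script below is only a sketch: the pip distribution names (for example torch for PyTorch) are assumptions rather than something this repository specifies, and the helper names version_matches and check_requirements are made up for illustration.

```python
from importlib import metadata

# Versions copied from the requirement list above.
# ASSUMPTION: these are the pip distribution names; adjust them if your
# environment was created with conda or uses different package names.
REQUIRED = {
    "torch": "1.12",
    "ortools": "9.3.10497",
    "pytorch-lightning": "1.6.5",
    "pytorch-transformers": "1.2.0",
}


def version_matches(installed: str, wanted: str) -> bool:
    """True if `installed` equals `wanted` or refines it (1.12.1 matches 1.12)."""
    installed = installed.split("+")[0]  # drop local build tags such as "+cu113"
    return installed == wanted or installed.startswith(wanted + ".")


def check_requirements(required=REQUIRED):
    """Return a {package: status} report for the given pinned versions."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_matches(installed, wanted):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```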
Download the pretrained model to the root directory.
OR
Follow the steps below to train the self-supervised model yourself.
Download the visual features and text information embeddings of the YTVT dataset and uncompress them to ssl/features/ and ssl/info_embed/, respectively.
Run the following command in ssl/ to train the self-supervised model:
$ CUDA_VISIBLE_DEVICES=0,1 python main_ssl.py --config ssl.yaml
The trained model is saved in ssl/results/SSL/checkpoints/.
Download the data and uncompress it to data/.
Run the following command in the root directory to train the video summarization model:
$ sh main.sh CFG_FILE
where CFG_FILE is a configuration file (*.yaml) for different settings. We provide several configuration files in cfgs/. Here is an example for training the model on SumMe in the augmented setting:
$ sh main.sh cfgs/sm_a.yaml
If you pretrain the model yourself, change resume in CFG_FILE to the path of the model saved in ssl/results/SSL/checkpoints/. The results of video summarization are recorded in records.csv.
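Editing resume by hand is usually enough, but the change can also be scripted. The sketch below assumes that resume is a top-level key written on a single line of the YAML file, which this README does not confirm; the helper name set_resume and the checkpoint file name model.ckpt are hypothetical.

```python
import re


def set_resume(cfg_text: str, ckpt_path: str) -> str:
    """Return `cfg_text` with the top-level `resume:` entry set to `ckpt_path`.

    ASSUMPTION: `resume` is a top-level, single-line key. If no such line
    exists, a new one is appended at the end of the file.
    """
    pattern = re.compile(r"^resume\s*:.*$", flags=re.MULTILINE)
    new_line = f"resume: {ckpt_path}"
    if pattern.search(cfg_text):
        # A function replacement avoids backslash escapes in Windows paths.
        return pattern.sub(lambda _: new_line, cfg_text, count=1)
    sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return f"{cfg_text}{sep}{new_line}\n"


if __name__ == "__main__":
    example = "lr: 0.001\nresume: old.ckpt\n"
    # "model.ckpt" is a placeholder; use the file your own training run produced.
    print(set_resume(example, "ssl/results/SSL/checkpoints/model.ckpt"))
```

To apply it to a real file, read the configuration with open(), pass the text through set_resume, and write the result back.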
We provide the original videos and text information of YTVT here. We also provide the re-collected text information of SumMe and TVSum here.
The use of this code is RESTRICTED to non-commercial research and educational purposes.
If you use this code or reference our paper in your work, please cite this publication as:
@inproceedings{li2023progressive,
title={Progressive Video Summarization via Multimodal Self-supervised Learning},
author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Drummond, Tom},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={5584--5593},
year={2023}
}
The code is developed based on VASNet.