Kashu Yamazaki · Khoa Vo · Sang Truong · Bhiksha Raj · Ngan Le
Paper | arXiv | Project Page
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in accuracy and diversity.
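As a rough illustration of the VL contrastive idea only (not the exact formulation used in VLTinT), a symmetric InfoNCE-style objective over a batch of matched event/caption embedding pairs can be sketched as follows; the function name, tensor shapes, and temperature value are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def vl_contrastive_loss(event_emb: torch.Tensor,
                        caption_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between event and caption embeddings.

    event_emb:   (B, D) fused visual-linguistic features, one per event.
    caption_emb: (B, D) sentence embeddings of the matching captions.
    """
    # Cosine similarity between every event and every caption in the batch.
    event_emb = F.normalize(event_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = event_emb @ caption_emb.t() / temperature  # (B, B)

    # The i-th event matches the i-th caption; all other pairs are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)       # event -> caption
    loss_t2v = F.cross_entropy(logits.t(), targets)   # caption -> event
    return 0.5 * (loss_v2t + loss_t2v)
```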
- Clone this repository
git clone https://github.com/UARK-AICV/VLTinT.git
cd VLTinT
- Prepare Conda environment
conda env create -f environment.yml
conda activate pytorch
- Add the project root to `PYTHONPATH`. Note that you need to do this each time you start a new session.
source setup.sh
We assume the following file structure after this preparation.
If you want to change the file structure, please modify `data_path` in `src/rtransformer/recursive_caption_dataset.py`.
cache
|- anet_vocab_clip.pt
|- anet_word2idx.json
|- yc2_vocab_clip.pt
|_ yc2_word2idx.json
data
|- anet
| |- c3d_env
| |- c3d_agent
| |_ clip_b16
|   |- lang_feature
|   |_ sent_feature
densevid_eval
preprocess
scripts
src
video_feature
|- anet_duration_frame.csv
|_ yc2_duration_frame.csv
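If you want to sanity-check the layout before training, a small hypothetical helper (not part of the repository) that verifies the expected ActivityNet paths from the structure above could look like this:

```python
# Hypothetical sanity check (not part of the repo): verify that the expected
# cache/data/feature paths from the structure above exist before training.
from pathlib import Path

REQUIRED = [
    "cache/anet_vocab_clip.pt",
    "data/anet/c3d_env",
    "data/anet/c3d_agent",
    "data/anet/clip_b16/lang_feature",
    "data/anet/clip_b16/sent_feature",
    "video_feature/anet_duration_frame.csv",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
print("All ActivityNet paths present." if not missing else f"Missing: {missing}")
```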
Our features extracted from the rescaled videos of ActivityNet-1.3 can be downloaded below:
You can use our preprocessed features above or process the videos yourself as follows:
1. Download data
- Download the raw videos of ActivityNet and YouCook2 and convert all of them into `mp4` for later processing (you need `ffmpeg` for the script below).
python preprocess/convert_to_mp4.py --video-root path/to/video/dir --output-root path/to/dir/*.mp4
- Rescale each video to 1600 frames and extract the middle frame of every 16 frames, so 100 middle frames are extracted per video (see the sketch after the command below).
python preprocess/rescale_video.py --video-root path/to/dir/*.mp4 --output-root path/to/dir/rescaled --frame-dir path/to/dir/middle_frames
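The arithmetic behind this step: 1600 frames are grouped into 100 non-overlapping windows of 16 frames, and the middle frame of each window is kept. A minimal sketch of how those middle-frame indices can be computed is given below; the exact sampling in `preprocess/rescale_video.py` may differ in its details.

```python
# Indices of the middle frame of each 16-frame window in a 1600-frame video.
# This mirrors the sampling described above; the real script may differ slightly.
NUM_FRAMES = 1600
WINDOW = 16

middle_indices = [start + WINDOW // 2 for start in range(0, NUM_FRAMES, WINDOW)]
assert len(middle_indices) == 100  # 1600 / 16 = 100 middle frames
print(middle_indices[:5])  # [8, 24, 40, 56, 72]
```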
2. Env feature extraction
- To extract the visual features from the rescaled videos, we use the SlowFast repository cloned below. Then, run the following command to extract the env features.
git clone https://github.com/vhvkhoa/SlowFast
cd SlowFast
python setup.py build develop
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_env TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.CHECKPOINT_TYPE caffe2 TEST.BATCH_SIZE 1 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 DATA_LOADER.NUM_WORKERS 0
3. Agent feature extraction
- To extract the agent features, we use Detectron2 for bounding-box detection. Then follow the commands below to extract the agent features.
git clone https://github.com/vhvkhoa/detectron2
python -m pip install -e detectron2
wget https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_101_FPN_3x/137851257/model_final_f6e8b1.pkl
python tools/bbox_extract.py path/to/dir/rescaled path/to/dir/bbox --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml --sampling-rate 16 --target-frames 100 --opts MODEL.WEIGHTS model_final_f6e8b1.pkl
cd SlowFast
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_agent MODEL.NUM_CLASSES 200 TEST.CHECKPOINT_TYPE caffe2 TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.BATCH_SIZE 1 DATA.PATH_TO_BBOX_DIR path/to/dir/bbox DETECTION.ENABLE True DETECTION.SPATIAL_SCALE_FACTOR 32 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 RESNET.SPATIAL_STRIDES [[1],[2],[2],[1]] RESNET.SPATIAL_DILATIONS [[1],[1],[1],[2]] DATA.PATH_TO_TMP_DIR /tmp/agent_0/
4. Lang feature extraction
- To extract the linguistic features from those videos, run the following commands. Change `--dset_name` to `anet` or `yc2` to specify the dataset (a minimal illustration of the CLIP-based sentence features is given at the end of this step).
python preprocess/build_lang_feat_vocab.py --dset_name [anet/yc2]
python preprocess/extract_lang_feat.py --frame-root path/to/dir/middle_frames --output-root path/to/data/[anet/yc2]/clip_b16/lang_feature --dset_name [anet/yc2]
python preprocess/extract_sent_feat.py --caption_root ./densevid_eval/[anet/yc2]_data/train.json --output_root path/to/data/[anet/yc2]/clip_b16/sent_features
Execute the following commands to create a vocabulary for the model. Change `--dset_name` to `anet` or `yc2` to specify the dataset.
python preprocess/build_vocab.py --dset_name [anet/yc2] --min_word_count 1
python preprocess/extract_vocab_emb.py --dset_name [anet/yc2]
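For reference, the sentence features stored under `clip_b16/sent_feature` are CLIP-based text embeddings (the directory name suggests the ViT-B/16 variant). A minimal sketch of extracting such sentence features with the openai/CLIP package is shown below; the exact pre- and post-processing in `preprocess/extract_sent_feat.py` may differ, and the example sentences are hypothetical.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 text encoder, as the clip_b16 directory name suggests.
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical example sentences; the real script reads captions from the JSON files.
sentences = ["A man is playing the guitar on stage.",
             "He then bows to the audience."]

with torch.no_grad():
    tokens = clip.tokenize(sentences, truncate=True).to(device)
    sent_feats = model.encode_text(tokens)  # (num_sentences, 512) for ViT-B/16
    sent_feats = sent_feats / sent_feats.norm(dim=-1, keepdim=True)

print(sent_feats.shape)
```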
To train our TinT model on ActivityNet Captions:
bash scripts/train.sh [anet/yc2] --use_env --use_agent --use_lang --use_tint
Here you can specify the dataset (ActivityNet: `anet` or YouCook2: `yc2`), whether to use each feature (env, agent, and lang), and whether to use the proposed TinT decoder (if turned off, the model runs with the MART decoder).
The training log and model will be saved at `results/anet_re_*`.
Once you have a trained model, you can follow the instructions below to generate captions.
- Generate captions
bash scripts/translate_greedy.sh anet_re_* [val/test]
Replace `anet_re_*` with your own model directory name.
The generated captions are saved at `results/anet_re_*/greedy_pred_[val/test].json`.
- Evaluate generated captions
bash scripts/eval.sh anet [val/test] results/anet_re_*/greedy_pred_[val/test].json
The results should be comparable to those reported in the paper.
To visualize the result with the video (mp4):
python visualization/demo.py --input_mp4_folder /path/to/folder/contains/mp4 --output_mp4_folder /path/to/output/ --caption_file results/anet_re_*/greedy_pred_val.json --video_id v_5qsXmDi8d74
If you find this code useful for your research, please cite our papers:
@article{kashu_vltint,
title={VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning},
volume={37},
url={https://ojs.aaai.org/index.php/AAAI/article/view/25412},
DOI={10.1609/aaai.v37i3.25412},
number={3},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Yamazaki, Kashu and Vo, Khoa and Truong, Quang Sang and Raj, Bhiksha and Le, Ngan},
year={2023},
month={Jun.},
pages={3081-3090}
}
@INPROCEEDINGS{kashu_vlcap,
author={Yamazaki, Kashu and Truong, Sang and Vo, Khoa and Kidd, Michael and Rainwater, Chase and Luu, Khoa and Le, Ngan},
booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
title={VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning},
year={2022},
pages={3656-3661},
doi={10.1109/ICIP46576.2022.9897766}
}
We acknowledge the following open-source projects, on which our work is based:
Please feel free to create an issue on this repository for questions, comments, and bug reports. For other inquiries, send an email to Kashu Yamazaki.