Kashu Yamazaki · Khoa Vo · Sang Truong · Bhiksha Raj · Ngan Le
Paper | arXiv | Project Page
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in accuracy and diversity.
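As a rough illustration of the VL contrastive idea only (not the exact formulation used in VLTinT), a symmetric InfoNCE-style objective over a batch of matched event/caption embedding pairs can be sketched as follows; the function name, tensor shapes, and temperature value are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def vl_contrastive_loss(event_emb: torch.Tensor,
                        caption_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between event and caption embeddings.

    event_emb:   (B, D) fused visual-linguistic features, one per event.
    caption_emb: (B, D) sentence embeddings of the matching captions.
    """
    # Cosine similarity between every event and every caption in the batch.
    event_emb = F.normalize(event_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = event_emb @ caption_emb.t() / temperature  # (B, B)

    # The i-th event matches the i-th caption; all other pairs are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)       # event -> caption
    loss_t2v = F.cross_entropy(logits.t(), targets)   # caption -> event
    return 0.5 * (loss_v2t + loss_t2v)
```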
- Clone this repository
git clone https://github.com/UARK-AICV/VLTinT.git
cd VLTinT
- Prepare Conda environment
conda env create -f environment.yml
conda activate pytorch
- Add the project root to `PYTHONPATH`. Note that you need to do this each time you start a new session.
source setup.sh
We assume the following file structure after this preparation.
If you want to change the file structure, please modify `data_path` in `src/rtransformer/recursive_caption_dataset.py`.
cache
|- anet_vocab_clip.pt
|- anet_word2idx.json
|- yc2_vocab_clip.pt
|_ yc2_word2idx.json
data
|- anet
| |- c3d_env
| |- c3d_agent
| |_ clip_b16
|   |- lang_feature
|   |_ sent_feature
densevid_eval
preprocess
scripts
src
video_feature
|- anet_duration_frame.csv
|_ yc2_duration_frame.csv
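If you want to sanity-check the layout before training, a small hypothetical helper (not part of the repository) that verifies the expected ActivityNet paths from the structure above could look like this:

```python
# Hypothetical sanity check (not part of the repo): verify that the expected
# cache/data/feature paths from the structure above exist before training.
from pathlib import Path

REQUIRED = [
    "cache/anet_vocab_clip.pt",
    "data/anet/c3d_env",
    "data/anet/c3d_agent",
    "data/anet/clip_b16/lang_feature",
    "data/anet/clip_b16/sent_feature",
    "video_feature/anet_duration_frame.csv",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
print("All ActivityNet paths present." if not missing else f"Missing: {missing}")
```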
Our features extracted from the rescaled videos of ActivityNet-1.3 can be downloaded below:
You can use our preprocessed features above or process the videos yourself as follows:
1. Download data
- Download the raw videos of ActivityNet and YouCook2 and convert all of them into `mp4` for later processing (you need `ffmpeg` for the script below).
python preprocess/convert_to_mp4.py --video-root path/to/video/dir --output-root path/to/dir/*.mp4
- Rescale each video to 1600 frames and extract the middle frame of every 16 frames, so 100 middle frames are extracted per video (see the sketch after the command below).
python preprocess/rescale_video.py --video-root path/to/dir/*.mp4 --output-root path/to/dir/rescaled --frame-dir path/to/dir/middle_frames
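The arithmetic behind this step: 1600 frames are grouped into 100 non-overlapping windows of 16 frames, and the middle frame of each window is kept. A minimal sketch of how those middle-frame indices can be computed is given below; the exact sampling in `preprocess/rescale_video.py` may differ in its details.

```python
# Indices of the middle frame of each 16-frame window in a 1600-frame video.
# This mirrors the sampling described above; the real script may differ slightly.
NUM_FRAMES = 1600
WINDOW = 16

middle_indices = [start + WINDOW // 2 for start in range(0, NUM_FRAMES, WINDOW)]
assert len(middle_indices) == 100  # 1600 / 16 = 100 middle frames
print(middle_indices[:5])  # [8, 24, 40, 56, 72]
```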
2. Env feature extraction
- To extract the visual features from the rescaled videos, we use the SlowFast repository cloned below. Then, run the following command to extract the env features.
git clone https://github.com/vhvkhoa/SlowFast
cd SlowFast
python setup.py build develop
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_env TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.CHECKPOINT_TYPE caffe2 TEST.BATCH_SIZE 1 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 DATA_LOADER.NUM_WORKERS 0
3. Agent feature extraction
- To extract the agent features, we use Detectron2 for bounding-box detection. Then follow the commands below to extract the agent features.
git clone https://github.com/vhvkhoa/detectron2
python -m pip install -e detectron2
wget https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_101_FPN_3x/137851257/model_final_f6e8b1.pkl
python tools/bbox_extract.py path/to/dir/rescaled path/to/dir/bbox --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml --sampling-rate 16 --target-frames 100 --opts MODEL.WEIGHTS model_final_f6e8b1.pkl
cd SlowFast
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_agent MODEL.NUM_CLASSES 200 TEST.CHECKPOINT_TYPE caffe2 TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.BATCH_SIZE 1 DATA.PATH_TO_BBOX_DIR path/to/dir/bbox DETECTION.ENABLE True DETECTION.SPATIAL_SCALE_FACTOR 32 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 RESNET.SPATIAL_STRIDES [[1],[2],[2],[1]] RESNET.SPATIAL_DILATIONS [[1],[1],[1],[2]] DATA.PATH_TO_TMP_DIR /tmp/agent_0/
4. Lang feature extraction
- To extract the linguistic features from those videos, run the following commands. Change `--dset_name` to `anet` or `yc2` to specify the dataset (a minimal illustration of the CLIP-based sentence features is given at the end of this step).
python preprocess/build_lang_feat_vocab.py --dset_name [anet/yc2]
python preprocess/extract_lang_feat.py --frame-root path/to/dir/middle_frames --output-root path/to/data/[anet/yc2]/clip_b16/lang_feature --dset_name [anet/yc2]
python preprocess/extract_sent_feat.py --caption_root ./densevid_eval/[anet/yc2]_data/train.json --output_root path/to/data/[anet/yc2]/clip_b16/sent_features
Execute the following commands to create a vocabulary for the model. Change `--dset_name` to `anet` or `yc2` to specify the dataset.
python preprocess/build_vocab.py --dset_name [anet/yc2] --min_word_count 1
python preprocess/extract_vocab_emb.py --dset_name [anet/yc2]
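For reference, the sentence features stored under `clip_b16/sent_feature` are CLIP-based text embeddings (the directory name suggests the ViT-B/16 variant). A minimal sketch of extracting such sentence features with the openai/CLIP package is shown below; the exact pre- and post-processing in `preprocess/extract_sent_feat.py` may differ, and the example sentences are hypothetical.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 text encoder, as the clip_b16 directory name suggests.
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical example sentences; the real script reads captions from the JSON files.
sentences = ["A man is playing the guitar on stage.",
             "He then bows to the audience."]

with torch.no_grad():
    tokens = clip.tokenize(sentences, truncate=True).to(device)
    sent_feats = model.encode_text(tokens)  # (num_sentences, 512) for ViT-B/16
    sent_feats = sent_feats / sent_feats.norm(dim=-1, keepdim=True)

print(sent_feats.shape)
```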
To train our TinT model on ActivityNet Captions:
bash scripts/train.sh [anet/yc2] --use_env --use_agent --use_lang --use_tint
Here you can specify the dataset (ActivityNet: `anet` or YouCook2: `yc2`), whether to use each feature (env, agent, and lang), and whether to use the proposed TinT decoder (if turned off, the model runs with the MART decoder).
The training log and model will be saved at `results/anet_re_*`.
Once you have a trained model, you can follow the instructions below to generate captions.
- Generate captions
bash scripts/translate_greedy.sh anet_re_* [val/test]
Replace `anet_re_*` with your own model directory name.
The generated captions are saved at `results/anet_re_*/greedy_pred_[val/test].json`.
- Evaluate generated captions
bash scripts/eval.sh anet [val/test] results/anet_re_*/greedy_pred_[val/test].json
The results should be comparable to those reported in the paper.
To visualize the result with the video (mp4):
python visualization/demo.py --input_mp4_folder /path/to/folder/contains/mp4 --output_mp4_folder /path/to/output/ --caption_file results/anet_re_*/greedy_pred_val.json --video_id v_5qsXmDi8d74
If you find this code useful for your research, please cite our papers:
@article{kashu_vltint,
title={VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning},
volume={37},
url={https://ojs.aaai.org/index.php/AAAI/article/view/25412},
DOI={10.1609/aaai.v37i3.25412},
number={3},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Yamazaki, Kashu and Vo, Khoa and Truong, Quang Sang and Raj, Bhiksha and Le, Ngan},
year={2023},
month={Jun.},
pages={3081-3090}
}
@INPROCEEDINGS{kashu_vlcap,
author={Yamazaki, Kashu and Truong, Sang and Vo, Khoa and Kidd, Michael and Rainwater, Chase and Luu, Khoa and Le, Ngan},
booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
title={VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning},
year={2022},
pages={3656-3661},
doi={10.1109/ICIP46576.2022.9897766}
}
We acknowledge the following open-source projects, on which our work is based:
Please feel free to create an issue on this repository for questions, comments, and bug reports. For other inquiries, send an email to Kashu Yamazaki.