[AAAI 2021] Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

wzk1015/CNMT

Introduction

Code for our AAAI 2021 paper Confidence-aware Non-repetitive Multimodal Transformers for TextCaps [PDF].

Installation

Our implementation is based on the Pythia framework (now called MMF) and is built upon M4C-Captioner. Please refer to Pythia's documentation for details on installation requirements.

# install pythia based on requirements.txt
python setup.py build develop  

Data Preparation

The following open-source TextCaps data comes from M4C-Captioner's GitHub repository. Please download it from the links below and extract it under the data directory.

Our imdb files include new OCR tokens and recognition confidence scores extracted with pretrained OCR systems (CRAFT, ABCNet and four-stage STR). Download the three imdb files from the links below and put them under data/imdb/.

| file name | download link |
| --- | --- |
| imdb_train.npy | Google Drive, Baidu Netdisk (password: sxbk) |
| imdb_val_filtered_by_image_id.npy | Google Drive, Baidu Netdisk (password: i6pf) |
| imdb_test_filtered_by_image_id.npy | Google Drive, Baidu Netdisk (password: uxew) |

Finally, your data directory structure should look like this:

data
|-detectron
|---...
|-m4c_textvqa_ocr_en_frcn_features
|---...
|-open_images
|---...
|-vocab_textcap_threshold_10.txt    # already provided
|-imdb
|---imdb_train.npy
|---imdb_val_filtered_by_image_id.npy
|---imdb_test_filtered_by_image_id.npy
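
Before starting a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch rather than part of this repository: the function name `missing_data_entries` and the `EXPECTED_ENTRIES` list are hypothetical, and the list simply mirrors the directory tree shown above.

```python
from pathlib import Path

# Entries copied from the directory tree above; adjust if your layout differs.
EXPECTED_ENTRIES = [
    "detectron",
    "m4c_textvqa_ocr_en_frcn_features",
    "open_images",
    "vocab_textcap_threshold_10.txt",
    "imdb/imdb_train.npy",
    "imdb/imdb_val_filtered_by_image_id.npy",
    "imdb/imdb_test_filtered_by_image_id.npy",
]


def missing_data_entries(data_root="data"):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = missing_data_entries()
    if missing:
        print("Missing under data/:")
        for entry in missing:
            print("  " + entry)
    else:
        print("All expected entries found.")
```

The check only tests that each path exists. It does not verify file contents or that the archives were fully extracted.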

Pretrained Model

| download link | description | val set CIDEr | test set CIDEr |
| --- | --- | --- | --- |
| Google Drive, Baidu Netdisk (password: c4be) | CNMT best | 101.6 | 93.0 |

Training

We provide an example script that trains on the TextCaps dataset for 12000 iterations and evaluates every 500 iterations.

./train.sh

This may take approximately 13 hours, depending on your GPU. Please refer to our paper for implementation details.
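
As a rough guide to what these numbers imply, the arithmetic below derives the number of evaluation passes and the average wall-clock time per iteration. It uses only the figures quoted above. The 13-hour total is approximate and includes evaluation time, so the per-iteration value is an average, not a measured training speed.

```python
# Figures quoted in the text above.
total_iterations = 12000
eval_interval = 500
approx_hours = 13  # approximate; depends on the GPU

# One evaluation pass every eval_interval iterations.
num_evaluations = total_iterations // eval_interval

# Average wall-clock seconds per iteration, evaluation time included.
seconds_per_iteration = approx_hours * 3600 / total_iterations

print(num_evaluations)        # 24
print(seconds_per_iteration)  # 3.9
```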

The first training run will download the fastText model. You may also download it manually and put it under pythia/.vector_cache/.

During training, the log file can be found under save/cnmt/m4c_textcaps_cnmt/logs/. You may also run training in the background and check the log file for training status.
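
When training runs in the background, a small helper can print the end of the newest log file. This is a sketch under assumptions: `tail_latest_log` is a hypothetical name, and it treats the most recently modified file in the log directory as the current log, because the log file naming scheme is not described here.

```python
from collections import deque
from pathlib import Path


def tail_latest_log(log_dir="save/cnmt/m4c_textcaps_cnmt/logs", num_lines=20):
    """Return the last num_lines lines of the newest file in log_dir."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return []
    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return []
    # Assumption: the most recently modified file is the active log.
    latest = max(files, key=lambda p: p.stat().st_mtime)
    with latest.open("r", errors="replace") as handle:
        # deque with maxlen keeps only the final num_lines lines.
        return [line.rstrip("\n") for line in deque(handle, maxlen=num_lines)]


if __name__ == "__main__":
    for line in tail_latest_log():
        print(line)
```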

Evaluation

The following assumes that the checkpoint of the trained model is saved at save/cnmt/m4c_textcaps_cnmt/best.ckpt. If it is saved elsewhere, modify the resume_file parameter in the shell script.

Run the following scripts to generate the prediction JSON file:

# evaluate on the validation set
./eval_val.sh
# evaluate on the test set
./eval_test.sh

The prediction JSON file will be saved under save/eval/m4c_textcaps_cnmt/reports/. You can submit the JSON file to the TextCaps EvalAI server to obtain results.
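
Before submitting, a quick sanity check on the prediction file can catch an empty or truncated output. The sketch below is not part of this repository. It assumes, without confirmation from this README, that the file is a JSON list of objects with "image_id" and "caption" keys. The function name `summarize_predictions` is hypothetical.

```python
import json


def summarize_predictions(path):
    """Return (num_entries, num_empty_captions) for a prediction file.

    Assumption: the file holds a JSON list of objects, each with
    "image_id" and "caption" keys. Adjust the keys if yours differ.
    """
    with open(path, "r", encoding="utf-8") as handle:
        predictions = json.load(handle)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")
    empty = 0
    for entry in predictions:
        if "image_id" not in entry or "caption" not in entry:
            raise ValueError(f"entry missing expected keys: {entry!r}")
        if not str(entry["caption"]).strip():
            empty += 1
    return len(predictions), empty
```

Call it with the path of the generated report, for example a file under save/eval/m4c_textcaps_cnmt/reports/. A nonzero count of empty captions, or a total well below the split size, suggests the evaluation run did not complete.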

Citation

@article{wang2020confidenceaware,
  title={Confidence-aware Non-repetitive Multimodal Transformers for TextCaps}, 
  author={Wang, Zhaokai and Bao, Renda and Wu, Qi and Liu, Si},
  year={2020},
  journal={arXiv preprint arXiv:2012.03662},
}
