Code for our AAAI 2021 paper "Confidence-aware Non-repetitive Multimodal Transformers for TextCaps" [PDF].
Our implementation is based on the Pythia framework (now called mmf) and built upon M4C-Captioner. Please refer to Pythia's documentation for installation requirements.
```
# install pythia based on requirements.txt
python setup.py build develop
```
The following open-source data of the TextCaps dataset comes from M4C-Captioner's GitHub repository. Please download the files from the links below and extract them under the `data` directory.
- Object Faster R-CNN features of TextCaps
- OCR Faster R-CNN features of TextCaps
- Detectron weights of TextCaps
Our imdb files include new OCR tokens and recognition confidences extracted with pretrained OCR systems (CRAFT, ABCNet, and four-stage STR). The three imdb files should be downloaded from the links below and placed under `data/imdb/` (a quick way to inspect them is sketched after the table).
file name | download link
---|---
`imdb_train.npy` | Google Drive / Baidu Netdisk (password: sxbk)
`imdb_val_filtered_by_image_id.npy` | Google Drive / Baidu Netdisk (password: i6pf)
`imdb_test_filtered_by_image_id.npy` | Google Drive / Baidu Netdisk (password: uxew)
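For reference, here is a minimal sketch for peeking at an imdb file. The exact field names (e.g. `ocr_tokens`, `ocr_confidence`) follow the M4C-style imdb convention and are assumptions; print the keys of an entry to see what your copy actually contains.

```python
import numpy as np

# M4C-style imdbs are numpy object arrays; entry 0 is usually metadata,
# the remaining entries are per-caption annotation dicts.
imdb = np.load("data/imdb/imdb_train.npy", allow_pickle=True)

entry = imdb[1]
print(sorted(entry.keys()))  # inspect the actual field names

# Assumed fields (verify against the printed keys before relying on them):
# print(entry["ocr_tokens"])      # OCR tokens from CRAFT / ABCNet / STR
# print(entry["ocr_confidence"])  # per-token recognition confidence
```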
Finally, your `data` directory structure should look like this:
```
data
|-detectron
|---...
|-m4c_textvqa_ocr_en_frcn_features
|---...
|-open_images
|---...
|-vocab_textcap_threshold_10.txt  # already provided
|-imdb
|---imdb_train.npy
|---imdb_val_filtered_by_image_id.npy
|---imdb_test_filtered_by_image_id.npy
```
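As a quick sanity check that everything landed in the right place, a minimal sketch (the paths are exactly those listed above):

```python
import os

# Expected files and directories from the layout above.
expected = [
    "data/detectron",
    "data/m4c_textvqa_ocr_en_frcn_features",
    "data/open_images",
    "data/vocab_textcap_threshold_10.txt",
    "data/imdb/imdb_train.npy",
    "data/imdb/imdb_val_filtered_by_image_id.npy",
    "data/imdb/imdb_test_filtered_by_image_id.npy",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:>7}  {path}")
```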
We also provide a pretrained CNMT model:

download link | description | val set CIDEr | test set CIDEr
---|---|---|---
Google Drive / Baidu Netdisk (password: c4be) | CNMT best | 101.6 | 93.0
We provide an example script that trains on the TextCaps dataset for 12,000 iterations, evaluating every 500 iterations:

```
./train.sh
```

Training may take approximately 13 hours, depending on your GPU devices. Please refer to our paper for implementation details.
First-time training will download the fasttext model. You may also download it manually and put it under `pythia/.vector_cache/`.
During training, the log file can be found under `save/cnmt/m4c_textcaps_cnmt/logs/`. You may also run training in the background and check the log file for training status.
The following assumes the checkpoint of the trained model is saved at `save/cnmt/m4c_textcaps_cnmt/best.ckpt` (otherwise, modify the `resume_file` parameter in the shell scripts).
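To verify that the checkpoint file loads before running evaluation, a minimal sketch (the top-level key layout is the usual Pythia checkpoint convention and is an assumption; print the keys to check):

```python
import torch

# Load on CPU just to verify the file and inspect its contents.
ckpt = torch.load("save/cnmt/m4c_textcaps_cnmt/best.ckpt", map_location="cpu")

# Pythia checkpoints are dicts; the top-level keys typically include the
# model state dict plus optimizer/trainer state.
print(sorted(ckpt.keys()))
```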
Run the following scripts to generate the prediction JSON files:

```
# evaluate on the validation set
./eval_val.sh
# evaluate on the test set
./eval_test.sh
```
The prediction JSON file will be saved under `save/eval/m4c_textcaps_cnmt/reports/`. You can submit it to the TextCaps EvalAI server for results.
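To sanity-check a report before submitting, a minimal sketch (the file name below is a placeholder; pick the actual report from the `reports/` directory, and note that the `image_id`/`caption` keys are the expected TextCaps EvalAI submission format, assumed here):

```python
import json

# Placeholder path: substitute the actual report file name.
report_path = "save/eval/m4c_textcaps_cnmt/reports/textcaps_run_val.json"

with open(report_path) as f:
    predictions = json.load(f)

print(f"{len(predictions)} predictions")
for pred in predictions[:3]:
    # Each entry is expected to carry an image id and the generated caption.
    print(pred["image_id"], "->", pred["caption"])
```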
If you find our work useful, please cite:

```
@article{wang2020confidenceaware,
  title={Confidence-aware Non-repetitive Multimodal Transformers for TextCaps},
  author={Wang, Zhaokai and Bao, Renda and Wu, Qi and Liu, Si},
  journal={arXiv preprint arXiv:2012.03662},
  year={2020},
}
```