This example provides a minimal (2k lines) and faithful implementation of the following papers:
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
- Feature Pyramid Networks for Object Detection
- Mask R-CNN
- Cascade R-CNN: Delving into High Quality Object Detection
with support for:
- Multi-GPU / distributed training
- Cross-GPU BatchNorm
- Group Normalization
Dependencies:
- Python 3; OpenCV.
- TensorFlow >= 1.6 (1.4 or 1.5 can run, but may crash due to a TF bug).
- pycocotools: `pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'`
- Pre-trained ImageNet ResNet model from the tensorpack model zoo.
- COCO data, in the following directory structure:
```
COCO/DIR/
  annotations/
    instances_train201?.json
    instances_val201?.json
  train201?/
    COCO_train201?_*.jpg
  val201?/
    COCO_val201?_*.jpg
```
You can use either the 2014 version or the 2017 version of the dataset. To use the common "trainval35k + minival" split for the 2014 dataset, download the annotation files instances_minival2014.json and instances_valminusminival2014.json from here into annotations/ as well.
Note that `train2017==trainval35k==train2014+val2014-minival2014`, and `val2017==minival2014`.
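Before training, a short script along these lines can sanity-check the layout above; `check_coco_dir` is a hypothetical helper for illustration, not part of this repo:

```python
import os

# Hypothetical sanity check (not part of this repo): verify that
# DATA.BASEDIR has the directory layout expected above.
def check_coco_dir(basedir, year="2017"):
    expected = [
        os.path.join(basedir, "annotations", "instances_train%s.json" % year),
        os.path.join(basedir, "annotations", "instances_val%s.json" % year),
        os.path.join(basedir, "train" + year),
        os.path.join(basedir, "val" + year),
    ]
    missing = [p for p in expected if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError("Incomplete COCO layout; missing: " + str(missing))

check_coco_dir("/path/to/COCO/DIR", year="2017")  # or year="2014"
```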
To train on a single machine:
```
./train.py --config \
    MODE_MASK=True MODE_FPN=True \
    DATA.BASEDIR=/path/to/COCO/DIR \
    BACKBONE.WEIGHTS=/path/to/ImageNet-R50-Pad.npz
```
To run distributed training, set `TRAINER=horovod` and refer to the HorovodTrainer docs.
Options can be changed either on the command line or in the config.py file.
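As a rough illustration of what such `KEY=value` overrides do, here is a minimal sketch; the `AttrDict` and `apply_overrides` names are illustrative, not the actual API in config.py:

```python
import ast

# Illustrative sketch (not the repo's actual parser): each "SECTION.KEY=value"
# override replaces the matching attribute in a nested config object, with the
# value parsed as a Python literal when possible.
class AttrDict(dict):
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__

cfg = AttrDict(MODE_MASK=True, MODE_FPN=False,
               TRAIN=AttrDict(LR_SCHEDULE=[240000, 320000, 360000]))

def apply_overrides(cfg, overrides):
    for kv in overrides:
        key, value = kv.split("=", 1)
        *path, leaf = key.split(".")
        node = cfg
        for part in path:           # walk into nested sections, e.g. "TRAIN"
            node = node[part]
        try:
            node[leaf] = ast.literal_eval(value)   # numbers, lists, booleans
        except (ValueError, SyntaxError):
            node[leaf] = value                     # plain strings, e.g. "GN"

apply_overrides(cfg, ["MODE_FPN=True", "TRAIN.LR_SCHEDULE=[150000,230000,280000]"])
print(cfg.MODE_FPN, cfg.TRAIN.LR_SCHEDULE)  # True [150000, 230000, 280000]
```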
Recommended configurations are listed in the table below.
The code is only valid for training with 1, 2, 4, or >=8 GPUs. Training with a number of GPUs other than 8 may result in performance different from the table below.
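If you do train with fewer GPUs, one common adjustment is to stretch the schedule linearly (Goyal et al., "Accurate, Large Minibatch SGD"). The sketch below assumes LR_SCHEDULE is expressed in steps at a total batch size of 8 images, i.e. the 8-GPU convention; that assumption is mine, not stated by this README:

```python
# Sketch under the stated assumption: LR_SCHEDULE steps are defined for a
# total of 8 images per step; with fewer GPUs (hence fewer images per step),
# the linear scaling rule stretches the schedule proportionally.
def scale_schedule(schedule, num_gpus, base_gpus=8):
    factor = base_gpus / num_gpus
    return [int(round(step * factor)) for step in schedule]

print(scale_schedule([240000, 320000, 360000], num_gpus=4))
# -> [480000, 640000, 720000]
```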
To predict on an image (and show output in a window):
```
./train.py --predict input.jpg --load /path/to/model --config SAME-AS-TRAINING
```
To evaluate the performance of a model on COCO:
```
./train.py --evaluate output.json --load /path/to/COCO-R50C4-MaskRCNN-Standard.npz \
    --config SAME-AS-TRAINING
```
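Since pycocotools is already a dependency and the detections in output.json follow the standard COCO results format, such a results file should also be scorable directly with pycocotools; a minimal sketch, with placeholder paths:

```python
# Score a detection results file with pycocotools directly.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("/path/to/COCO/DIR/annotations/instances_minival2014.json")
coco_dt = coco_gt.loadRes("output.json")        # COCO results format
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")  # iouType="segm" for mask mAP
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints mAP@IoU=0.50:0.95 among other metrics
```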
Several trained models (marked with ⬇️ in the table below) can be downloaded. Evaluation and prediction need to be run with the corresponding training configs.
These models are trained on trainval35k and evaluated on minival2014 using mAP@IoU=0.50:0.95. Performance in Detectron can be roughly reproduced. Mask R-CNN results contain both box and mask mAP.
Backbone | mAP (box;mask) | Detectron mAP <sup>1</sup> (box;mask) | Time on 8 V100s | Configurations
---|---|---|---|---
R50-C4 | 33.1 | | 18h | super quick: `MODE_MASK=False FRCNN.BATCH_PER_IM=64 PREPROC.SHORT_EDGE_SIZE=600 PREPROC.MAX_SIZE=1024 TRAIN.LR_SCHEDULE=[150000,230000,280000]`
R50-C4 | 36.6 | 36.5 | 44h | standard: `MODE_MASK=False`
R50-FPN | 37.4 | 37.9 | 29h | standard: `MODE_MASK=False MODE_FPN=True`
R50-C4 | 38.2;33.3 ⬇️ | 37.8;32.8 | 49h | standard: this is the default
R50-FPN | 38.5;35.2 ⬇️ | 38.6;34.5 | 30h | standard: `MODE_FPN=True`
R50-FPN | 42.0;36.3 | | 41h | +Cascade: `MODE_FPN=True FPN.CASCADE=True`
R50-FPN | 39.5;35.2 | 39.5;34.4 <sup>2</sup> | 33h | +ConvGNHead: `MODE_FPN=True FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head`
R50-FPN | 40.0;36.2 ⬇️ | 40.3;35.7 | 40h | +GN: `MODE_FPN=True FPN.NORM=GN BACKBONE.NORM=GN FPN.FRCNN_HEAD_FUNC=fastrcnn_4conv1fc_gn_head FPN.MRCNN_HEAD_FUNC=maskrcnn_up4conv_gn_head`
R101-C4 | 41.4;35.2 ⬇️ | | 60h | standard: `BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]`
R101-FPN | 40.4;36.6 ⬇️ | 40.9;36.4 | 38h | standard: `MODE_FPN=True BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3]`
R101-FPN | 46.5;40.1 ⬇️ <sup>3</sup> | | 73h | +++: `MODE_FPN=True FPN.CASCADE=True BACKBONE.RESNET_NUM_BLOCK=[3,4,23,3] TEST.RESULT_SCORE_THRESH=1e-4 PREPROC.TRAIN_SHORT_EDGE_SIZE=[640,800] TRAIN.LR_SCHEDULE=[420000,500000,540000]`
1: Here we compare models that have identical training & inference cost between the two implementations. Their numbers still differ due to many small implementation details.
2: Numbers taken from the Group Normalization paper.
3: Our mAP is 10+ points better than the official model in matterport/Mask_RCNN with the same R101-FPN backbone.
NOTES.md has some notes about implementation details & speed.