Detectron Model Zoo and Baselines

Introduction

This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.

Common Settings and Notes

  • All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
  • All baselines were trained using data-parallel synchronous SGD across 8 GPUs with a minibatch size of either 8 or 16 images (the im/gpu column gives images per GPU: 1 or 2).
  • For training, only horizontal flipping data augmentation was used.
  • For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
  • All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
  • All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
  • Inference times are often expressed as "X + Y", in which X is time taken in reasonably well-optimized GPU code and Y is time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
  • Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
  • The model id column is provided for ease of reference.
  • To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash.
  • All models and results below are on the COCO dataset.
  • Baseline models and results for the Cityscapes dataset are coming soon!
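The integrity check described above can be sketched in a few lines of Python. This is a minimal sketch, not part of Detectron: the helper names are hypothetical, and it assumes the downloaded .md5sum file begins with the hex digest (optionally followed by a file name, as the `md5sum` tool prints).

```python
import hashlib


def md5_url(download_url):
    """Return the URL of the md5 hash file for a download URL on this page."""
    return download_url + ".md5sum"


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5sum(path, md5sum_text):
    """Compare a local file against the text of its downloaded .md5sum file.

    Assumption: the first whitespace-separated token is the hex digest.
    """
    expected = md5sum_text.split()[0].lower()
    return file_md5(path) == expected
```

Fetching the two URLs is left to whatever download tool is at hand; only the comparison is shown here.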

Training Schedules

We use three training schedules, indicated by the lr schd column in the tables below.

  • 1x: For minibatch size 16, this schedule starts at a LR of 0.02, multiplies it by 0.1 after 60k and again after 80k iterations, and terminates at 90k iterations. This schedule results in 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
  • 2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
  • s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, but also extends the duration of the first learning rate. With a minibatch size of 16, it multiplies the LR by 0.1 at 100k and 120k iterations and ends after 130k iterations.

All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
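The arithmetic behind these schedules can be sketched as follows. This is an illustrative sketch with hypothetical names, not Detectron's configuration code. It assumes the minibatch adjustment is a plain linear scaling (halving the minibatch halves the base LR and doubles the iteration counts), and the starting factor of the warm-up (`warmup_factor`) is an assumption, since the text only says the warm-up is linear.

```python
COCO_TRAIN_IMAGES = 118287
REFERENCE_MINIBATCH = 16

# (LR decay points, final iteration) at minibatch size 16, as listed above.
SCHEDULES = {
    "1x": ((60000, 80000), 90000),
    "2x": ((120000, 160000), 180000),
    "s1x": ((100000, 120000), 130000),
}


def schedule(name, minibatch_size, base_lr=0.02):
    """Scale a schedule from the reference minibatch size to another one."""
    steps, max_iter = SCHEDULES[name]
    k = REFERENCE_MINIBATCH / minibatch_size  # 1.0 for 16 images, 2.0 for 8
    return {
        "base_lr": base_lr / k,
        "steps": tuple(int(round(s * k)) for s in steps),
        "max_iter": int(round(max_iter * k)),
    }


def epochs(max_iter, minibatch_size, num_images=COCO_TRAIN_IMAGES):
    """Number of passes over the training set."""
    return max_iter * minibatch_size / num_images


def lr_at(iteration, sched, warmup_iters=500, warmup_factor=1.0 / 3.0, gamma=0.1):
    """Learning rate at a given iteration: step decay plus linear warm-up."""
    lr = sched["base_lr"] * gamma ** sum(iteration >= s for s in sched["steps"])
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        lr *= warmup_factor * (1 - alpha) + alpha
    return lr


def total_hours(sec_per_iter, max_iter):
    """Total training time in hours from the per-iteration time."""
    return sec_per_iter * max_iter / 3600.0
```

Under these assumptions, `epochs(90000, 16)` gives about 12.17, matching the figure quoted for the 1x schedule, and `schedule("1x", 8)` yields a base LR of 0.01 with 180k iterations. The same relation links the "train time (s/iter)" and "train time total (hr)" columns: for example, 0.187 s/iter over 90k iterations is about 4.7 hours.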

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

ImageNet Pretrained Models

The backbone models pretrained on ImageNet are available in the format used by Detectron. Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.

  • R-50.pkl: converted copy of MSRA's original ResNet-50 model
  • R-101.pkl: converted copy of MSRA's original ResNet-101 model
  • X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
  • X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
  • X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)

Log Files

Training and inference logs are available for most models in the model zoo.

Proposal, Box, and Mask Detection Baselines

RPN Proposal Baselines

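| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | RPN | 1x | 2 | 4.3 | 0.187 | 4.7 | 0.113 | - | - | - | 51.6 | 35998355 | model / props: 1, 2, 3 |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.416 | 10.4 | 0.080 | - | - | - | 57.2 | 35998814 | model / props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.503 | 12.6 | 0.108 | - | - | - | 58.2 | 35998887 | model / props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.395 | 34.9 | 0.292 | - | - | - | 59.4 | 35998956 | model / props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.102 | 27.6 | 0.222 | - | - | - | 59.5 | 36760102 | model / props: 1, 2, 3 |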

Notes:

  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.
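To make the "prop. AR" metric concrete, the following is a simplified, single-image sketch of average recall at a proposal limit. It is not the evaluation code used to produce the table. It assumes recall is averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05 (the usual COCO convention), and it counts a ground-truth box as recalled when its best-overlapping proposal reaches the threshold, without one-to-one matching.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(gt_boxes, proposals, limit=1000):
    """Average recall of the top `limit` proposals (sorted by descending score)."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]  # assumed IoU thresholds
    top = proposals[:limit]
    # Best IoU achieved by any kept proposal, for each ground-truth box.
    best = [max((iou(g, p) for p in top), default=0.0) for g in gt_boxes]
    if not best:
        return 0.0
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)
```

A dataset-level number would pool the ground-truth boxes of all images before averaging; the per-image form is kept here only for brevity.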

Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Fast | 1x | 1 | 6.0 | 0.456 | 22.8 | 0.241 + 0.003 | 34.4 | - | - | - | 36224013 | model / boxes |
| R-50-C4 | Fast | 2x | 1 | 6.0 | 0.453 | 45.3 | 0.241 + 0.003 | 35.6 | - | - | - | 36224046 | model / boxes |
| R-50-FPN | Fast | 1x | 2 | 6.0 | 0.285 | 7.1 | 0.076 + 0.004 | 36.4 | - | - | - | 36225147 | model / boxes |
| R-50-FPN | Fast | 2x | 2 | 6.0 | 0.287 | 14.4 | 0.077 + 0.004 | 36.8 | - | - | - | 36225249 | model / boxes |
| R-101-FPN | Fast | 1x | 2 | 7.7 | 0.448 | 11.2 | 0.102 + 0.003 | 38.5 | - | - | - | 36228880 | model / boxes |
| R-101-FPN | Fast | 2x | 2 | 7.7 | 0.449 | 22.5 | 0.103 + 0.004 | 39.0 | - | - | - | 36228933 | model / boxes |
| X-101-64x4d-FPN | Fast | 1x | 1 | 6.3 | 0.994 | 49.7 | 0.292 + 0.003 | 40.4 | - | - | - | 36226250 | model / boxes |
| X-101-64x4d-FPN | Fast | 2x | 1 | 6.3 | 0.980 | 98.0 | 0.291 + 0.003 | 39.8 | - | - | - | 36226326 | model / boxes |
| X-101-32x8d-FPN | Fast | 1x | 1 | 6.4 | 0.721 | 36.1 | 0.217 + 0.003 | 40.6 | - | - | - | 37119777 | model / boxes |
| X-101-32x8d-FPN | Fast | 2x | 1 | 6.4 | 0.720 | 72.0 | 0.217 + 0.003 | 39.7 | - | - | - | 37121469 | model / boxes |
| R-50-C4 | Mask | 1x | 1 | 6.4 | 0.466 | 23.3 | 0.252 + 0.020 | 35.5 | 31.3 | - | - | 36224121 | model / boxes / masks |
| R-50-C4 | Mask | 2x | 1 | 6.4 | 0.464 | 46.4 | 0.253 + 0.019 | 36.9 | 32.5 | - | - | 36224151 | model / boxes / masks |
| R-50-FPN | Mask | 1x | 2 | 7.9 | 0.377 | 9.4 | 0.082 + 0.019 | 37.3 | 33.7 | - | - | 36225401 | model / boxes / masks |
| R-50-FPN | Mask | 2x | 2 | 7.9 | 0.377 | 18.9 | 0.083 + 0.018 | 37.7 | 34.0 | - | - | 36225732 | model / boxes / masks |
| R-101-FPN | Mask | 1x | 2 | 9.6 | 0.539 | 13.5 | 0.111 + 0.018 | 39.4 | 35.6 | - | - | 36229407 | model / boxes / masks |
| R-101-FPN | Mask | 2x | 2 | 9.6 | 0.537 | 26.9 | 0.109 + 0.016 | 40.0 | 35.9 | - | - | 36229740 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.3 | 1.036 | 51.8 | 0.292 + 0.016 | 41.3 | 37.0 | - | - | 36226382 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.3 | 1.035 | 103.5 | 0.292 + 0.014 | 41.1 | 36.6 | - | - | 36672114 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.4 | 0.766 | 38.3 | 0.223 + 0.017 | 41.3 | 37.0 | - | - | 37121516 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.4 | 0.765 | 76.5 | 0.222 + 0.014 | 40.7 | 36.3 | - | - | 37121596 | model / boxes / masks |

Notes:

  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.
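Because the inference times in this table exclude proposal generation, a full per-image estimate needs the RPN time from the matching proposal row added in. The helper below is a hypothetical sketch for reading the "X + Y" cells (GPU time plus CPU time); treating a cell with a single number as having no separately reported CPU time is an assumption.

```python
def parse_inference_time(cell):
    """Split an inference-time cell such as "0.241 + 0.003" into its parts.

    Returns (gpu_seconds, cpu_seconds, total_seconds). A cell with a single
    number is treated as having no separately reported CPU time.
    """
    parts = [float(p) for p in cell.split("+")]
    if len(parts) == 1:
        gpu, cpu = parts[0], 0.0
    elif len(parts) == 2:
        gpu, cpu = parts
    else:
        raise ValueError("expected 'X' or 'X + Y', got %r" % cell)
    return gpu, cpu, gpu + cpu


def end_to_end_time(proposal_cell, detector_cell):
    """Add proposal-generation time to a detector's inference time."""
    return parse_inference_time(proposal_cell)[2] + parse_inference_time(detector_cell)[2]
```

For instance, combining the R-50-FPN RPN row (0.080) with the R-50-FPN Fast 1x row (0.076 + 0.004) gives roughly 0.160 s/im.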

End-to-End Faster & Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Faster | 1x | 1 | 6.3 | 0.566 | 28.3 | 0.167 + 0.003 | 34.8 | - | - | - | 35857197 | model / boxes |
| R-50-C4 | Faster | 2x | 1 | 6.3 | 0.569 | 56.9 | 0.174 + 0.003 | 36.5 | - | - | - | 35857281 | model / boxes |
| R-50-FPN | Faster | 1x | 2 | 7.2 | 0.544 | 13.6 | 0.093 + 0.004 | 36.7 | - | - | - | 35857345 | model / boxes |
| R-50-FPN | Faster | 2x | 2 | 7.2 | 0.546 | 27.3 | 0.092 + 0.004 | 37.9 | - | - | - | 35857389 | model / boxes |
| R-101-FPN | Faster | 1x | 2 | 8.9 | 0.647 | 16.2 | 0.120 + 0.004 | 39.4 | - | - | - | 35857890 | model / boxes |
| R-101-FPN | Faster | 2x | 2 | 8.9 | 0.647 | 32.4 | 0.119 + 0.004 | 39.8 | - | - | - | 35857952 | model / boxes |
| X-101-64x4d-FPN | Faster | 1x | 1 | 6.9 | 1.057 | 52.9 | 0.305 + 0.003 | 41.5 | - | - | - | 35858015 | model / boxes |
| X-101-64x4d-FPN | Faster | 2x | 1 | 6.9 | 1.055 | 105.5 | 0.304 + 0.003 | 40.8 | - | - | - | 35858198 | model / boxes |
| X-101-32x8d-FPN | Faster | 1x | 1 | 7.0 | 0.799 | 40.0 | 0.233 + 0.004 | 41.3 | - | - | - | 36761737 | model / boxes |
| X-101-32x8d-FPN | Faster | 2x | 1 | 7.0 | 0.800 | 80.0 | 0.233 + 0.003 | 40.6 | - | - | - | 36761786 | model / boxes |
| R-50-C4 | Mask | 1x | 1 | 6.6 | 0.620 | 31.0 | 0.181 + 0.018 | 35.8 | 31.4 | - | - | 35858791 | model / boxes / masks |
| R-50-C4 | Mask | 2x | 1 | 6.6 | 0.620 | 62.0 | 0.182 + 0.017 | 37.8 | 32.8 | - | - | 35858828 | model / boxes / masks |
| R-50-FPN | Mask | 1x | 2 | 8.6 | 0.889 | 22.2 | 0.099 + 0.019 | 37.7 | 33.9 | - | - | 35858933 | model / boxes / masks |
| R-50-FPN | Mask | 2x | 2 | 8.6 | 0.897 | 44.9 | 0.099 + 0.018 | 38.6 | 34.5 | - | - | 35859007 | model / boxes / masks |
| R-101-FPN | Mask | 1x | 2 | 10.2 | 1.008 | 25.2 | 0.126 + 0.018 | 40.0 | 35.9 | - | - | 35861795 | model / boxes / masks |
| R-101-FPN | Mask | 2x | 2 | 10.2 | 0.993 | 49.7 | 0.126 + 0.017 | 40.9 | 36.4 | - | - | 35861858 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.6 | 1.217 | 60.9 | 0.309 + 0.018 | 42.4 | 37.5 | - | - | 36494496 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.6 | 1.210 | 121.0 | 0.309 + 0.015 | 42.2 | 37.2 | - | - | 35859745 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.7 | 0.961 | 48.1 | 0.239 + 0.019 | 42.1 | 37.3 | - | - | 36761843 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.7 | 0.975 | 97.5 | 0.240 + 0.016 | 41.7 | 36.9 | - | - | 36762092 | model / boxes / masks |

Notes:

  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.
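The "boxes" downloads are described in the common settings as COCO json. Assuming they follow the standard COCO detection results layout (a list of records with `image_id`, `category_id`, `bbox` as `[x, y, width, height]`, and `score`), a minimal reader might look like the sketch below. The function names are hypothetical, and the exact contents of the downloaded files have not been verified here.

```python
import json
from collections import defaultdict


def load_box_results(path, min_score=0.0):
    """Load COCO-style box results and group them by image id."""
    with open(path) as f:
        detections = json.load(f)
    by_image = defaultdict(list)
    for det in detections:
        if det["score"] >= min_score:
            by_image[det["image_id"]].append(det)
    return dict(by_image)


def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]
```

Grouping by image id keeps per-image inspection cheap; computing AP itself would normally be left to the official COCO evaluation tools.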

RetinaNet Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RetinaNet | 1x | 2 | 6.8 | 0.483 | 12.1 | 0.125 | 35.7 | - | - | - | 36768636 | model / boxes |
| R-50-FPN | RetinaNet | 2x | 2 | 6.8 | 0.482 | 24.1 | 0.127 | 35.7 | - | - | - | 36768677 | model / boxes |
| R-101-FPN | RetinaNet | 1x | 2 | 8.7 | 0.666 | 16.7 | 0.156 | 37.7 | - | - | - | 36768744 | model / boxes |
| R-101-FPN | RetinaNet | 2x | 2 | 8.7 | 0.666 | 33.3 | 0.154 | 37.8 | - | - | - | 36768840 | model / boxes |
| X-101-64x4d-FPN | RetinaNet | 1x | 2 | 12.6 | 1.613 | 40.3 | 0.341 | 39.8 | - | - | - | 36768875 | model / boxes |
| X-101-64x4d-FPN | RetinaNet | 2x | 2 | 12.6 | 1.625 | 81.3 | 0.339 | 39.2 | - | - | - | 36768907 | model / boxes |
| X-101-32x8d-FPN | RetinaNet | 1x | 2 | 12.7 | 1.343 | 33.6 | 0.277 | 39.5 | - | - | - | 36769563 | model / boxes |
| X-101-32x8d-FPN | RetinaNet | 2x | 2 | 12.7 | 1.340 | 67.0 | 0.276 | 38.6 | - | - | - | 36769641 | model / boxes |

Notes: none

Mask R-CNN with Bells & Whistles

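| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X-152-32x8d-FPN-IN5k | Mask | s1x | 1 | 9.6 | 1.188 | 85.8 | 12.100 + 0.046 | 48.1 | 41.5 | - | - | 37129812 | model / boxes / masks |
| [above without test-time aug.] | | | | | | | 0.325 + 0.018 | 45.2 | 39.7 | - | - | | |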

Notes:

  • A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
  • The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
  • Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
  • Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale)
  • Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
  • Like the other results, this is a single model result (it is not an ensemble of models)
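The scale sets in the notes above can be illustrated with a short sketch. It is not Detectron's data-loading code: the function names are hypothetical, and it assumes each listed scale is the target length of the image's shorter side, with an optional cap on the longer side that the notes do not specify.

```python
import random

TRAIN_SCALES = (640, 672, 704, 736, 768, 800)
TEST_SCALES = (400, 500, 600, 700, 900, 1000, 1100, 1200)


def resize_factor(height, width, target_short_side, max_long_side=None):
    """Factor that brings the shorter side to the target length.

    If max_long_side is given (an assumed, optional cap), the factor is
    reduced so that the longer side does not exceed it.
    """
    factor = target_short_side / min(height, width)
    if max_long_side is not None and factor * max(height, width) > max_long_side:
        factor = max_long_side / max(height, width)
    return factor


def sample_train_scale(rng=random):
    """Multi-scale jitter: pick one training scale at random per image."""
    return rng.choice(TRAIN_SCALES)


def test_time_variants():
    """All (scale, flipped) combinations used for row 1: each scale, with and without flip."""
    return [(scale, flipped) for scale in TEST_SCALES for flipped in (False, True)]
```

With eight test scales and horizontal flipping at each, the model is run 16 times per image, which is consistent with the much larger inference time reported for row 1.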

Keypoint Detection Baselines

Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)

Our keypoint detection baselines differ from our box and mask baselines in three details:

  • Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
  • Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set).
  • Metrics are reported for the person class only (still run on the entire coco_2014_minival dataset).
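The training-set filter described above can be sketched against the COCO annotation layout, in which person annotations carry a `num_keypoints` field and the person category has id 1. The helper names are hypothetical, and treating "with keypoint annotations" as `num_keypoints > 0` is an assumption about the exact criterion.

```python
PERSON_CATEGORY_ID = 1  # COCO's id for the person category


def image_ids_with_keypoints(annotations):
    """Ids of images with at least one person annotated with keypoints."""
    keep = set()
    for ann in annotations:
        is_person = ann.get("category_id") == PERSON_CATEGORY_ID
        if is_person and ann.get("num_keypoints", 0) > 0:
            keep.add(ann["image_id"])
    return keep


def filter_images(images, annotations):
    """Drop images that have no person with keypoint annotations."""
    keep = image_ids_with_keypoints(annotations)
    return [im for im in images if im["id"] in keep]
```

Here `images` and `annotations` stand for the `"images"` and `"annotations"` lists of a COCO annotation file.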

Person-Specific RPN Baselines

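| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.391 | 9.8 | 0.082 | - | - | - | 64.0 | 35998996 | model / props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.504 | 12.6 | 0.109 | - | - | - | 65.2 | 35999521 | model / props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.394 | 34.9 | 0.289 | - | - | - | 65.9 | 35999553 | model / props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.104 | 27.6 | 0.224 | - | - | - | 66.2 | 36760438 | model / props: 1, 2, 3 |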

Notes:

  • Metrics are for the person category only.
  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.

Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 7.7 | 0.533 | 13.3 | 0.081 + 0.087 | 52.7 | - | 64.1 | - | 37651787 | model / boxes / kps |
| R-50-FPN | Kps | s1x | 2 | 7.7 | 0.533 | 19.2 | 0.080 + 0.085 | 53.4 | - | 65.5 | - | 37651887 | model / boxes / kps |
| R-101-FPN | Kps | 1x | 2 | 9.4 | 0.668 | 16.7 | 0.109 + 0.080 | 53.5 | - | 65.0 | - | 37651996 | model / boxes / kps |
| R-101-FPN | Kps | s1x | 2 | 9.4 | 0.668 | 24.1 | 0.108 + 0.076 | 54.6 | - | 66.0 | - | 37652016 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 12.8 | 1.477 | 36.9 | 0.288 + 0.077 | 55.8 | - | 66.7 | - | 37731079 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 12.9 | 1.478 | 53.4 | 0.286 + 0.075 | 56.3 | - | 67.1 | - | 37731142 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 12.9 | 1.215 | 30.4 | 0.219 + 0.084 | 55.4 | - | 66.2 | - | 37730253 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 12.9 | 1.214 | 43.8 | 0.218 + 0.071 | 55.9 | - | 67.0 | - | 37731010 | model / boxes / kps |

Notes:

  • Metrics are for the person category only.
  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.

End-to-End Keypoint-Only Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 9.0 | 0.832 | 20.8 | 0.097 + 0.092 | 53.6 | - | 64.2 | - | 37697547 | model / boxes / kps |
| R-50-FPN | Kps | s1x | 2 | 9.0 | 0.828 | 29.9 | 0.096 + 0.089 | 54.3 | - | 65.4 | - | 37697714 | model / boxes / kps |
| R-101-FPN | Kps | 1x | 2 | 10.6 | 0.923 | 23.1 | 0.124 + 0.084 | 54.5 | - | 64.8 | - | 37697946 | model / boxes / kps |
| R-101-FPN | Kps | s1x | 2 | 10.6 | 0.921 | 33.3 | 0.123 + 0.083 | 55.3 | - | 65.8 | - | 37698009 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 14.1 | 1.655 | 41.4 | 0.302 + 0.079 | 56.3 | - | 66.0 | - | 37732355 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 14.1 | 1.731 | 62.5 | 0.322 + 0.074 | 56.9 | - | 66.8 | - | 37732415 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 14.2 | 1.410 | 35.3 | 0.235 + 0.080 | 56.0 | - | 66.0 | - | 37792158 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 14.2 | 1.408 | 50.8 | 0.236 + 0.075 | 56.9 | - | 67.0 | - | 37732318 | model / boxes / kps |

Notes:

  • Metrics are for the person category only.
  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.