This is a minimal implementation that simply contains these files:
- coco.py: load COCO data
- data.py: prepare data for training
- common.py: common data preparation utilities
- basemodel.py: implement backbones
- model_box.py: implement box-related symbolic functions
- model_{fpn,rpn,frcnn,mrcnn,cascade}.py: implement FPN, RPN, Fast-/Mask-/Cascade-RCNN models.
- train.py: main training script
- utils/: third-party helper functions
- eval.py: evaluation utilities
- viz.py: visualization utilities
Data:
- It's easy to train on your own data. Just replace `COCODetection.load_many` in `data.py` with your own loader (see the loader sketch below). Also remember to change `config.NUM_CLASS` and `config.CLASS_NAMES`. The current evaluation code is also COCO-specific, and you need to change it to use your own data and metrics.
- You can easily add more augmentations such as rotation, but be careful how a box should be augmented: the code currently always uses the minimal axis-aligned bounding box of the 4 transformed corners (see the box-augmentation sketch below), which is probably not optimal. A TODO is to generate the bounding box from the segmentation, so that more augmentations can be supported naturally.
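A minimal sketch of what a replacement loader might look like, assuming the per-image dict format (`file_name`, `boxes`, `class`, `is_crowd`) that `data.py` consumes from the COCO loader; treat the exact keys and dtypes as assumptions and verify them against your version of `data.py`:

```python
import numpy as np

def load_my_dataset(samples):
    """Hypothetical replacement for COCODetection.load_many.

    `samples` is your own iterable of (image_path, annotation_list) pairs,
    where each annotation is a dict with 'bbox' (x0, y0, x1, y1 in pixels)
    and 'category' (1-based integer label).

    Returns a list of dicts, one per image, using the keys that data.py
    expects from the COCO loader (an assumption -- check data.py):
    file_name, boxes, class, is_crowd.
    """
    roidbs = []
    for img_path, annotations in samples:
        roidbs.append({
            'file_name': img_path,
            'boxes': np.asarray([a['bbox'] for a in annotations],
                                dtype='float32').reshape(-1, 4),
            'class': np.asarray([a['category'] for a in annotations],
                                dtype='int32'),
            'is_crowd': np.zeros(len(annotations), dtype='int8'),
        })
    return roidbs


if __name__ == '__main__':
    demo = [('img_0001.jpg', [{'bbox': (10., 20., 110., 220.), 'category': 1}])]
    print(load_my_dataset(demo))
```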
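For reference, the corner-based box augmentation mentioned above amounts to mapping the 4 corners through the geometric transform and taking a coordinate-wise min/max. A standalone numpy sketch (not the exact helper code in `common.py`):

```python
import numpy as np

def augment_box_with_transform(box, transform):
    """Transform the 4 corners of an (x0, y0, x1, y1) box and return their
    minimal axis-aligned bounding box. `transform` maps an (N, 2) array of
    points to an (N, 2) array of points (e.g. the point transform of a
    rotation augmentor)."""
    x0, y0, x1, y1 = box
    corners = np.array([[x0, y0], [x1, y0], [x0, y1], [x1, y1]], dtype='float32')
    new = transform(corners)
    return np.array([new[:, 0].min(), new[:, 1].min(),
                     new[:, 0].max(), new[:, 1].max()], dtype='float32')


if __name__ == '__main__':
    # Example: rotate by 30 degrees about the origin.
    theta = np.deg2rad(30)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]], dtype='float32')
    print(augment_box_with_transform((10, 10, 50, 30), lambda pts: pts @ rot.T))
```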
Model:
- Floating-point boxes are defined so that (0, 0) is the corner (not the center) of the top-left pixel; integer coordinates therefore fall on pixel boundaries.
- We use ROIAlign, and `tf.image.crop_and_resize` is NOT ROIAlign (a sketch of the difference follows this list).
- We currently only support a single image per GPU.
- Because only a single image per GPU is supported, BatchNorm statistics are supposed to be frozen during fine-tuning.
- An alternative to freezing BatchNorm is to sync BatchNorm statistics across GPUs (the `BACKBONE.NORM=SyncBN` option). This requires my bugfix, which is available since TF 1.10; you can manually apply the patch to use it. For now the total batch size is at most 8, so this option does not improve the model by much.
- Another alternative to BatchNorm is GroupNorm (`BACKBONE.NORM=GN`), which has better performance.
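To make the ROIAlign point concrete: `tf.image.crop_and_resize` takes normalized boxes and maps the box corners exactly onto the first and last sample points, while ROIAlign-style pooling samples at the bin centers of a floating-point box. One way to approximate the latter on top of `crop_and_resize` is to shift and rescale the boxes first. The sketch below illustrates that idea with one sample per output bin; it assumes the pixel-corner box convention above and is not necessarily identical to what `model_box.py` does:

```python
import tensorflow as tf

def roialign_like_crop(featuremap, boxes, box_indices, crop_size):
    """Approximate single-sample ROIAlign via tf.image.crop_and_resize.

    featuremap:  [N, H, W, C] float tensor.
    boxes:       [K, 4] float boxes as (x0, y0, x1, y1) in pixel coordinates,
                 where (0, 0) is the corner of the top-left pixel.
    box_indices: [K] int32, which image in the batch each box comes from.
    crop_size:   output resolution (an int).
    """
    x0, y0, x1, y1 = tf.split(tf.cast(boxes, tf.float32), 4, axis=1)
    img_h = tf.cast(tf.shape(featuremap)[1] - 1, tf.float32)
    img_w = tf.cast(tf.shape(featuremap)[2] - 1, tf.float32)
    crop = float(crop_size)

    spacing_w = (x1 - x0) / crop          # width of one output bin
    spacing_h = (y1 - y0) / crop
    # Place the first sample at the center of the first bin; the -0.5 converts
    # from pixel-corner coordinates to the pixel-index grid crop_and_resize uses.
    nx0 = (x0 + spacing_w / 2 - 0.5) / img_w
    ny0 = (y0 + spacing_h / 2 - 0.5) / img_h
    nw = spacing_w * (crop - 1) / img_w
    nh = spacing_h * (crop - 1) / img_h

    normalized = tf.concat([ny0, nx0, ny0 + nh, nx0 + nw], axis=1)
    return tf.image.crop_and_resize(featuremap, normalized, box_indices,
                                    crop_size=[crop_size, crop_size])
```

Calling plain `crop_and_resize` with boxes normalized by image size skips the half-pixel shift and bin-center spacing, which is exactly why it is not ROIAlign.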
Speed:
- If CuDNN warmup is on, training starts very slowly and takes about 10k steps (or more if scale augmentation is used) to reach its maximum speed. As a result, the ETA is also inaccurate at the beginning. Warmup is on by default when no scale augmentation is used.
- After warmup, the training speed will slowly decrease due to more accurate proposals.
- This implementation is about 10% slower than Detectron, probably due to the lack of specialized ops (e.g. AffineChannel, ROIAlign) in TensorFlow. It is certainly faster than other TensorFlow implementations.
- The code should reach around 70% GPU utilization on V100s, and 85%~90% scaling efficiency from 1 V100 to 8 V100s.
Possible Future Enhancements:
- Define a better interface to load custom datasets.
- Support batch > 1 per GPU.
- Use dedicated ops to improve speed (e.g. a TF implementation of the ROIAlign op can be found in Light-Head R-CNN).