RetinaNet is a single-stage object detection model.
Model | Download | Download (with sample test data) | ONNX version | Opset version | Accuracy |
---|---|---|---|---|---|
RetinaNet (ResNet101 backbone) | 228.4 MB | 153.3 MB | 1.6.0 | 9 | mAP 0.376 |
A sample script for ONNX model conversion and ONNX Runtime inference can be found here.
The model expects mini-batches of 3-channel input images of shape (N x 3 x H x W), where N is the batch size.
The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. This transformation should be done as a preprocessing step.
The following code shows how to preprocess an input image into an NCHW tensor:
```python
from torchvision import transforms

# Convert the image to a tensor in [0, 1] and normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

input_tensor = preprocess(input_image)
# Create a mini-batch as expected by the model.
input_batch = input_tensor.unsqueeze(0)
```
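With the batch prepared, inference can be run through ONNX Runtime. The snippet below is a minimal sketch rather than the sample script referenced above; the model file name `retinanet-9.onnx` is an assumption (use the path of the downloaded model), and `input_batch` is the tensor produced by the preprocessing code.

```python
import numpy as np
import onnxruntime

# The model file name is an assumption; point this at the downloaded .onnx file.
session = onnxruntime.InferenceSession("retinanet-9.onnx")
input_name = session.get_inputs()[0].name

# ONNX Runtime expects a float32 NumPy array of shape (N, 3, H, W).
outputs = session.run(None, {input_name: input_batch.numpy().astype(np.float32)})

# The model returns 10 tensors: 5 classification heads and 5 box regression heads.
print(len(outputs), [o.shape for o in outputs])
```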
The model has two groups of outputs:
- Classification heads: 5 tensors of rank 4, each corresponding to the anchor box classification head of one feature level in the feature pyramid network.
- Bounding box regression heads: 5 tensors of rank 4, each corresponding to the regressions (from anchor boxes to object boxes) of one feature level in the feature pyramid network.
Output sizes depend on the input image size and on the parameters of the model's convolutional subnetworks.
Output dimensions for an image mini-batch of size [1, 3, 480, 640]:
- Classification head sizes: [1, 720, 60, 80], [1, 720, 30, 40], [1, 720, 15, 20], [1, 720, 8, 10], [1, 720, 4, 5]
- Box regression head sizes: [1, 36, 60, 80], [1, 36, 30, 40], [1, 36, 15, 20], [1, 36, 8, 10], [1, 36, 4, 5]

The channel dimensions correspond to 9 anchors × 80 COCO classes = 720 for classification and 9 anchors × 4 box coordinates = 36 for regression.
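The postprocessing script below consumes these heads as two lists of PyTorch tensors. Assuming the exported model emits the five classification heads followed by the five box regression heads (check the output names of the exported graph if unsure), the ONNX Runtime outputs can be split as follows:

```python
import torch

# Assumed ordering: outputs[0:5] are classification heads, outputs[5:10] are box regression heads.
cls_heads = [torch.from_numpy(o) for o in outputs[:5]]
box_heads = [torch.from_numpy(o) for o in outputs[5:]]
```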
The following script, from NVIDIA/retinanet-examples, shows how to:
- Generate anchor boxes.
- Decode and filter box predictions, keeping at most the 1000 top-scoring predictions per level with a confidence threshold of 0.05.
- Apply non-maximum suppression to obtain the final detection boxes, scores, and labels.
```python
import torch
from retinanet.box import generate_anchors, decode, nms

def detection_postprocess(image, cls_heads, box_heads):
    # Inference post-processing
    anchors = {}
    decoded = []
    for cls_head, box_head in zip(cls_heads, box_heads):
        # Generate this level's anchors
        stride = image.shape[-1] // cls_head.shape[-1]
        if stride not in anchors:
            anchors[stride] = generate_anchors(stride, ratio_vals=[1.0, 2.0, 0.5],
                                               scales_vals=[4 * 2 ** (i / 3) for i in range(3)])
        # Decode and filter boxes
        decoded.append(decode(cls_head, box_head, stride,
                              threshold=0.05, top_n=1000, anchors=anchors[stride]))

    # Perform non-maximum suppression
    decoded = [torch.cat(tensors, 1) for tensors in zip(*decoded)]
    # NMS threshold = 0.5
    scores, boxes, labels = nms(*decoded, nms=0.5, ndetections=100)
    return scores, boxes, labels

# Use the preprocessed batch so that stride = image width / head width is computed correctly.
scores, boxes, labels = detection_postprocess(input_batch, cls_heads, box_heads)
```
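The returned tensors are batched (the first dimension is the batch index), with up to `ndetections` entries per image. A simple way to extract final detections for the first image is to threshold on the score; the 0.3 threshold below is an arbitrary example value, and the (x1, y1, x2, y2) box layout is an assumption about the decode step rather than something stated above.

```python
# Keep only detections whose score exceeds an example threshold of 0.3.
keep = scores[0] > 0.3
final_boxes = boxes[0][keep]    # assumed layout: (x1, y1, x2, y2) in input-image pixels
final_scores = scores[0][keep]
final_labels = labels[0][keep]  # class indices into the COCO label set
```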
The model backbone is initialized from the pre-trained pytorch/vision ResNet101 model, which is itself pre-trained on the ImageNet dataset.
The accuracy reported above is provided by NVIDIA/retinanet-examples and measured on the validation set. The metric is COCO mAP (averaged over IoU thresholds 0.5:0.95), computed over the COCO 2017 val data.
Focal Loss for Dense Object Detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár. arXiv, 2017.
This model is converted directly from NVIDIA/retinanet-examples.
BSD 3-Clause "New" or "Revised" License