Project page: http://www.robots.ox.ac.uk/~tvg/projects/StraightToShapes
Abstract: This work accomplishes direct regression to objects' shapes, in addition to their bounding boxes and categories, by introducing a compact and decodable shape embedding space using a denoising autoencoder. A deep convolutional network is trained to regress to the low-dimensional shape vectors, which are then mapped to shape masks using the decoder half of the autoencoder. Our end-to-end network qualifies as the first real-time instance segmentation pipeline, running at ~35 FPS while yielding promising results at the task. The proposed top-down regression to object shape masks through a semantically defined shape space allows the network to generalize to unseen categories at test time. We call this zero-shot segmentation and evaluate our model on this task to establish a baseline against which future research can be measured.
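The decoding step described in the abstract (a low-dimensional shape vector mapped to a mask) can be illustrated with a toy example. This is only a sketch under stated assumptions: the real decoder is the learnt decoder half of the denoising autoencoder, whereas the stand-in below is a single linear layer followed by a sigmoid and a 0.5 threshold. The function name `decode_shape`, the weights, and the mask size are hypothetical and chosen purely for illustration.

```python
import math
from typing import List, Sequence


def decode_shape(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    side: int,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Map a shape vector to a side x side binary mask.

    Toy stand-in for a learnt decoder (assumption): one linear layer,
    a sigmoid, and a fixed threshold. `weights` has one row per pixel.
    """
    n_pixels = side * side
    if len(weights) != n_pixels or len(bias) != n_pixels:
        raise ValueError("weights and bias need one entry per mask pixel")

    probs = []
    for row, b in zip(weights, bias):
        z = sum(w * e for w, e in zip(row, embedding)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))  # per-pixel foreground probability

    # Reshape the flat probability list into rows and binarise it.
    return [
        [1 if probs[r * side + c] >= threshold else 0 for c in range(side)]
        for r in range(side)
    ]


# Hypothetical 2D embedding decoded to a 2x2 mask.
toy_weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
toy_mask = decode_shape([1.0, -1.0], toy_weights, [0.0] * 4, side=2)
```

In a full pipeline the resulting mask would then be resized to the predicted bounding box; that step is omitted here.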
mAP estimates over 2000 randomly selected train and val images from the SBD dataset, as the network architectures minimize a proxy L2-regression loss over the output set {shape params, bounding-box params, class probabilities}.
[BN: using batch normalization; DA: using stronger data augmentation.]
Archi. | Shape space | mAP@0.5 | mAP@0.7 | mAP vol | Runtime (ms) |
---|---|---|---|---|---|
Ours | |||||
YOLO | Binary mask | 32.3 | 12.0 | 28.6 | 26.3* |
YOLO | Radial | 30.0 | 6.5 | 29.0 | 27.1* |
YOLO | Embedding (50) | 32.6 | 14.8 | 28.9 | 30.5* |
YOLO | Embedding (20) | 34.6 | 15.0 | 31.5 | 28.0* |
YOLO-BN | Embedding (20) | 38.6 | 17.4 | 34.3 | 28.0* |
YOLO-BN-DA | Embedding (20) | 42.3 | 20.8 | 36.9 | 28.0* |
Others | |||||
SDS | - | 49.7 | - | 41.4 | 48K |
MNC | - | 65.0 | 46.3 | - | 330 |
* These figures give the computation time of the entire application (including display and wait time for the user), not just the feed-forward prediction. They were measured on a high-end desktop with a Titan X GPU (GM200) using cuDNN version 5.0.0.
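The proxy L2-regression loss mentioned in the table caption can be sketched as a sum of squared errors over the concatenated output vector. This is a minimal illustration, not the training code: it assumes that the three output groups are weighted equally and that a single detection is scored at a time, neither of which is stated on this page. The function name `l2_regression_loss` is hypothetical.

```python
from typing import Sequence


def l2_regression_loss(
    pred_shape: Sequence[float],
    pred_box: Sequence[float],
    pred_class_probs: Sequence[float],
    true_shape: Sequence[float],
    true_box: Sequence[float],
    true_class_probs: Sequence[float],
) -> float:
    """Sum of squared errors over {shape params, box params, class probs}.

    Assumption: all three groups carry equal weight; any per-group
    weighting used in practice is not reproduced here.
    """
    pred = list(pred_shape) + list(pred_box) + list(pred_class_probs)
    true = list(true_shape) + list(true_box) + list(true_class_probs)
    if len(pred) != len(true):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, true))


# Hypothetical example: 2D shape vector, 4 box params, 2 classes.
example_loss = l2_regression_loss(
    [1.0, 0.0], [0.5, 0.5, 0.5, 0.5], [1.0, 0.0],
    [0.0, 0.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0],
)
```

Treating the shape parameters as just another regression target is what lets the binary mask, radial, and embedding representations in the table share the same loss.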
(a) Correct predictions using downsampled binary masks [YOLO-binarymask].
(b) Correct predictions using 20D learnt shape encodings [YOLO-embedding20]. In column 3, the cow's horns are missed and the human shape mask is elongated because of an incorrect bounding box prediction.
(c) Missed detections using the 20D shape encodings [YOLO-embedding20]. The network misses or falsely fires on small objects in column 2, the dogs in the images are wrongly categorised as cats, and the sofa mask incorrectly includes the nearby dining table.
The following evaluation was performed on 8037 images from the COCO (80 categories) val set using our embedding (50) model trained on the SBD train+val set. These images contain objects strictly from the 60 categories that COCO does not share with the PASCAL VOC (20 categories) dataset. The code to run this evaluation is a modification of the COCO evaluation (MATLAB) benchmark code and can be found here.
Archi. | Shape space | [email protected] (all) | [email protected] (large) | [email protected] (large) |
---|---|---|---|---|
YOLO | Embedding (50) | 3.6 | 7.1 | 23.2 |
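The image selection described above (keeping only images whose objects all fall outside the 20 PASCAL VOC categories) can be sketched as follows. This is an illustrative Python sketch, not the MATLAB evaluation code referred to above. It assumes the standard COCO annotation layout with `categories` and `annotations` lists, and it assumes that images without any annotations are dropped. The function name `zero_shot_image_ids` is hypothetical.

```python
from typing import Dict, List, Set

# COCO names of the 20 categories that overlap with PASCAL VOC.
VOC_IN_COCO: Set[str] = {
    "airplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "dining table", "dog", "horse", "motorcycle", "person",
    "potted plant", "sheep", "couch", "train", "tv",
}


def zero_shot_image_ids(coco: dict) -> List[int]:
    """Return ids of images whose annotated objects are all non-VOC."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

    # Collect the set of category names present in each image.
    names_per_image: Dict[int, Set[str]] = {}
    for ann in coco["annotations"]:
        names = names_per_image.setdefault(ann["image_id"], set())
        names.add(id_to_name[ann["category_id"]])

    # Keep an image only if none of its categories is shared with VOC.
    return sorted(
        image_id
        for image_id, names in names_per_image.items()
        if names.isdisjoint(VOC_IN_COCO)
    )


# Tiny hypothetical annotation set for illustration.
toy_coco = {
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 22, "name": "elephant"},
        {"id": 24, "name": "zebra"},
    ],
    "annotations": [
        {"image_id": 10, "category_id": 22},
        {"image_id": 10, "category_id": 24},
        {"image_id": 11, "category_id": 22},
        {"image_id": 11, "category_id": 1},
        {"image_id": 12, "category_id": 1},
    ],
}
selected = zero_shot_image_ids(toy_coco)
```

Image 11 is rejected even though it contains an elephant, because the "strictly" condition excludes any image that also shows a VOC category.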
A comparison between the state-of-the-art (a) semantic segmentation, (b) instance segmentation, and (c) our shape detection results using YOLO-embedding50, on images from YouTube videos of animals that are not present in the PASCAL training set.
In the first two rows, instance segmentation predicts that the legs of the tiger are human. Our method is more consistent over the tiger images taken from the same video.
In the lower rows, the instance segmentation approach of (b) fails to predict any segments, whilst our method predicts class 'dog' for the tiger, hedgehog, baby elephant and bear, and class 'horse' for the large elephant.
This version of the StraightToShapes concept was implemented by Saumya Jetley, Michael Sapienza and Stuart Golodetz, under the supervision of Professor Philip Torr. Additional experiments with batch normalization and data augmentation were contributed by Laurynas Miksys.
It is built on top of Darknet, an open-source neural network framework developed by Joseph Redmon.
If you build on this framework for your research, please consider citing the original research paper:
@article{JetleySapienza2016,
author = {Saumya Jetley and
Michael Sapienza and
Stuart Golodetz and
Philip H. S. Torr},
title = {Straight to Shapes: Real-time Detection of Encoded Shapes},
journal = {CoRR},
volume = {abs/1611.07932},
year = {2016},
url = {http://arxiv.org/abs/1611.07932},
}
To install the software, follow the instructions provided HERE.
This work is protected by a license agreement.