This repository contains implementations of various approaches to object detection in general and to text detection and recognition in particular.
The code was initially used to carry out the experiments for the author's master's thesis *End-to-End Scene Text Recognition based on Artificial Neural Networks* and was later extended with implementations of more recent approaches.
Most of the ideas used in this project go back to the following papers:
*SSD: Single Shot MultiBox Detector* [arXiv:1512.02325](https://arxiv.org/abs/1512.02325)

SSD is a generic object detector that performs local regression and classification on multiple feature maps of a CNN to predict a dense population of bounding boxes, which are subsequently filtered by a confidence threshold and non-maximum suppression (NMS).
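To make the test stage concrete, here is a minimal NumPy sketch of the confidence filtering and greedy NMS step; the function names and default thresholds are illustrative, not the repository's API:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression on [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # process boxes by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep

def filter_detections(boxes, scores, conf_threshold=0.5):
    """Confidence thresholding followed by NMS, as in the SSD test stage."""
    mask = scores > conf_threshold
    boxes, scores = boxes[mask], scores[mask]
    keep = np.array(nms(boxes, scores), dtype=int)
    return boxes[keep], scores[keep]
```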
*TextBoxes: A Fast Text Detector with a Single Deep Neural Network* [arXiv:1611.06779](https://arxiv.org/abs/1611.06779)

TextBoxes is a modification of SSD that uses non-square convolution kernels and prior boxes with large aspect ratios to better detect horizontal text.
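A sketch of the idea in Keras (the feature map shape and layer names are made up; the aspect ratios are those proposed in the paper):

```python
from tensorflow.keras.layers import Input, Conv2D

# hypothetical backbone feature map, e.g. 38x38 with 512 channels
x = Input(shape=(38, 38, 512))

aspect_ratios = [1, 2, 3, 5, 7, 10]  # wide prior boxes for horizontal words
num_priors = len(aspect_ratios)

# 1x5 kernels instead of SSD's square 3x3 prediction kernels
loc = Conv2D(num_priors * 4, (1, 5), padding='same', name='loc')(x)    # box offsets
conf = Conv2D(num_priors * 2, (1, 5), padding='same', name='conf')(x)  # text vs. background
```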
*DSOD: Learning Deeply Supervised Object Detectors from Scratch* [arXiv:1708.01241](https://arxiv.org/abs/1708.01241)

DSOD is a modification of SSD that uses DenseNet as backbone architecture and can thus be trained from scratch instead of depending on a pretrained VGG-16 model.
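The core building block is sketched below (a generic DenseNet-style block in Keras; the growth rate of 48 follows the DSOD paper, the rest is illustrative):

```python
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, Concatenate)

def dense_block(x, num_layers=4, growth_rate=48):
    """DenseNet-style block: each layer sees the outputs of all previous ones,
    giving the implicit deep supervision that makes training from scratch work."""
    for _ in range(num_layers):
        y = BatchNormalization()(x)
        y = Activation('relu')(y)
        y = Conv2D(growth_rate, (3, 3), padding='same')(y)
        x = Concatenate()([x, y])  # dense connectivity via channel concatenation
    return x

features = dense_block(Input(shape=(64, 64, 128)))
```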
*Detecting Oriented Text in Natural Images by Linking Segments* [arXiv:1703.06520](https://arxiv.org/abs/1703.06520)

SegLink builds on SSD and detects oriented text by locally predicting text segments (the objects in SSD terms) and their pairwise links. The segments (vertices) and links (edges) form a graph and are thresholded by confidence; the remaining connected groups are finally combined into bounding boxes.
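A minimal sketch of the grouping step (plain Python/NumPy with a union-find; names and input layout are assumptions, not the repository's code):

```python
import numpy as np

def group_segments(seg_scores, link_scores, link_pairs,
                   segment_threshold=0.6, link_threshold=0.25):
    """Connected components over confident segments joined by confident links.
    seg_scores: (N,) array, link_scores: (M,) array,
    link_pairs[k] = (i, j) gives the two segments joined by link k."""
    keep = seg_scores >= segment_threshold
    parent = list(range(len(seg_scores)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for k, (i, j) in enumerate(link_pairs):
        if link_scores[k] >= link_threshold and keep[i] and keep[j]:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in np.where(keep)[0]:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())  # one list of segment indices per text instance
```

Each resulting group is then fused into a single oriented bounding box, e.g. by fitting a rotated rectangle to its member segments.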
*TextBoxes++: A Single-Shot Oriented Scene Text Detector* [arXiv:1801.02765](https://arxiv.org/abs/1801.02765)

TextBoxes++ extends TextBoxes to arbitrarily oriented text by predicting horizontal bounding boxes as well as quadrilaterals and oriented bounding boxes. It additionally uses the recognition score to eliminate false positives from the detection stage (currently not implemented).
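One plausible decoding step for the quadrilateral output is sketched below (the corner parameterization follows the spirit of the paper, but names and details are assumptions):

```python
import numpy as np

def decode_quads(priors, offsets):
    """Decode quadrilaterals from horizontal prior boxes.
    priors: (N, 4) as [cx, cy, w, h]; offsets: (N, 8) shift each of the
    prior's four corners, scaled by the prior's width and height."""
    cx, cy, w, h = priors.T
    # prior corners: top-left, top-right, bottom-right, bottom-left
    px = np.stack([cx - w / 2, cx + w / 2, cx + w / 2, cx - w / 2], axis=1)
    py = np.stack([cy - h / 2, cy - h / 2, cy + h / 2, cy + h / 2], axis=1)
    quads = np.empty_like(offsets)
    quads[:, 0::2] = px + offsets[:, 0::2] * w[:, None]  # corner x coordinates
    quads[:, 1::2] = py + offsets[:, 1::2] * h[:, None]  # corner y coordinates
    return quads
```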
*Focal Loss for Dense Object Detection* [arXiv:1708.02002](https://arxiv.org/abs/1708.02002)

The focal loss is a dynamically weighted version of the cross-entropy loss that can better handle a large imbalance between the classes and focus the training process on the difficult samples. It can be applied to the aforementioned detectors, instead of hard negative mining, to overcome the dominance of the background class.
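For reference, the loss is FL(p_t) = -α (1 - p_t)^γ log(p_t) with the paper's defaults γ = 2 and α = 0.25. A minimal Keras/TensorFlow sketch (α is applied uniformly here; the paper balances it per class):

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """y_true: one-hot labels, y_pred: softmax probabilities."""
    eps = tf.keras.backend.epsilon()
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = tf.reduce_sum(y_true * y_pred, axis=-1)  # probability of the true class
    # (1 - p_t)^gamma down-weights easy examples, focusing training on hard ones
    return tf.reduce_mean(-alpha * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
```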
*An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition* [arXiv:1507.05717](https://arxiv.org/abs/1507.05717)

CRNN is a relatively simple architecture with some convolutional-pooling blocks, followed by two bidirectional LSTM (GRU in this implementation) layers, which can be trained with a CTC loss for efficient text recognition. It can be used to read the text in the cropped bounding boxes generated by the text detectors mentioned above.
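The overall shape of such a model in Keras (GRU variant; all sizes and names are illustrative, not the repository's exact architecture):

```python
from tensorflow.keras import layers, Model

num_classes = 63  # hypothetical: 62 characters plus the CTC blank

inputs = layers.Input(shape=(256, 32, 1))  # grayscale crop, width x height
x = inputs
for filters in (64, 128, 256):  # convolutional-pooling blocks
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)
x = layers.Reshape((32, -1))(x)  # image width becomes the time axis
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)  # per-step characters
model = Model(inputs, outputs)  # train with a CTC loss, e.g. tf.nn.ctc_loss
```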
Currently supported datasets for object detection are:

- PASCAL VOC
- MS COCO

Supported datasets related to text are:

- ICDAR2015 FST
- ICDAR2015 IST
- SynthText
- MSRA TD500
- SVT
- COCO Text
For more information about the datasets, see `datasets.ipynb`.
For suitable versions of the necessary dependencies, see `environment.ipynb`.
Using the code is quite straightforward: clone the repository and run the related Jupyter notebooks. Some of the scripts (e.g. for video and model conversion) can also be executed from the command line.
Pretrained SSD models can be converted from the original Caffe implementation:

- PASCAL VOC 07+12+COCO SSD300*
- PASCAL VOC 07+12+COCO SSD512*
- COCO trainval35k SSD300*
- COCO trainval35k SSD512*
SegLink, initialized with converted SSD512 weights, trained and tested on subsets of SynthText:

- segment_threshold 0.60, link_threshold 0.25
- precision 0.884, recall 0.854, f-measure 0.869
A second SegLink model, trained and tested on subsets of SynthText:

- segment_threshold 0.60, link_threshold 0.50
- precision 0.937, recall 0.926, f-measure 0.932
A further detection model, trained and tested on subsets of SynthText:

- threshold 0.35
- precision 0.984, recall 0.890, f-measure 0.934
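In the numbers above, the f-measure is the harmonic mean of precision and recall, F = 2PR / (P + R); for the first SegLink model, 2 · 0.884 · 0.854 / (0.884 + 0.854) ≈ 0.869.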
CRNN, trained and tested on cropped word-level bounding boxes from SynthText:

- mean edit distance 0.332
- mean normalized edit distance 0.081
- character recognition rate 0.916
- word recognition rate 0.861
A second CRNN model, trained and tested on cropped word-level bounding boxes from SynthText:

- mean edit distance 0.333
- mean normalized edit distance 0.081
- character recognition rate 0.916
- word recognition rate 0.858
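The recognition metrics can be reproduced along the following lines; this is a minimal sketch, and the exact definition of the character recognition rate used here is an assumption:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def recognition_metrics(predictions, labels):
    d = [edit_distance(p, l) for p, l in zip(predictions, labels)]
    return {
        'mean_editdistance': sum(d) / len(d),
        'mean_normalized_editdistance':
            sum(di / max(len(l), 1) for di, l in zip(d, labels)) / len(d),
        # assumed definition: fraction of label characters unaffected by errors
        'character_recognition_rate':
            1 - sum(d) / sum(len(l) for l in labels),
        'word_recognition_rate': sum(di == 0 for di in d) / len(d),
    }
```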