v0.5.0: Skew-aware OCR & extended model/dataset zoo

@fg-mindee fg-mindee released this 31 Dec 18:32
· 357 commits to main since this release
b9d8feb

This release adds support for rotated documents, and extends both the model & dataset zoos.

Note: doctr 0.5.0 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.

Highlights

🙃 😃 Rotation-aware text detection 🙃 😃

It's no secret: the focus of this release was to bring the same level of performance to rotated documents!

[Image: predictions]

docTR is meant to be your best tool for seamless document processing, and it could not be that without supporting rotation, a very natural & common transformation of input documents. This large project was divided into three parts:

Straightening pages before text detection

We developed a heuristic-based method that estimates the page skew and rotates the page back before forwarding it to any deep learning model. Our thanks to @Rob192 for his contribution on this part 🙏

This behaviour can be enabled to avoid retraining the text detection models. However, the heuristic approach has its limits in terms of robustness.
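To make the idea of a skew heuristic more concrete, here is a minimal sketch in plain Python. It is not the method implemented in docTR: it assumes some earlier step has already produced line segments that follow the text lines, and it takes the median of their angles as the page skew. The names `estimate_skew_degrees` and `rotate_point`, and the `segments` input format, are hypothetical.

```python
import math
from statistics import median


def estimate_skew_degrees(segments):
    """Estimate page skew as the median angle of text-line segments.

    `segments` is a list of ((x0, y0), (x1, y1)) pairs. This input format is
    an assumption made for illustration, not a docTR data structure.
    """
    angles = []
    for (x0, y0), (x1, y1) in segments:
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
        # Fold into (-90, 90] so a segment and its reverse give the same angle
        if angle > 90:
            angle -= 180
        elif angle <= -90:
            angle += 180
        angles.append(angle)
    # The median limits the influence of a few spurious segments
    return median(angles) if angles else 0.0


def rotate_point(point, center, degrees):
    """Rotate `point` around `center` by `degrees` with a 2D rotation matrix."""
    theta = math.radians(degrees)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


# Three segments sharing the same slope, plus one outlier
segments = [
    ((0, 0), (100, 17.6)),
    ((0, 50), (100, 67.6)),
    ((0, 100), (100, 117.6)),
    ((0, 0), (10, 40)),
]
skew = estimate_skew_degrees(segments)
# Rotating by the opposite angle brings the segment end point back onto the x axis
straightened = rotate_point((100, 17.6), (0, 0), -skew)
```

The median ignores the single outlier here, but a page with few or noisy text lines would still defeat this kind of estimate, which is the robustness limit mentioned above.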

Text detection training with rotated images

[Image: doctr_sample]

The core of this project was to enable our text detection models to produce non-degraded heatmaps & localization candidates when processing a rotated page.

Crop orientation resolution

[Image: rot2]

Finally, once the localization candidates have been extracted, there is no guarantee that each candidate reads from left to right. To remove this ambiguity, a lightweight image orientation classifier was added to refine the crops that will be sent to text recognition!
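As an illustration of this last step, the sketch below undoes a predicted rotation on a crop represented as a plain list of rows. It assumes a hypothetical classifier that outputs a number of clockwise quarter-turns (0 to 3); the classifier itself is not shown, and none of these function names come from docTR.

```python
def rotate90_cw(crop):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*crop[::-1])]


def rotate90_ccw(crop):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*crop)][::-1]


def straighten_crop(crop, predicted_turns):
    """Undo `predicted_turns` clockwise quarter-turns.

    `predicted_turns` stands in for the output of a hypothetical
    orientation classifier (0, 1, 2 or 3).
    """
    for _ in range(predicted_turns % 4):
        crop = rotate90_ccw(crop)
    return crop


# A tiny 2x3 "crop" whose values stand in for pixels
upright = [[1, 2, 3],
           [4, 5, 6]]
rotated = rotate90_cw(upright)          # what the detector might hand over
restored = straighten_crop(rotated, 1)  # classifier says: one clockwise turn
```

Once the crop is back in reading orientation, it can be passed to text recognition like any other crop.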

🦓 A wider pretrained classification model zoo 🦓

Training stability on complex deep learning tasks owes a lot to transfer learning. As such, OCR models usually rely on a pretrained backbone as a feature extractor. For this reason, all checkpoints of classification models in both PyTorch & TensorFlow have been updated 🚀
They were trained using our synthetic character classification dataset. For more details, see Character classification training.

🖼️ New public datasets join the fray

Thanks to @felixdittrich92, the list of supported datasets has considerably grown 🥳
This includes widely popular datasets used for benchmarks on OCR-related tasks. You can find the full list over here 👉 #587

Synthetic text recognition dataset

Additionally, we followed up on the existing CharGenerator by introducing WordGenerator:

  • it generates an image of a word whose length is randomly sampled within a specified range, with characters randomly sampled from the specified vocab.
  • you can even pass a list of fonts so that the font family of each word is randomly picked among them

Below are some samples using font_size=32:
[Image: wordgenerator_sample]
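The sampling logic described in the two bullet points can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the WordGenerator implementation: the function name `sample_word`, its parameters and the font file names are hypothetical, and the rendering of the word into an image is left out.

```python
import random


def sample_word(vocab, min_chars, max_chars, fonts=None, rng=random):
    """Sample a random word and, optionally, a font family for it.

    All parameter names are illustrative assumptions, not the docTR API.
    """
    # Word length is drawn uniformly within the requested range (inclusive)
    length = rng.randint(min_chars, max_chars)
    # Each character is drawn independently from the vocab
    word = "".join(rng.choice(vocab) for _ in range(length))
    # One font is picked per word when a list of fonts is given
    font = rng.choice(fonts) if fonts else None
    return word, font


rng = random.Random(0)  # seeded for reproducible samples
samples = [
    sample_word("abcdef0123", 1, 8, fonts=["FontA.ttf", "FontB.ttf"], rng=rng)
    for _ in range(5)
]
```

Each sampled pair would then be rendered as an image with the chosen font, and the word itself serves as the recognition target.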

📑 New notebooks

Two new notebooks have made their way into the documentation:

  • producing searchable PDFs from docTR analysis results
  • introduction to document artefact detection (QR code, bar codes, ID pictures, etc.) with docTR


Breaking changes

Revamp of classification models

With the retraining of all classification backbones, several changes have been introduced:

  • Model naming: linknet16 --> linknet_resnet18
  • Architecture changes: all classification backbones are available with a classification head now.

Enforcing relative coordinates in datasets

In order to unify our data pipelines, we now enforce the conversion to relative coordinates on all datasets!

0.4.1:
>>> from doctr.datasets import FUNSD
>>> ds = FUNSD(train=True, download=True)
>>> img, target = ds[0]
>>> target['boxes'].dtype, target['boxes'].max()
(dtype('int64'), 862)

0.5.0:
>>> from doctr.datasets import FUNSD
>>> ds = FUNSD(train=True, download=True)
>>> img, target = ds[0]
>>> target['boxes'].dtype, target['boxes'].max()
(dtype('float32'), 0.98341835)
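If your own code relied on absolute pixel coordinates, the conversion in either direction is a simple scaling by the image size. The sketch below assumes boxes in (xmin, ymin, xmax, ymax) order and uses hypothetical helper names; it is not part of the docTR API.

```python
def to_relative(boxes, width, height):
    """Convert absolute pixel boxes to relative coordinates in [0, 1].

    Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.
    """
    return [
        (xmin / width, ymin / height, xmax / width, ymax / height)
        for xmin, ymin, xmax, ymax in boxes
    ]


def to_absolute(boxes, width, height):
    """Convert relative boxes back to integer pixel coordinates."""
    return [
        (round(xmin * width), round(ymin * height),
         round(xmax * width), round(ymax * height))
        for xmin, ymin, xmax, ymax in boxes
    ]


# One box on a page that is 1000 pixels wide and 800 pixels high
abs_boxes = [(100, 200, 300, 400)]
rel_boxes = to_relative(abs_boxes, width=1000, height=800)
round_trip = to_absolute(rel_boxes, width=1000, height=800)
```

Relative coordinates stay valid when the image is resized, which is one reason a single convention simplifies a data pipeline.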

Full changelog

Breaking Changes 🛠

New Features

Bug Fixes

Improvements

Miscellaneous

Full Changelog: v0.4.1...v0.5.0