-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
87 lines (65 loc) · 3.85 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# hocr <img src="man/figures/logo.png" align="right" />
The goal of **`hocr`** is to facilitate post-OCR data processing and wrangling. The package exposes hocr parcer, `hocr_parse`, which converts [XHTML format output](https://en.wikipedia.org/wiki/HOCR) into tidy tibble with one word per row. In addition to the columns exported by [`tesseract::ocr_data`](https://github.com/ropensci/tesseract), `hocr` outputs additional metadata regarding organization of words into lines, paragraphs, content areas and pages. Read more about hOCR specification [here](https://github.com/kba/hocr-spec).
One of the key elements of `hocr` format is "bounding box" - a rectangular region of the image covering the extent of the word recognized by `tesseract`. This bbox can be used to extract respective part of the image using, for example `magick` package, using `bbox_to_geometry` helper function.
`hocr` aslo includes tidiers for common hOCR-capable systems. As of version 0.0.9000 only `tesseract` output format is supported, but in the future, support for [`OCRopus`](https://github.com/tmbdev/ocropy) will be added.
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")
```
## Example
This is a basic example which shows you how to solve a common problem:
```{r, message=FALSE, warning=FALSE}
library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
#devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots
```
We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using `magick` package (see image preparation script on [github](https://github.com/dmi3kno/hocr/blob/master/data-raw/prepare.R)).
```{r example}
cupcakes <- system.file("extdata", "peanutbutter.png", package="hocr")
recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>%
hocr::hocr_parse() %>%
hocr::tidy_tesseract()
recipe
```
Now that data is in the tidy format, lets render the page in ggplot and identify bounding boxes around words and paragraphs to illustrate the benefits of parsed document structure. `tesseract` outputs bboxes in upper-left corner coordinate system. We will transform all y-values to bottom-left scale and plot the bounding boxes alongside with the original picture, colored by `tesseract` confidence score.
```{r, fig.width=8, fig.height=6}
p1 <- recipe %>%
mutate(ocrx_word_bbox=lapply(ocrx_word_bbox, function(x)
separate(as_tibble(x), value, into=c("word_x1", "word_y1", "word_x2", "word_y2"), convert = TRUE))) %>%
unnest(ocrx_word_bbox) %>%
mutate(ocr_page_bbox=lapply(ocr_page_bbox, function(x)
separate(as_tibble(x), value, into=c("page_x1", "page_y1", "page_x2", "page_y2"), convert = TRUE))) %>%
unnest(ocr_page_bbox) %>%
mutate(word_y1=max(page_y2)-word_y1,
word_y2=max(page_y2)-word_y2) %>%
ggplot(aes(xmin=word_x1, ymin=word_y1, xmax=word_x2, ymax=word_y2))+
geom_rect(aes(color=ocr_par_id, fill=ocrx_word_conf), show.legend = TRUE)+
theme_minimal()+
theme(panel.grid = element_blank(),
axis.text = element_text(size = 7),
legend.text = element_text(size = 7),
legend.title = element_text(size = 7))
library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate=TRUE)
p1+p2
```
Similar projects are listed [here](https://github.com/kba/awesome-ocr#hocr)
[1] Rosenberg L. M.(1986) Muffins & cupcakes, American Cooking Guild, Gaithersburg, MD. Openlibrary edition OL1484439M. Accessed from: https://archive.org/details/muffinscupcakes00rose on 28 July 2018