---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# hocr  <img src="man/figures/logo.png" align="right" />

The goal of **`hocr`** is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, `hocr_parse`, which converts [hOCR (XHTML) output](https://en.wikipedia.org/wiki/HOCR) into a tidy tibble with one word per row. In addition to the columns exported by [`tesseract::ocr_data`](https://github.com/ropensci/tesseract), `hocr` outputs metadata describing how words are organized into lines, paragraphs, content areas and pages. Read more about the hOCR specification [here](https://github.com/kba/hocr-spec).
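To make the parsing step more concrete: in hOCR, each recognized word is an element whose `title` attribute carries properties such as `bbox` and `x_wconf`. The base R sketch below splits one such string into named values. It is only an illustration of the idea; `parse_hocr_title` is a hypothetical helper that is not part of `hocr`, and the sample string is made up.

```r
# Hypothetical helper (not part of hocr): split an hOCR `title` attribute
# such as "bbox 36 92 96 116; x_wconf 95" into a named list of properties.
parse_hocr_title <- function(title) {
  # Properties are separated by semicolons; each one is "name value(s)"
  props <- trimws(strsplit(title, ";", fixed = TRUE)[[1]])
  keys  <- sub("\\s.*$", "", props)          # text before the first space
  vals  <- trimws(sub("^\\S+", "", props))   # everything after the name
  as.list(setNames(vals, keys))
}

# Made-up sample value for a single word element
word <- parse_hocr_title("bbox 36 92 96 116; x_wconf 95")
word$bbox     # the four bbox coordinates as one string
word$x_wconf  # the word confidence as a string
```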

One of the key elements of the hOCR format is the "bounding box" (bbox): a rectangular region of the image covering the extent of a word recognized by `tesseract`. The bbox can be used to extract the corresponding part of the image, for example with the `magick` package, with the help of the `bbox_to_geometry` helper function.
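As a rough illustration of what such a conversion involves, the base R sketch below turns a bbox given as two corners (`x1 y1 x2 y2`) into a `magick`-style geometry string of the form `WIDTHxHEIGHT+X+Y`. The function `bbox_to_geometry_sketch` is a hypothetical stand-in written for this example; it does not reproduce the package's own `bbox_to_geometry`, and it assumes the coordinates are separated by spaces or commas.

```r
# Hypothetical stand-in, not the package's bbox_to_geometry():
# convert "x1 y1 x2 y2" into a "WIDTHxHEIGHT+X+Y" geometry string.
bbox_to_geometry_sketch <- function(bbox) {
  # Assumption: coordinates are separated by spaces and/or commas
  xy <- as.integer(strsplit(trimws(bbox), "[ ,]+")[[1]])
  stopifnot(length(xy) == 4, !anyNA(xy))
  width  <- xy[3] - xy[1]
  height <- xy[4] - xy[2]
  # Offsets are the upper-left corner of the box
  sprintf("%dx%d+%d+%d", width, height, xy[1], xy[2])
}

geometry <- bbox_to_geometry_sketch("36 92 96 116")
geometry
#> [1] "60x24+36+92"
```

A string in this form could then be passed to a cropping function such as `magick::image_crop()`.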

`hocr` also includes tidiers for common hOCR-capable systems. As of version 0.0.9000, only the `tesseract` output format is supported; support for [`OCRopus`](https://github.com/tmbdev/ocropy) is planned.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")
```

## Example

This basic example shows how to OCR a page with `tesseract` and parse the resulting hOCR output into a tidy tibble. First, load the packages:

```{r, message=FALSE, warning=FALSE}
library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
#devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots
```

We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using the `magick` package (see the image preparation script on [GitHub](https://github.com/dmi3kno/hocr/blob/master/data-raw/prepare.R)).

```{r example}
cupcakes <- system.file("extdata", "peanutbutter.png", package="hocr")


recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>% 
  hocr::hocr_parse() %>% 
  hocr::tidy_tesseract()
recipe
```
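Because every word row carries its paragraph identifier, paragraph-level summaries are straightforward. The sketch below uses a small made-up data frame in place of `recipe`; the column names `ocr_par_id` and `ocrx_word_conf` follow the output above, while the `word` column name and all values are assumptions made for illustration.

```r
# Toy stand-in for the parsed output; `word` is a hypothetical column name
toy <- data.frame(
  ocr_par_id     = c(1, 1, 1, 2, 2),
  word           = c("Peanut", "butter", "cupcakes", "Preheat", "oven"),
  ocrx_word_conf = c(96, 91, 88, 72, 60),
  stringsAsFactors = FALSE
)

# Rebuild the text of each paragraph and average its word-level confidence
par_text <- tapply(toy$word, toy$ocr_par_id, paste, collapse = " ")
par_conf <- tapply(toy$ocrx_word_conf, toy$ocr_par_id, mean)

par_summary <- data.frame(
  ocr_par_id = as.integer(names(par_text)),
  text       = as.vector(par_text),
  mean_conf  = as.vector(par_conf),
  stringsAsFactors = FALSE
)
par_summary
```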

Now that the data is in tidy format, let's render the page in ggplot and draw the word bounding boxes to illustrate the benefits of the parsed document structure. `tesseract` outputs bboxes in a coordinate system with the origin in the upper-left corner. We will transform all y-values to a bottom-left origin and plot the bounding boxes alongside the original picture, filled by `tesseract` confidence score and outlined by paragraph.

```{r, fig.width=8, fig.height=6}
p1 <- recipe %>% 
  # split each word bbox into four numeric columns
  mutate(ocrx_word_bbox = lapply(ocrx_word_bbox, function(x) 
    separate(as_tibble(x), value, into = c("word_x1", "word_y1", "word_x2", "word_y2"), convert = TRUE))) %>% 
  unnest(ocrx_word_bbox) %>% 
  # do the same for the page bbox, which gives the page height
  mutate(ocr_page_bbox = lapply(ocr_page_bbox, function(x) 
    separate(as_tibble(x), value, into = c("page_x1", "page_y1", "page_x2", "page_y2"), convert = TRUE))) %>% 
  unnest(ocr_page_bbox) %>% 
  # flip y-values from an upper-left to a bottom-left origin
  mutate(word_y1 = max(page_y2) - word_y1,
         word_y2 = max(page_y2) - word_y2) %>% 
  ggplot(aes(xmin = word_x1, ymin = word_y1, xmax = word_x2, ymax = word_y2)) +
  geom_rect(aes(color = ocr_par_id, fill = ocrx_word_conf), show.legend = TRUE) +
  theme_minimal() +
  theme(panel.grid = element_blank(), 
        axis.text = element_text(size = 7), 
        legend.text = element_text(size = 7), 
        legend.title = element_text(size = 7))

library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate=TRUE)

p1+p2

```
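The coordinate flip used in the plot can be checked in isolation. The base R sketch below applies the same `page height - y` transformation to a single made-up bounding box; the page height and coordinates are arbitrary illustration values, not taken from the recipe image.

```r
# Arbitrary illustration values: a 100-pixel-tall page and one word box
page_height <- 100
word_box <- data.frame(x1 = 10, y1 = 20, x2 = 40, y2 = 35)

# In the upper-left system y grows downwards, so y1 = 20 is the top edge.
# Subtracting from the page height measures the same edges from the bottom.
flipped <- transform(word_box,
                     y1 = page_height - y1,
                     y2 = page_height - y2)
flipped
```

After the flip the top edge (`y1`) has the larger y-value, which matches a plot whose origin is in the bottom-left corner.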

Similar projects are listed [here](https://github.com/kba/awesome-ocr#hocr).

[1] Rosenberg, L. M. (1986). Muffins & cupcakes. American Cooking Guild, Gaithersburg, MD. Open Library edition OL1484439M. Accessed from https://archive.org/details/muffinscupcakes00rose on 28 July 2018.