hocr

The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata describing the organization of words into lines, paragraphs, content areas and pages. Read more about the hOCR specification here.
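To give a feel for what the parser has to untangle, here is a minimal base-R sketch that splits the `title` attribute of a single hOCR word element into named properties. The function `parse_hocr_title` is hypothetical and is not part of hocr; the sample string assumes the usual tesseract layout of `bbox` followed by `x_wconf`, using the values of the first word in the example further down.

```r
# Hypothetical helper for illustration; not part of the hocr package
parse_hocr_title <- function(title) {
  # properties are separated by ";", e.g. "bbox 38 58 271 103; x_wconf 85"
  props <- trimws(strsplit(title, ";", fixed = TRUE)[[1]])
  # the first token is the property name, the remainder is its value
  keys   <- sub(" .*$", "", props)
  values <- sub("^[^ ]+ ?", "", props)
  setNames(as.list(values), keys)
}

word <- parse_hocr_title("bbox 38 58 271 103; x_wconf 85")
word$bbox                 # "38 58 271 103"
as.numeric(word$x_wconf)  # 85
```

hocr_parse does this kind of work for every word, line, paragraph, content area and page in the document, so the sketch is only meant to show the shape of the raw data.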

One of the key elements of the hOCR format is the “bounding box” (bbox): a rectangular region of the image covering the extent of a word recognized by tesseract. The bbox can be used to extract the corresponding part of the image, for example with the magick package, after converting it with the bbox_to_geometry helper function.
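As a rough illustration of the conversion involved, the sketch below turns an `x1 y1 x2 y2` bbox string into an ImageMagick-style geometry string (`<width>x<height>+<x offset>+<y offset>`). The function `bbox_to_geometry_sketch` is a hypothetical stand-in written for this example; the actual bbox_to_geometry helper in hocr may behave differently.

```r
# Hypothetical stand-in for illustration; hocr's own bbox_to_geometry may differ
bbox_to_geometry_sketch <- function(bbox) {
  # split "x1 y1 x2 y2" into four integers
  xy <- as.integer(strsplit(bbox, " ", fixed = TRUE)[[1]])
  width  <- xy[3] - xy[1]
  height <- xy[4] - xy[2]
  # ImageMagick geometry: <width>x<height>+<x offset>+<y offset>
  sprintf("%dx%d+%d+%d", width, height, xy[1], xy[2])
}

geom <- bbox_to_geometry_sketch("38 58 271 103")
geom  # "233x45+38+58"

# With magick installed, the word could then be cropped along these lines:
# magick::image_crop(magick::image_read(path_to_image), geom)
```

Here `path_to_image` is a placeholder for the scanned page that was passed to tesseract.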

hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000, only the tesseract output format is supported; support for OCRopus will be added in the future.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")

Example

This is a basic example which shows you how to solve a common problem:

library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
#devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots

We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using the magick package (see the image preparation script on GitHub).

cupcakes <- system.file("extdata", "peanutbutter.png", package="hocr")


recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>% 
  hocr::hocr_parse() %>% 
  hocr::tidy_tesseract()
recipe
#> # A tibble: 234 x 21
#>    ocrx_word_id ocrx_word_bbox ocrx_word_conf ocrx_word_tag ocrx_word_value
#>    <chr>        <chr>                   <dbl> <chr>         <chr>          
#>  1 word_1_1     38 58 271 103              85 strong        Chocolate      
#>  2 word_1_2     287 61 451 103             86 text          Peanut         
#>  3 word_1_3     468 62 619 103             89 text          Butter         
#>  4 word_1_4     636 60 852 113             84 strong        Cupcakes       
#>  5 word_1_5     36 153 112 182             87 strong        Your           
#>  6 word_1_6     123 152 184 1~             88 strong        kids           
#>  7 word_1_7     196 152 250 1~             88 strong        will           
#>  8 word_1_8     264 152 324 1~             85 strong        love           
#>  9 word_1_9     337 152 417 1~             84 strong        these          
#> 10 word_1_10    431 154 472 1~             90 text          (as            
#> # ... with 224 more rows, and 16 more variables: ocr_line_id <chr>,
#> #   ocr_line_bbox <chr>, ocr_line_xbaseline <dbl>,
#> #   ocr_line_ybaseline <dbl>, ocr_line_xsize <dbl>,
#> #   ocr_line_xdescenders <dbl>, ocr_line_xascenders <dbl>,
#> #   ocr_par_id <chr>, ocr_par_lang <chr>, ocr_par_bbox <chr>,
#> #   ocr_carea_id <chr>, ocr_carea_bbox <chr>, ocr_page_id <chr>,
#> #   ocr_page_image <chr>, ocr_page_bbox <chr>, ocr_page_no <dbl>

Now that the data is in tidy format, let's render the page in ggplot and draw bounding boxes around words and paragraphs to illustrate the benefits of the parsed document structure. tesseract outputs bboxes in a coordinate system with its origin in the upper-left corner. We will transform all y-values to a bottom-left origin and plot the bounding boxes alongside the original picture, with each box filled by tesseract confidence score and outlined by paragraph.

p1 <- recipe %>% 
  # split the "x1 y1 x2 y2" bbox strings into integer columns
  separate(ocrx_word_bbox, into=c("word_x1", "word_y1", "word_x2", "word_y2"), convert = TRUE) %>% 
  separate(ocr_page_bbox, into=c("page_x1", "page_y1", "page_x2", "page_y2"), convert = TRUE) %>% 
  # flip y-values from a top-left to a bottom-left origin
  mutate(word_y1=max(page_y2)-word_y1,
         word_y2=max(page_y2)-word_y2) %>% 
  # fill shows word confidence, outline color shows paragraph membership
  ggplot(aes(xmin=word_x1, ymin=word_y1, xmax=word_x2, ymax=word_y2))+
  geom_rect(aes(color=ocr_par_id, fill=ocrx_word_conf), show.legend = TRUE)+
  theme_minimal()+
  theme(panel.grid = element_blank(), 
        axis.text = element_text(size = 7), 
        legend.text = element_text(size = 7), 
        legend.title = element_text(size = 7))

library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate=TRUE)

p1+p2
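The y-axis flip used in the plot above can be checked in isolation. The following base-R sketch applies it to a single word bbox; the page height of 1000 is an assumed value chosen for illustration, whereas the plotting code takes it from ocr_page_bbox.

```r
# Hypothetical values for illustration
word_bbox   <- "38 58 271 103"  # x1 y1 x2 y2, origin at the upper-left corner
page_height <- 1000             # assumed; the plotting code derives it from ocr_page_bbox

xy <- as.integer(strsplit(word_bbox, " ", fixed = TRUE)[[1]])

# Flip the y-values so that the origin moves to the bottom-left corner
flipped <- c(x1 = xy[1], y1 = page_height - xy[2],
             x2 = xy[3], y2 = page_height - xy[4])
flipped
```

After the flip, y1 (the former top edge) is larger than y2, which is what a plot with a bottom-left origin expects for a box near the top of the page.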

Similar projects are listed here

[1] Rosenberg, L. M. (1986). Muffins & Cupcakes. American Cooking Guild, Gaithersburg, MD. Open Library edition OL1484439M. Accessed from https://archive.org/details/muffinscupcakes00rose on 28 July 2018.
