Merge pull request #6 from UB-Mannheim/new
Merge changes by UB-Mannheim
zuphilip committed May 16, 2016
2 parents 839945b + 8fd1695 commit 74d9f3e
Showing 23 changed files with 370 additions and 156 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
/venv
/*_venv
/build/
/dist/
/*.egg-info/
72 changes: 0 additions & 72 deletions README

This file was deleted.

180 changes: 167 additions & 13 deletions README.md
@@ -1,20 +1,174 @@
# About
# hocr-tools

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
* [About](#about)
* [About the code](#about-the-code)
* [Pointers](#pointers)
* [Installation](#installation)
* [System-wide](#system-wide)
* [Virtualenv](#virtualenv)
* [Available Programs](#available-programs)
* [hocr-check](#hocr-check) -- check the hOCR file for errors
* [hocr-combine](#hocr-combine) -- combine pages in multiple hOCR files into a single document
* [hocr-eval](#hocr-eval) -- compute number of segmentation and OCR errors
* [hocr-eval-geom](#hocr-eval-geom) -- compute over, under, and mis-segmentations
* [hocr-eval-lines](#hocr-eval-lines) -- compute OCR errors of hOCR output relative to text ground truth
* [hocr-extract-g1000](#hocr-extract-g1000) -- extract lines from Google 1000 book sample
* [hocr-extract-images](#hocr-extract-images) -- extract the images and texts within all the ocr_line elements
* [hocr-lines](#hocr-lines) -- extract the text within all the ocr_line elements
* [hocr-merge-dc](#hocr-merge-dc) -- merge Dublin Core meta data into the hOCR HTML header
* [hocr-pdf](#hocr-pdf) -- create a searchable PDF from a pile of hOCR and JPEG
* [hocr-split](#hocr-split) -- split an hOCR file into individual pages

## About

hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information.
It embeds this information invisibly in standard HTML.
By building on standard HTML, it automatically inherits well-defined support
for most scripts, languages, and common layout options.
Furthermore, unlike previous OCR formats, the recognized text and OCR-related
information co-exist in the same file and survive editing and manipulation.
hOCR markup is independent of the presentation.

There is a [Public Specification](http://docs.google.com/View?docid=dfxcv4vc_67g844kf) for the hOCR Format.
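
As a rough illustration of how this information is embedded: hOCR elements carry their properties in the HTML `title` attribute as semicolon-separated entries such as `bbox 10 20 110 45`. The following is a minimal sketch of reading those properties, assuming that layout; the helper name `parse_title` is hypothetical and not part of hocr-tools.

```python
def parse_title(title):
    """Split an hOCR title attribute into a {name: [values]} dict."""
    props = {}
    for entry in title.split(';'):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries, e.g. after a trailing ';'
        props[parts[0]] = parts[1:]
    return props

# Hypothetical title string, as it might appear on an ocr_line element.
props = parse_title("bbox 10 20 110 45; x_wconf 93")
bbox = [int(v) for v in props["bbox"]]
print(bbox)
```

Keeping the values as strings until they are needed avoids guessing the type of properties other than `bbox`.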

# Available Programs
### About the code

Each command line program is self-contained; if you have the right
Python packages installed, it should just work. (Unfortunately, that
means some code duplication; we may revisit this issue in later
revisions.)

### Pointers

The format itself is defined here:

http://docs.google.com/View?docID=dfxcv4vc_67g844kf&revision=_latest

## Installation

### System-wide

On a Debian/Ubuntu system, install the dependencies from packages:

```
sudo apt-get install python-lxml python-reportlab python-pil \
python-beautifulsoup python-numpy python-scipy python-matplotlib
```

Or, to fetch dependencies from the [cheese shop](https://pypi.python.org/pypi):

```
sudo pip install -r requirements.txt # basic
```

Then install the dist:

```
sudo python setup.py install
```

### Virtualenv

Once, to create the environment and install the dependencies:

```
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

Subsequently, to activate the environment and run a program:

```
source venv/bin/activate
./hocr-...
```

## Available Programs

Included command line programs:

* hocr-check -- check the hOCR file for errors
* hocr-combine -- combine pages in multiple hOCR files into a single document
* hocr-eval -- compute number of segmentation and OCR errors
* hocr-eval-geom -- compute over, under, and mis-segmentations
* hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth
* hocr-extract-images -- extract the images and texts within all the ocr_line elements
* hocr-lines -- extract the text within all the ocr_line elements
* hocr-pdf -- create a searchable PDF from a pile of hOCR and JPEG
* hocr-split -- split an hOCR file into individual pages
* hocr-merge-dc -- merge Dublin Core meta data into the hOCR HTML header
### hocr-check

```
hocr-check file.html
```

Perform consistency checks on the hOCR file.
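
One way to picture such a consistency check is to verify that every `ocr_line` element is nested inside an `ocr_page` element. The sketch below is an illustration only, using the standard library's `xml.etree.ElementTree` on a well-formed XHTML snippet; it is not the code hocr-check runs, and `lines_outside_pages` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def lines_outside_pages(root):
    """Return ocr_line elements that are not nested inside any ocr_page."""
    inside = set()
    for page in root.iter():
        if page.get('class') == 'ocr_page':
            # Collect every ocr_line found below this page.
            for node in page.iter():
                if node.get('class') == 'ocr_line':
                    inside.add(node)
    return [n for n in root.iter()
            if n.get('class') == 'ocr_line' and n not in inside]

# Made-up sample document: one line inside a page, one stray line.
sample = """<html><body>
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 110 45">inside</span>
</div>
<span class="ocr_line" title="bbox 10 60 110 85">stray</span>
</body></html>"""

stray = lines_outside_pages(ET.fromstring(sample))
print([n.text for n in stray])
```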

### hocr-combine

```
hocr-combine file1.html file2.html...
```

Combine the OCR pages contained in each HTML file into a single document.
The document metadata is taken from the first file.

### hocr-eval-lines

```
hocr-eval-lines [-v] true-lines.txt hocr-actual.html
```

Evaluate hOCR output against ASCII ground truth. This evaluation method
requires that the line breaks in true-lines.txt and the ocr_line elements
in hocr-actual.html agree (most ASCII output from OCR systems satisfies this
requirement).

### hocr-eval-geom

```
hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual
```

Compare the segmentations at the level of the element name (default: ocr_line).
Computes undersegmentation, oversegmentation, and missegmentation.
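
To make the terms concrete, the sketch below counts over- and under-segmentation for axis-aligned boxes under one plausible reading: a ground-truth box is over-segmented when more than one actual box overlaps it significantly, and an actual box is under-segmented when it significantly overlaps more than one ground-truth box. The overlap measure (intersection area over the smaller box's area), the 0.5 threshold, and all function names are assumptions for illustration; hocr-eval-geom may define them differently.

```python
def area(b):
    """Area of a box (x0, y0, x1, y1); degenerate boxes have area 0."""
    x0, y0, x1, y1 = b
    return max(0, x1 - x0) * max(0, y1 - y0)

def overlap(a, b):
    """Intersection area divided by the area of the smaller box."""
    inter = (max(a[0], b[0]), max(a[1], b[1]),
             min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(area(a), area(b))
    return area(inter) / float(smaller) if smaller else 0.0

def count_multi(boxes, others, threshold=0.5):
    """Count boxes significantly overlapped by more than one of `others`."""
    return sum(1 for b in boxes
               if sum(overlap(b, o) >= threshold for o in others) > 1)

# Made-up data: the first truth line is split in two by the actual output.
truth = [(0, 0, 100, 10), (0, 20, 100, 30)]
actual = [(0, 0, 50, 10), (50, 0, 100, 10), (0, 20, 100, 30)]

over = count_multi(truth, actual)   # truth boxes split into several actual boxes
under = count_multi(actual, truth)  # actual boxes merging several truth boxes
print(over, under)
```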

### hocr-eval

```
hocr-eval hocr-true.html hocr-actual.html
```

Evaluate the actual OCR with respect to the ground truth. This outputs
the number of OCR errors due to incorrect segmentation and the number
of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each
segmentation component that can be aligned, computing the string edit distance
of the text the segmentation component contains.
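
The string edit distance step can be sketched as a standard Levenshtein distance with unit costs. This is an assumption: the text above does not state which cost model hocr-eval uses, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

# Ground-truth text versus recognized text for one aligned component.
print(edit_distance("recognized", "recogmzed"))
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of one string rather than the product of both.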

### hocr-extract-g1000

Extract lines from the [Google 1000 book sample](http://commondatastorage.googleapis.com/books/icdar2007/README.txt).

### hocr-extract-images

TODO

### hocr-lines

TODO

### hocr-merge-dc

```
hocr-merge-dc dc.xml hocr.html > hocr-new.html
```

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.

### hocr-pdf

TODO

### hocr-split

```
hocr-split file.html pattern
```

Split a multipage hOCR file into hOCR files containing one page each.
The pattern should be something like "base-%03d.html".
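
A pattern of this form is a printf-style format string, which Python expands with the `%` operator. The short sketch below shows the file names such a pattern yields; starting the page count at 1 is an assumption, since the numbering used by hocr-split is not stated here.

```python
pattern = "base-%03d.html"

# %03d pads the page number with zeros to three digits.
names = [pattern % page for page in range(1, 4)]
print(names)
```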
35 changes: 17 additions & 18 deletions hocr-check
@@ -1,10 +1,9 @@
#!/usr/bin/python
#!/usr/bin/env python

# check the given file for conformance with the hOCR format spec

import sys,os,string,re,getopt
from xml.dom.ext.reader import HtmlLib
from xml.xpath import Evaluate as xquery
from lxml import html

################################################################
### misc library code
@@ -18,7 +17,7 @@ def assoc(key,list):
### node properties

def get_prop(node,name):
title = node.getAttributeNS(None,'title')
title = node.get('title')
if not title: return None
props = title.split(';')
for prop in props:
@@ -66,50 +65,50 @@ nooverlap = (assoc('-o',optlist)=='')
if len(args)>0: stream = open(args[0])
elif len(args)>1: raise "can only check one file at a time"
else: stream = sys.stdin
doc = HtmlLib.Reader().fromString(stream.read())
doc = html.fromstring(stream.read())

################################################################
### XML structure checks
################################################################

# check for presence of meta information
assert xquery("//META[@name='ocr-id']",doc)!=[]
assert xquery("//META[@name='ocr-recognized']",doc)!=[]
assert doc.xpath("//meta[@name='ocr-id']")!=[]
assert doc.xpath("//meta[@name='ocr-recognized']")!=[]

# check for presence of page
assert xquery("//*[@class='ocr_page']",doc)!=[]
assert doc.xpath("//*[@class='ocr_page']")!=[]

# check that lines are inside pages
lines = xquery("//*[@class='ocr_line']",doc.documentElement)
lines = doc.xpath("//*[@class='ocr_line']")
for line in lines:
assert xquery("//*[@class='ocr_page']",line)
assert line.xpath("//*[@class='ocr_page']")

# check that pars are inside pages
pars = xquery("//*[@class='ocr_par']",doc.documentElement)
pars = doc.xpath("//*[@class='ocr_par']")
for par in pars:
assert xquery("//*[@class='ocr_page']",par)
assert par.xpath("//*[@class='ocr_page']")

# check that columns are inside pages
columns = xquery("//*[@class='ocr_column']",doc.documentElement)
columns = doc.xpath("//*[@class='ocr_column']")
for column in columns:
assert xquery("//*[@class='ocr_page']",column)
assert column.xpath("//*[@class='ocr_page']")

################################################################
### geometric checks
################################################################

if not nooverlap:
for page in xquery("//*[@class='ocr_page']",doc):
for page in doc.xpath("//*[@class='ocr_page']"):
# check lines
objs = xquery("//*[@class='ocr_line']",page)
objs = page.xpath("//*[@class='ocr_line']")
line_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
assert mostly_nonoverlapping(line_bboxes)
# check paragraphs
objs = xquery("//*[@class='ocr_par']",page)
objs = page.xpath("//*[@class='ocr_par']")
par_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
assert mostly_nonoverlapping(par_bboxes)
# check columns
objs = xquery("//*[@class='ocr_column']",page)
objs = page.xpath("//*[@class='ocr_column']")
column_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
assert mostly_nonoverlapping(column_bboxes)
