Merge pull request #6 from UB-Mannheim/new

Merge changes by UB-Mannheim
ocropus · May 16, 2016
2 parents 839945b + 8fd1695
commit 74d9f3e
Showing 23 changed files with 370 additions and 156 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+/venv
+/*_venv
+/build/
+/dist/
+/*.egg-info/
diff --git a/README b/README
diff --git a/README.md b/README.md
@@ -1,20 +1,174 @@
-# About
+# hocr-tools
 
-hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information.  It embeds this information invisibly in standard HTML.  By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options.  Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survives editing and manipulation.  hOCR markup is independent of the presentation.
+  * [About](#about)
+    * [About the code](#about-the-code)
+    * [Pointers](#pointers)
+  * [Installation](#installation)
+    * [System-wide](#system-wide)
+    * [Virtualenv](#virtualenv)
+  * [Available Programs](#available-programs)
+    * [hocr-check](#hocr-check) -- check the hOCR file for errors
+    * [hocr-combine](#hocr-combine) -- combine pages in multiple hOCR files into a single document
+    * [hocr-eval](#hocr-eval) -- compute number of segmentation and OCR errors
+    * [hocr-eval-geom](#hocr-eval-geom) -- compute over, under, and mis-segmentations
+    * [hocr-eval-lines](#hocr-eval-lines) -- compute OCR errors of hOCR output relative to text ground truth
+    * [hocr-extract-g1000](#hocr-extract-g1000) -- extract lines from Google 1000 book sample
+    * [hocr-extract-images](#hocr-extract-images) -- extract the images and texts within all the ocr_line elements
+    * [hocr-lines](#hocr-lines) -- extract the text within all the ocr_line elements
+    * [hocr-merge-dc](#hocr-merge-dc) -- merge Dublin Core meta data into the hOCR HTML header
+    * [hocr-pdf](#hocr-pdf) -- create a searchable PDF from a pile of hOCR and JPEG
+    * [hocr-split](#hocr-split) -- split an hOCR file into individual pages
+
+## About
+
+hOCR is a format for representing OCR output, including layout information,
+character confidences, bounding boxes, and style information.
+It embeds this information invisibly in standard HTML.
+By building on standard HTML, it automatically inherits well-defined support
+for most scripts, languages, and common layout options.
+Furthermore, unlike previous OCR formats, the recognized text and OCR-related
+information co-exist in the same file and survive editing and manipulation.
+hOCR markup is independent of the presentation.
 
 There is a [Public Specification](http://docs.google.com/View?docid=dfxcv4vc_67g844kf) for the hOCR Format.
 
-# Available Programs
+### About the code
+
+Each command line program is self-contained; if you have the right
+Python packages installed, it should just work.  (Unfortunately, that
+means some code duplication; we may revisit this issue in later
+revisions.)
+
+### Pointers
+
+The format itself is defined here:
+
+http://docs.google.com/View?docID=dfxcv4vc_67g844kf&revision=_latest
+
+## Installation
+
+### System-wide
+
+On a Debian/Ubuntu system, install the dependencies from packages:
+
+```
+sudo apt-get install python-lxml python-reportlab python-pil \
+  python-beautifulsoup python-numpy python-scipy python-matplotlib
+```
+
+Or, to fetch dependencies from the [cheese shop](https://pypi.python.org/pypi):
+
+```
+sudo pip install -r requirements.txt  # basic
+```
+
+Then install the dist:
+
+```
+sudo python setup.py install
+```
+
+### Virtualenv
+
+Once
+
+```
+virtualenv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+Subsequently
+
+```
+source venv/bin/activate
+./hocr-...
+```
+
+## Available Programs
 
 Included command line programs:
 
-  * hocr-check -- check the hOCR file for errors
-  * hocr-combine -- combine pages in multiple hOCR files into a single document
-  * hocr-eval -- compute number of segmentation and OCR errors
-  * hocr-eval-geom -- compute over, under, and mis-segmentations
-  * hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth
-  * hocr-extract-images -- extract the images and texts within all the ocr_line elements
-  * hocr-lines -- extract the text within all the ocr_line elements
-  * hocr-pdf -- create a searchable PDF from a pile of hOCR and JPEG
-  * hocr-split -- split an hOCR file into individual pages
-  * hocr-merge-dc -- merge Dublin Core meta data into the hOCR HTML header
+### hocr-check
+
+```
+hocr-check file.html
+```
+
+Perform consistency checks on the hOCR file.
+
+### hocr-combine
+
+```
+hocr-combine file1.html file2.html...
+```
+
+Combine the OCR pages contained in each HTML file into a single document.
+The document metadata is taken from the first file.
+
+### hocr-eval-lines
+
+```
+hocr-eval-lines [-v] true-lines.txt hocr-actual.html
+```
+
+Evaluate hOCR output against ASCII ground truth.  This evaluation method
+requires that the line breaks in true-lines.txt and the ocr_line elements
+in hocr-actual.html agree (most ASCII output from OCR systems satisfies this
+requirement).
+
+### hocr-eval-geom
+
+```
+hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual
+```
+
+Compare the segmentations at the level of the element name (default: ocr_line).
+Computes undersegmentation, oversegmentation, and missegmentation.
+
+### hocr-eval
+
+```
+hocr-eval hocr-true.html hocr-actual.html
+```
+
+Evaluate the actual OCR with respect to the ground truth.  This outputs
+the number of OCR errors due to incorrect segmentation and the number
+of OCR errors due to character recognition errors.
+
+It works by aligning segmentation components geometrically, and for each
+segmentation component that can be aligned, computing the string edit distance
+of the text the segmentation component contains.
+
+### hocr-extract-g1000
+
+Extract lines from [Google 1000 book sample](http://commondatastorage.googleapis.com/books/icdar2007/README.txt)
+
+### hocr-extract-images
+
+TODO
+
+### hocr-lines
+
+TODO
+
+### hocr-merge-dc
+
+```
+hocr-merge-dc dc.xml hocr.html > hocr-new.html
+```
+
+Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.
+
+### hocr-pdf
+
+TODO
+
+### hocr-split
+
+```
+hocr-split file.html pattern
+```
+
+Split a multipage hOCR file into hOCR files containing one page each.
+The pattern should be something like "base-%03d.html".
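
Note on the hocr-split `pattern` argument documented above: it reads as a printf-style template. The following is a minimal sketch of how such a pattern expands, assuming the page number is substituted with Python's `%` operator and that numbering starts at 1. Both are assumptions that this diff does not confirm, and `output_names` is a hypothetical helper, not part of hocr-tools.

```python
def output_names(pattern, page_count, start=1):
    """Expand a printf-style pattern once per page (hypothetical helper)."""
    # "%03d" pads the page number with zeros to a width of three digits.
    return [pattern % n for n in range(start, start + page_count)]


print(output_names("base-%03d.html", 3))
```

Zero-padded numbers keep the output files in page order when they are sorted by name.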
diff --git a/hocr-check b/hocr-check
@@ -1,10 +1,9 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 
 # check the given file for conformance with the hOCR format spec
 
 import sys,os,string,re,getopt
-from xml.dom.ext.reader import HtmlLib
-from xml.xpath import Evaluate as xquery
+from lxml import html
 
 ################################################################
 ### misc library code
@@ -18,7 +17,7 @@ def assoc(key,list):
 ### node properties
 
 def get_prop(node,name):
-    title = node.getAttributeNS(None,'title')
+    title = node.get('title')
     if not title: return None
     props = title.split(';')
     for prop in props:
@@ -66,50 +65,50 @@ nooverlap = (assoc('-o',optlist)=='')
 if len(args)>0: stream = open(args[0])
 elif len(args)>1: raise "can only check one file at a time"
 else: stream = sys.stdin
-doc = HtmlLib.Reader().fromString(stream.read())
+doc = html.fromstring(stream.read())
 
 ################################################################
 ### XML structure checks
 ################################################################
 
 # check for presence of meta information
-assert xquery("//META[@name='ocr-id']",doc)!=[]
-assert xquery("//META[@name='ocr-recognized']",doc)!=[]
+assert doc.xpath("//meta[@name='ocr-id']")!=[]
+assert doc.xpath("//meta[@name='ocr-recognized']")!=[]
 
 # check for presence of page
-assert xquery("//*[@class='ocr_page']",doc)!=[]
+assert doc.xpath("//*[@class='ocr_page']")!=[]
 
 # check that lines are inside pages
-lines = xquery("//*[@class='ocr_line']",doc.documentElement)
+lines = doc.xpath("//*[@class='ocr_line']")
 for line in lines:
-    assert xquery("//*[@class='ocr_page']",line)
+    assert line.xpath("//*[@class='ocr_page']")
 
 # check that pars are inside pages
-pars = xquery("//*[@class='ocr_par']",doc.documentElement)
+pars = doc.xpath("//*[@class='ocr_par']")
 for par in pars:
-    assert xquery("//*[@class='ocr_page']",par)
+    assert par.xpath("//*[@class='ocr_page']")
 
 # check that columns are inside pages
-columns = xquery("//*[@class='ocr_column']",doc.documentElement)
+columns = doc.xpath("//*[@class='ocr_column']")
 for column in columns:
-    assert xquery("//*[@class='ocr_page']",column)
+    assert column.xpath("//*[@class='ocr_page']")
 
 ################################################################
 ### geometric checks
 ################################################################    
 
 if not nooverlap:
-    for page in xquery("//*[@class='ocr_page']",doc):
+    for page in doc.xpath("//*[@class='ocr_page']"):
         # check lines
-        objs = xquery("//*[@class='ocr_line']",page)
+        objs = page.xpath("//*[@class='ocr_line']")
         line_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
         assert mostly_nonoverlapping(line_bboxes)
         # check paragraphs
-        objs = xquery("//*[@class='ocr_par']",page)
+        objs = page.xpath("//*[@class='ocr_par']")
         par_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
         assert mostly_nonoverlapping(par_bboxes)
         # check columns
-        objs = xquery("//*[@class='ocr_column']",page)
+        objs = page.xpath("//*[@class='ocr_column']")
         column_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')]
         assert mostly_nonoverlapping(column_bboxes)
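
Note on the structure checks in hocr-check above: in lxml, an XPath expression that starts with `//` is evaluated from the document root even when it is called on an element. As a result, `line.xpath("//*[@class='ocr_page']")` succeeds whenever any page exists anywhere in the document. An `ancestor::` axis (for the containment checks) or a `.//` prefix (for the per-page queries) would scope the query to the node. The sketch below illustrates the containment check itself. It uses the standard library's `xml.etree.ElementTree` with an explicit parent map rather than lxml, which the script uses. The `orphans` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one line inside a page, one line outside any page.
SAMPLE = """<html><body>
<div class="ocr_page"><span class="ocr_line" id="a">inside</span></div>
<span class="ocr_line" id="b">outside</span>
</body></html>"""


def orphans(root, cls="ocr_line", container="ocr_page"):
    """Return elements of class `cls` that have no ancestor of class `container`."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: node for node in root.iter() for child in node}
    missing = []
    for el in root.iter():
        if el.get("class") != cls:
            continue
        # Walk upwards until a container is found or the root is passed.
        node = parent.get(el)
        while node is not None and node.get("class") != container:
            node = parent.get(node)
        if node is None:
            missing.append(el)
    return missing


root = ET.fromstring(SAMPLE)
print([el.get("id") for el in orphans(root)])
```

Like the script, the sketch compares the `class` attribute for exact equality, so an element carrying several class names would not match.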