diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..debb92a --- /dev/null +++ b/.gitignore @@ -0,0 +1,5 @@ +/venv +/*_venv +/build/ +/dist/ +/*.egg-info/ diff --git a/README b/README deleted file mode 100644 index 5d2c3a6..0000000 --- a/README +++ /dev/null @@ -1,72 +0,0 @@ -====== hocr-tools ====== - -Tools for manipulating and evaluating the hOCR microformat for -representing multi-lingual OCR results. - -hOCR is a format for representing OCR output, including layout -information, character confidences, bounding boxes, and style -information. It embeds this information invisibly in standard HTML. By -building on standard HTML, it automatically inherits well-defined support -for most scripts, languages, and common layout options. Furthermore, -unlike previous OCR formats, the recognized text and OCR-related -information co-exist in the same file and survives editing and -manipulation. hOCR markup is independent of the presentation. - -====== the programs ====== - -=== hocr-check file.html === - -Perform consistency checks on the hOCR file. - -=== hocr-combine file1.html file2.html... === - -Combine the OCR pages contained in each HTML file into a single document. -The document metadata is taken from the first file. - -=== hocr-split file.html pattern === - -Split a multipage hOCR file into hOCR files containing one page each. -The pattern should something like "base-%03d.html" - -=== hocr-eval-lines [-v] true-lines.txt hocr-actual.html === - -Evaluate hOCR output against ASCII ground truth. This evaluation method -requires that the line breaks in true-lines.txt and the ocr_line elements -in hocr-actual.html agree (most ASCII output from OCR systems satisfies this -requirement). - -=== hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual === - -Compare the segmentations at the level of the element name (default: ocr_line). -Computes undersegmentation, oversegmentation, and missegmentation. 
- -=== hocr-eval hocr-true.html hocr-actual.html === - -Evaluate the actual OCR with respect to the ground truth. This outputs -the number of OCR errors due to incorrect segmentation and the number -of OCR errors due to character recognition errors. - -It works by aligning segmentation components geometrically, and for each -segmentation component that can be aligned, computing the string edit distance -of the text the segmentation component contains. - -=== hocr-merge-dc dc.xml hocr.html > hocr-new.html === - -Merges the Dublin Core metadata into the hOCR file by encoding the data in its header. - -====== about the code ====== - -Each command line program is self contained; if you have the right -Python packages installed, it should just work. (Unfortunately, that -means some code duplication; we may revisit this issue in later -revisions.) - -====== pointers ====== - -The format itself is defined here: - - http://docs.google.com/View?docID=dfxcv4vc_67g844kf&revision=_latest - -The project is hosted here: - - https://github.com/tmbdev/hocr-tools diff --git a/README.md b/README.md index bb93493..e1f6315 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,174 @@ -# About +# hocr-tools -hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survives editing and manipulation. hOCR markup is independent of the presentation. 
+ * [About](#about) + * [About the code](#about-the-code) + * [Pointers](#pointers) + * [Installation](#installation) + * [System-wide](#system-wide) + * [Virtualenv](#virtualenv) + * [Available Programs](#available-programs) + * [hocr-check](#hocr-check) -- check the hOCR file for errors + * [hocr-combine](#hocr-combine) -- combine pages in multiple hOCR files into a single document + * [hocr-eval](#hocr-eval) -- compute number of segmentation and OCR errors + * [hocr-eval-geom](#hocr-eval-geom) -- compute over, under, and mis-segmentations + * [hocr-eval-lines](#hocr-eval-lines) -- compute OCR errors of hOCR output relative to text ground truth + * [hocr-extract-g1000](#hocr-extract-g1000) -- extract lines from Google 1000 book sample + * [hocr-extract-images](#hocr-extract-images) -- extract the images and texts within all the ocr_line elements + * [hocr-lines](#hocr-lines) -- extract the text within all the ocr_line elements + * [hocr-merge-dc](#hocr-merge-dc) -- merge Dublin Core meta data into the hOCR HTML header + * [hocr-pdf](#hocr-pdf) -- create a searchable PDF from a pile of hOCR and JPEG + * [hocr-split](#hocr-split) -- split an hOCR file into individual pages + +## About + +hOCR is a format for representing OCR output, including layout information, +character confidences, bounding boxes, and style information. +It embeds this information invisibly in standard HTML. +By building on standard HTML, it automatically inherits well-defined support +for most scripts, languages, and common layout options. +Furthermore, unlike previous OCR formats, the recognized text and OCR-related +information co-exist in the same file and survive editing and manipulation. +hOCR markup is independent of the presentation. There is a [Public Specification](http://docs.google.com/View?docid=dfxcv4vc_67g844kf) for the hOCR Format.
-# Available Programs +### About the code + +Each command line program is self contained; if you have the right +Python packages installed, it should just work. (Unfortunately, that +means some code duplication; we may revisit this issue in later +revisions.) + +### Pointers + +The format itself is defined here: + +http://docs.google.com/View?docID=dfxcv4vc_67g844kf&revision=_latest + +## Installation + +### System-wide + +On a Debian/Ubuntu system, install the dependencies from packages: + +``` +sudo apt-get install python-lxml python-reportlab python-pil \ + python-beautifulsoup python-numpy python-scipy python-matplotlib +``` + +Or, to fetch dependencies from the [cheese shop](https://pypi.python.org/pypi): + +``` +sudo pip install -r requirements.txt # basic +``` + +Then install the dist: + +``` +sudo python setup.py install +``` + +### Virtualenv + +Once + +``` +virtualenv venv +source venv/bin/activate +pip install -r requirements.txt +``` + +Subsequently + +``` +source venv/bin/activate +./hocr-... +``` + +## Available Programs Included command line programs: - * hocr-check -- check the hOCR file for errors - * hocr-combine -- combine pages in multiple hOCR files into a single document - * hocr-eval -- compute number of segmentation and OCR errors - * hocr-eval-geom -- compute over, under, and mis-segmentations - * hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth - * hocr-extract-images -- extract the images and texts within all the ocr_line elements - * hocr-lines -- extract the text within all the ocr_line elements - * hocr-pdf -- create a searchable PDF from a pile of hOCR and JPEG - * hocr-split -- split an hOCR file into individual pages - * hocr-merge-dc -- merge Dublin Core meta data into the hOCR HTML header +### hocr-check + +``` +hocr-check file.html +``` + +Perform consistency checks on the hOCR file. + +### hocr-combine + +``` +hocr-combine file1.html file2.html... 
+``` + +Combine the OCR pages contained in each HTML file into a single document. +The document metadata is taken from the first file. + +### hocr-eval-lines + +``` +hocr-eval-lines [-v] true-lines.txt hocr-actual.html +``` + +Evaluate hOCR output against ASCII ground truth. This evaluation method +requires that the line breaks in true-lines.txt and the ocr_line elements +in hocr-actual.html agree (most ASCII output from OCR systems satisfies this +requirement). + +### hocr-eval-geom + +``` +hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual +``` + +Compare the segmentations at the level of the element name (default: ocr_line). +Computes undersegmentation, oversegmentation, and missegmentation. + +### hocr-eval + +``` +hocr-eval hocr-true.html hocr-actual.html +``` + +Evaluate the actual OCR with respect to the ground truth. This outputs +the number of OCR errors due to incorrect segmentation and the number +of OCR errors due to character recognition errors. + +It works by aligning segmentation components geometrically, and for each +segmentation component that can be aligned, computing the string edit distance +of the text the segmentation component contains. + +### hocr-extract-g1000 + +Extract lines from [Google 1000 book sample](http://commondatastorage.googleapis.com/books/icdar2007/README.txt) + +### hocr-extract-images + +TODO + +### hocr-lines + +TODO + +### hocr-merge-dc + +``` +hocr-merge-dc dc.xml hocr.html > hocr-new.html +``` + +Merges the Dublin Core metadata into the hOCR file by encoding the data in its header. + +### hocr-pdf + +TODO + +### hocr-split + +``` +hocr-split file.html pattern +``` + +Split a multipage hOCR file into hOCR files containing one page each. 
+The pattern should be something like "base-%03d.html" diff --git a/hocr-check b/hocr-check index d61224b..f8a1bc4 100755 --- a/hocr-check +++ b/hocr-check @@ -1,10 +1,9 @@ -#!/usr/bin/python +#!/usr/bin/env python # check the given file for conformance with the hOCR format spec import sys,os,string,re,getopt -from xml.dom.ext.reader import HtmlLib -from xml.xpath import Evaluate as xquery +from lxml import html ################################################################ ### misc library code @@ -18,7 +17,7 @@ def assoc(key,list): ### node properties def get_prop(node,name): - title = node.getAttributeNS(None,'title') + title = node.get('title') if not title: return None props = title.split(';') for prop in props: @@ -66,50 +65,50 @@ nooverlap = (assoc('-o',optlist)=='') if len(args)>0: stream = open(args[0]) elif len(args)>1: raise "can only check one file at a time" else: stream = sys.stdin -doc = HtmlLib.Reader().fromString(stream.read()) +doc = html.fromstring(stream.read()) ################################################################ ### XML structure checks ################################################################ # check for presence of meta information -assert xquery("//META[@name='ocr-id']",doc)!=[] -assert xquery("//META[@name='ocr-recognized']",doc)!=[] +assert doc.xpath("//meta[@name='ocr-id']")!=[] +assert doc.xpath("//meta[@name='ocr-recognized']")!=[] # check for presence of page -assert xquery("//*[@class='ocr_page']",doc)!=[] +assert doc.xpath("//*[@class='ocr_page']")!=[] # check that lines are inside pages -lines = xquery("//*[@class='ocr_line']",doc.documentElement) +lines = doc.xpath("//*[@class='ocr_line']") for line in lines: - assert xquery("//*[@class='ocr_page']",line) + assert line.xpath("//*[@class='ocr_page']") # check that pars are inside pages -pars = xquery("//*[@class='ocr_par']",doc.documentElement) +pars = doc.xpath("//*[@class='ocr_par']") for par in pars: - assert xquery("//*[@class='ocr_page']",par) + assert
par.xpath("//*[@class='ocr_page']") # check that columns are inside pages -columns = xquery("//*[@class='ocr_column']",doc.documentElement) +columns = doc.xpath("//*[@class='ocr_column']") for column in columns: - assert xquery("//*[@class='ocr_page']",column) + assert column.xpath("//*[@class='ocr_page']") ################################################################ ### geometric checks ################################################################ if not nooverlap: - for page in xquery("//*[@class='ocr_page']",doc): + for page in doc.xpath("//*[@class='ocr_page']"): # check lines - objs = xquery("//*[@class='ocr_line']",page) + objs = page.xpath("//*[@class='ocr_line']") line_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')] assert mostly_nonoverlapping(line_bboxes) # check paragraphs - objs = xquery("//*[@class='ocr_par']",page) + objs = page.xpath("//*[@class='ocr_par']") par_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')] assert mostly_nonoverlapping(par_bboxes) # check columns - objs = xquery("//*[@class='ocr_column']",page) + objs = page.xpath("//*[@class='ocr_column']") column_bboxes = [get_bbox(obj) for obj in objs if get_prop(obj,'bbox')] assert mostly_nonoverlapping(column_bboxes) diff --git a/hocr-combine b/hocr-combine index e7e2782..767bfb7 100755 --- a/hocr-combine +++ b/hocr-combine @@ -1,18 +1,7 @@ -#!/usr/bin/python +#!/usr/bin/env python import sys,os,string,re -import xml -from xml.dom.ext.reader import HtmlLib -from xml.xpath import Evaluate as xquery - -################################################################ -### library code -################################################################ - -def get_text(node): - textnodes = xquery(".//text()",node) - s = string.join([node.nodeValue for node in textnodes]) - return re.sub(r'\s+',' ',s) +from lxml import html, etree ################################################################ ### main program @@ -23,20 +12,16 @@ if len(sys.argv)<2: 
sys.stderr.write("usage: %s file1.html file2.html...\n"%sys.argv[0]) sys.exit(1) -stream = open(sys.argv[1]) -doc = HtmlLib.Reader().fromString(open(sys.argv[1]).read()) +doc = html.fromstring(open(sys.argv[1]).read()) -pages = xquery("//*[@class='ocr_page']",doc) -container = pages[-1].parentNode +pages = doc.xpath("//*[@class='ocr_page']") +container = pages[-1].getparent() for fname in sys.argv[2:]: - doc2 = HtmlLib.Reader().fromString(open(fname).read()) - pages = xquery("//*[@class='ocr_page']",doc2) + doc2 = html.fromstring(open(fname).read()) + pages = doc2.xpath("//*[@class='ocr_page']") for page in pages: page = doc.importNode(page,1) - container.appendChild(page) - -xml.dom.ext.PrettyPrint(doc,sys.stdout) + container.append(page) - - +print(etree.tostring(doc, pretty_print=True)) diff --git a/hocr-eval b/hocr-eval index acd0e10..cb2f2ee 100755 --- a/hocr-eval +++ b/hocr-eval @@ -1,11 +1,11 @@ -#!/usr/bin/python +#!/usr/bin/env python # -*- coding: utf-8 -*- # compute statistics about the quality of the geometric segmentation # at the level of the given OCR element import sys,os,codecs,string,re,getopt -import Image,ImageDraw +from PIL import Image,ImageDraw import xml from BeautifulSoup import BeautifulSoup from pylab import array,zeros,reshape diff --git a/hocr-eval-geom b/hocr-eval-geom index 6b0e2fa..63072b9 100755 --- a/hocr-eval-geom +++ b/hocr-eval-geom @@ -1,4 +1,4 @@ -#!/usr/bin/python +#!/usr/bin/env python # compute statistics about the quality of the geometric segmentation # at the level of the given OCR element diff --git a/hocr-eval-lines b/hocr-eval-lines index 4355507..153daf5 100755 --- a/hocr-eval-lines +++ b/hocr-eval-lines @@ -1,4 +1,4 @@ -#!/usr/bin/python +#!/usr/bin/env python # compute statistics about the quality of the geometric segmentation # at the level of the given OCR element diff --git a/hocr-extract-g1000 b/hocr-extract-g1000 index b9aacea..8475f96 100755 --- a/hocr-extract-g1000 +++ b/hocr-extract-g1000 @@ -1,4 +1,4 @@ 
-#!/usr/bin/python +#!/usr/bin/env python # extract lines from Google 1000 book sample diff --git a/hocr-extract-images b/hocr-extract-images index c7263d6..625b993 100755 --- a/hocr-extract-images +++ b/hocr-extract-images @@ -1,12 +1,10 @@ -#!/usr/bin/python +#!/usr/bin/env python -# extract the text within all the ocr_line elements within the hOCR file +# extract the images and texts within all the ocr_line elements within the hOCR file -import sys,os,string,re,getopt -import xml +import sys,os,string,re,getopt,codecs from PIL import Image -from xml.dom.ext.reader import HtmlLib -from xml.xpath import Evaluate as xquery +from lxml import html def assoc(key,list): for k,v in list: @@ -14,20 +12,21 @@ def assoc(key,list): return None def get_text(node): - textnodes = xquery(".//text()",node) - s = string.join([node.nodeValue for node in textnodes]) + textnodes = node.xpath('.//text()') + s = string.join([text for text in textnodes]) return re.sub(r'\s+',' ',s) def get_prop(node,name): - title = node.getAttributeNS(None,'title') + title = node.get("title") props = title.split(';') for prop in props: (key,args) = prop.split(None,1) + args = args.strip('"') if key==name: return args return None if len(sys.argv)<2 and sys.stdin.isatty(): - print "usage: %s [-b image-dir] [-p file-pattern] [-e element-name] [hocr-file]"%sys.argv[0] + print "usage: %s [-b image-dir] [-p file-pattern] [-e element-name] hocr-file"%sys.argv[0] sys.exit(0) optlist,args = getopt.getopt(sys.argv[1:],"b:p:e:") print args @@ -36,14 +35,20 @@ basename = assoc('-b',optlist) pattern = assoc('-p',optlist) or 'line-%03d.png' element = assoc('-e',optlist) or 'ocr_line' +tpattern = pattern + '.txt' +if pattern[-4] == '.': + tpattern = pattern[:-3] + 'txt' + if len(args)>1: raise "too many args" if len(args)==1: stream = open(args[0]) else: stream = sys.stdin -doc = HtmlLib.Reader().fromString(stream.read()) -pages = xquery("//*[@class='ocr_page']",doc) +doc = html.fromstring(stream.read()) +pages = 
doc.xpath('//*[@class="ocr_page"]') for page in pages: iname = get_prop(page,'file') + if not iname: + iname = get_prop(page, 'image') if basename: iname = os.path.join(basename,os.path.basename(iname)) if not os.path.exists(iname): @@ -51,7 +56,7 @@ for page in pages: sys.exit(1) image = Image.open(iname) print image - lines = xquery("//*[@class='%s']"%element,page) + lines = page.xpath("//*[@class='%s']"%element) lcount = 1 for line in lines: bbox = [int(x) for x in get_prop(line,'bbox').split()] @@ -59,4 +64,7 @@ for page in pages: assert bbox[1]1: stream = open(sys.argv[1]) else: stream = sys.stdin -doc = HtmlLib.Reader().fromString(stream.read()) -lines = xquery("//*[@class='ocr_line']",doc.documentElement) +doc = html.fromstring(stream.read()) +lines = doc.xpath("//*[@class='ocr_line']") for line in lines: print get_text(line) diff --git a/hocr-merge-dc b/hocr-merge-dc index 0fca87a..ffd4a08 100755 --- a/hocr-merge-dc +++ b/hocr-merge-dc @@ -1,4 +1,4 @@ -#!/usr/bin/python +#!/usr/bin/env python import sys,os,string,re import xml diff --git a/hocr-pdf b/hocr-pdf index edd085f..93eedff 100755 --- a/hocr-pdf +++ b/hocr-pdf @@ -1,4 +1,4 @@ -#!/usr/bin/python +#!/usr/bin/env python # # Copyright 2013 Google Inc. All Rights Reserved. 
# diff --git a/hocr-split b/hocr-split index 74910b8..fa798ed 100755 --- a/hocr-split +++ b/hocr-split @@ -1,4 +1,4 @@ -#!/usr/bin/python +#!/usr/bin/env python # split an hOCR file into individual pages diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..9136776 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,4 @@ +BeautifulSoup==3.2.1 +Pillow==3.1.1 +lxml==3.5.0 +reportlab==3.3.0 diff --git a/setup.py b/setup.py new file mode 100755 index 0000000..98d5d53 --- /dev/null +++ b/setup.py @@ -0,0 +1,11 @@ +#!/usr/bin/env python + +import glob +from setuptools import setup +setup( + name = "hocr_tools", + version = "0.1", + author = 'Thomas Breuel', + description = 'Advanced tools for hOCR integration', + scripts = [c for c in glob.glob("hocr-*")] +) diff --git a/tests/alice_1.png b/tests/alice_1.png new file mode 100644 index 0000000..a1a1f59 Binary files /dev/null and b/tests/alice_1.png differ diff --git a/dcsample.xml b/tests/dcsample.xml similarity index 100% rename from dcsample.xml rename to tests/dcsample.xml diff --git a/dcsample2.xml b/tests/dcsample2.xml similarity index 100% rename from dcsample2.xml rename to tests/dcsample2.xml diff --git a/tests/run.sh b/tests/run.sh new file mode 100755 index 0000000..198e82e --- /dev/null +++ b/tests/run.sh @@ -0,0 +1,5 @@ +#!/bin/sh + +../hocr-check sample.html +../hocr-extract-images -p "words-from-test-%03d.png" -e "ocrx_word" tess.hocr +rm words-from-test* diff --git a/sample.html b/tests/sample.html similarity index 98% rename from sample.html rename to tests/sample.html index de39aba..51ab60f 100644 --- a/sample.html +++ b/tests/sample.html @@ -7,7 +7,7 @@ -
+

1 Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the bank, diff --git a/sample.txt b/tests/sample.txt similarity index 100% rename from sample.txt rename to tests/sample.txt diff --git a/tests/tess.hocr b/tests/tess.hocr new file mode 100644 index 0000000..291d708 --- /dev/null +++ b/tests/tess.hocr @@ -0,0 +1,116 @@ + + + + + + + + + + + +
+
+

+ 1 Down the Rabbit-Hole + +

+
+
+

+ Alice was beginning to get very tired of sitting by her sister on the bank, + + and of having nothing to do: once or twice she had peeped into the book her + + sister was reading, but it had no pictures or conversations in it, ‘and what is + + the use of a book,’ thought Alice ‘Without pictures or conversation?’ + +

+ +

+ So she was considering in her own mind (as well as she could, for the hot + + day made her feel very sleepy and stupid), whether the pleasure of making a + + daisy-chain would be worth the trouble of getting up and picking the daisies, + + when suddenly a White Rabbit with pink eyes ran close by her. + +

+ +

+ There was nothing so VERY remarkable in that; nor did Alice think it so + + VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh + + dear! I shall be late!’ (when she thought it over afterwards, it occurred to + + her that she ought to have wondered at this, but at the time it all seemed + + quite natural); but when the Rabbit actually TOOK A WATCH OUT OF + + ITS WAISTCOAT— POCKET, and looked at it, and then hurried on, Alice + + started to her feet, for it flashed across her mind that she had never before + + seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and + + burning with curiosity, she ran across the field after it, and fortunately was + + just in time to see it pop down a large rabbit-hole under the hedge. + +

+ +

+ In another moment down went Alice after it, never once considering how + + in the world she was to get out again. + +

+ +

+ The rabbit-hole went straight on like a tunnel for some way, and then + + dipped suddenly down, so suddenly that Alice had not a moment to think + + about stopping herself before she found herself falling down a very deep well. + +

+ +

+ Either the well was very deep, or she fell very slowly, for she had plenty + + of time as she went down to look about her and to wonder what was going + + to happen next. First, she tried to look down and make out what she was + + coming to, but it was too dark to see anything; then she looked at the sides + + of the well, and noticed that they were filled with cupboards and book- + + shelves; here and there she saw maps and pictures hung upon pegs. She took + + down a jar from one of the shelves as she passed; it was labelled ‘ORANGE + + MARMALADE’, but to her great disappointment it was empty: she did not + + like to drop the jar for fear of killing somebody, so managed to put it into + + one of the cupboards as she fell past it. + +

+ +

+ ‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall think + + nothing of tumbling down stairs! How brave they’ll all think me at home! + + Why, I wouldn’t say anything about it, even if I fell off the top of the house!’ + +

+
+
+ +