
Calculate font size

Philipp Zumstein edited this page Apr 23, 2017 · 1 revision

Tesseract's hOCR output provides some information about the x_size (see x-height) of the recognized text (together with information about ascenders and descenders). It is also possible to activate hocr_font_info to get some information about the font size as well. However, the font size is then rounded to an integer, which is not always what one wants.

The font size calculation is simple and needs no other information, so we can redo it from the hOCR file (which carries an x_size parameter for each line) without any rounding: multiply x_size by 72 and divide by the image resolution in dpi. Cf. this simple hack for a 600 dpi image:

$ perl -ne 'print("$1 ", $2*72/600, "\n") if /^.*id=.([^ ]*). .*x_size ([0-9.]*);.*$/' h7.html
line_1_1 8.62807344
line_1_2 7.08
line_1_3 6.36
line_1_4 6.36
line_1_5 6.36
line_1_6 6.35710104
line_1_7 6.48
line_1_8 6.36
line_1_9 6.24
line_1_10 6.36
...

If your image has a different resolution, replace the "600" above with its value in dpi.
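
For reuse in a script, the same calculation can be sketched in Python. This is a minimal sketch, not part of the original hack: the function name font_sizes and its dpi parameter are made up for illustration, and the regular expression assumes that each line element's id attribute comes before the title attribute holding x_size, as in the output shown above.

```python
import re

# Capture the id and the x_size value of an hOCR element.
# Assumption: the id attribute precedes the title attribute that
# carries x_size, and both sit inside the same tag (no ">" between).
LINE_RE = re.compile(r"""id=['"]([^'"]+)['"][^>]*?x_size\s+([0-9.]+)""")


def font_sizes(hocr_text, dpi=600):
    """Return (element_id, font_size_in_points) pairs, unrounded.

    dpi is the resolution of the scanned image; 72 is the number
    of points per inch.
    """
    return [
        (element_id, float(x_size) * 72 / dpi)
        for element_id, x_size in LINE_RE.findall(hocr_text)
    ]
```

Calling font_sizes(open("h7.html").read(), dpi=600) would then return the pairs that the one-liner prints. For instance, an x_size of 59 at 600 dpi corresponds to 59 * 72 / 600 = 7.08 points.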

Source: http://stackoverflow.com/questions/43531282/getting-exact-font-size-in-hocr-output/
