ADD script to create a simplified version of hocr-files #152

Open
wants to merge 15 commits into
base: master

@JKamlah JKamlah commented Jul 26, 2019

A script to create a simplified version of hocr-files.
It contains two main functions:

  • set a new maximum level of typesetting and remove the lower ones
  • remove unneeded properties

Collaborator

@zuphilip zuphilip left a comment

I think this fits well into the scope of the hocr-tools and looks good in general. Thank you @JKamlah for this nice PR! CC @kba @stweil @tmbdev for comments about a new script.

Some comments from my review are below, and we may want to test it a bit further. We may also have to do something more about words that are split across two lines by a hyphen. Moreover, if the file contains glyph information and alternatives, the text content may repeat some words.

Finally, the README would need to be updated as well.

hocr-simplify Outdated
if key in args.remove_properties:
    if args.verbose:
        print("Replaced :{}".format(title))
    title = title.replace(prop + ";", "").strip()
Collaborator

This does not work when the property is the last one (no semi-colon then).
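A minimal sketch of this failure mode, using a made-up title string rather than the PR's test data. The helper name `remove_with_semicolon` is hypothetical and only mimics the reviewed line:

```python
# Hypothetical hOCR title string; the last property has no trailing ";".
title = "bbox 10 20 30 40; x_wconf 95"


def remove_with_semicolon(title, prop):
    """Mimic the reviewed line: only matches when prop is followed by ';'."""
    return title.replace(prop + ";", "").strip()


# A property followed by ";" is removed as expected.
first = remove_with_semicolon(title, "bbox 10 20 30 40")

# The last property is left untouched because "x_wconf 95;" never occurs.
last = remove_with_semicolon(title, "x_wconf 95")

print(repr(first))
print(repr(last))
```

Here `first` loses the `bbox` property, while `last` is identical to the input.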

Collaborator

Alternatively, you can also try something like this, which looks much shorter (code not yet tested):

title = node.get("title")
title = re.sub(r"\s*\b(?:%s)\s+[^;]*;?" % "|".join(args.remove_properties), "", title)

BTW don't you have to save it back in the doc somehow?
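One possible variant of the regex idea, including the write-back to the document, could look like the sketch below. It is not the PR's code: the helper `strip_properties` and the sample markup are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. It also assumes that property values contain no semicolons.

```python
import re
import xml.etree.ElementTree as ET


def strip_properties(title, names):
    """Remove each named property and its value from an hOCR title string."""
    if not names:
        return title
    alternation = "|".join(re.escape(name) for name in names)
    # Match the property together with its preceding separator, so that
    # removing the last property leaves no dangling ";".
    pattern = re.compile(r"(?:^|;)\s*(?:%s)\b[^;]*" % alternation)
    # If the first property was removed, a leading "; " remains; drop it.
    return pattern.sub("", title).lstrip("; ")


# Hypothetical one-word document, not taken from the PR's test data.
HOCR = ('<div><span class="ocrx_word" '
        'title="bbox 10 20 30 40; x_wconf 95">word</span></div>')

root = ET.fromstring(HOCR)
for node in root.findall(".//*[@title]"):
    # node.set() stores the new value in the tree, i.e. "saves it back".
    node.set("title", strip_properties(node.get("title"), ["x_wconf"]))

print(ET.tostring(root, encoding="unicode"))
```

Anchoring the match at the start of the string or at a semicolon keeps a name such as `pageno` from matching inside `ppageno`.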

Contributor

Collaborator

Yeah, but we don't need to parse it in detail; we just have to delete the parameters that are no longer needed, together with their values.

Author

Thanks for the suggestions. I reworked this part, but without a regexp. I also had to replace the double quotation marks with single ones.

node.set('title', ';'.join([prop.replace("\"","'") for prop in title.split(";") if prop.strip().split(None, 1)[0] not in args.remove_properties]))
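The same split-and-filter idea can be written out as a small function. This is only a sketch: the name `filter_title` is hypothetical, and the quote replacement from the one-liner is left out. One difference is that empty segments are skipped explicitly, since `"".split(None, 1)[0]` raises an `IndexError`, which a title with a trailing semicolon would trigger.

```python
def filter_title(title, remove_properties):
    """Keep only the properties whose name is not in remove_properties."""
    kept = []
    for prop in title.split(";"):
        parts = prop.split(None, 1)
        if not parts:
            # Empty segment, e.g. after a trailing ";": nothing to keep.
            continue
        if parts[0] not in remove_properties:
            kept.append(prop.strip())
    return "; ".join(kept)


# Hypothetical title strings, the first one with a trailing semicolon.
with_trailing = filter_title("bbox 1 2 3 4; x_wconf 95;", {"x_wconf"})
with_quotes = filter_title('image "p.png"; bbox 0 0 1 1', {"image"})

print(with_trailing)
print(with_quotes)
```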

hocr-simplify Outdated

parser.add_argument('file', nargs='?', default=sys.stdin)
parser.add_argument('-t', '--typesetting', type=str,
                    choices=['glyph', 'word', 'line', 'par', 'carea', 'page'],
Collaborator

Does the choice glyph do anything for simplification? I haven't seen an hOCR example where there was an element inside an ocr-glyph.

Author

I thought I would need it to remove char choices, but I've implemented that in another place, so I removed the "glyph" typesetting option.

hocr-simplify Outdated
parser.add_argument('-r', '--remove-properties', nargs='+',
                    help='List of properties: {}'.format(','.join(properties)))
parser.add_argument('fileout', nargs='?',
                    help="Outputpath, default: print to terminal")
Collaborator

s/Outputpath/Output path/

Collaborator

(Also in the comment below.)

Author

Solved.

hocr-simplify Outdated
for node in doc.xpath("//*[@title]"):
    title = node.get("title")
    for prop in title.split(";"):
        (key, args) = prop.strip().split(None, 1)
Collaborator

Why do you use None here and not the white-space character to split key and value?

Author

To be fair, I took this part from hocr-cut.
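For reference, the two separators behave differently in Python. A short sketch with made-up property strings: `split(None, 1)` treats any run of whitespace (including tabs) as one separator, whereas `split(" ", 1)` splits at the first single space only.

```python
# Hypothetical property strings with irregular whitespace.
spaced = "bbox   10 20 30 40"   # three spaces after the key
tabbed = "bbox\t10 20 30 40"    # a tab after the key

# sep=None: any run of whitespace counts as a single separator.
spaced_none = spaced.split(None, 1)
tabbed_none = tabbed.split(None, 1)

# sep=" ": only a single space character separates fields.
spaced_space = spaced.split(" ", 1)
tabbed_space = tabbed.split(" ", 1)

print(spaced_none, spaced_space)
print(tabbed_none, tabbed_space)
```

With `None`, both inputs yield the key `bbox` and the clean value `10 20 30 40`. With `" "`, the value keeps its leading spaces in the first case, and the key becomes `bbox\t10` in the second.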

hocr-simplify Outdated
if args.verbose:
    print("Replaced :{}".format(title))
title = title.replace(prop + ";", "").strip()

Collaborator

We also have to update the ocr-capabilities meta tag.

Author

Solved.
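How the PR solved this is not shown in the thread. As a rough sketch of what such an update might involve, assuming a `meta` element named `ocr-capabilities` whose `content` is a space-separated list of class names, and assuming no XML namespace on the elements (real hOCR files are often namespaced XHTML); the helper `drop_capabilities` and the sample head are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free head element for illustration only.
HOCR_HEAD = (
    "<head>"
    "<meta name='ocr-capabilities' "
    "content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>"
    "</head>"
)


def drop_capabilities(head, removed):
    """Remove the given class names from the ocr-capabilities meta tag."""
    for meta in head.iter("meta"):
        if meta.get("name") == "ocr-capabilities":
            kept = [c for c in meta.get("content", "").split()
                    if c not in removed]
            meta.set("content", " ".join(kept))


head = ET.fromstring(HOCR_HEAD)
# Assume the simplification dropped everything below the line level.
drop_capabilities(head, {"ocrx_word"})
content = head.find("meta").get("content")
print(content)
```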

'imagemd5', 'lpageno', 'ppageno', 'nlp', 'order', 'poly',
'scan_res', 'textangle', 'x_bboxes', 'x_font', 'x_fsize',
'x_confs', 'x_scanner', 'x_source', 'x_wconf']

Collaborator

It would also be nice to have an option to delete the id and/or dir parameters, but those are attributes of their own rather than part of the title.

Author

Removing attributes is now implemented.
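The PR's implementation is not quoted here. A minimal sketch of attribute removal, with a hypothetical helper `remove_attributes` and made-up markup, using the standard library's `xml.etree.ElementTree` instead of lxml:

```python
import xml.etree.ElementTree as ET


def remove_attributes(root, names):
    """Delete the named attributes from root and all of its descendants."""
    for node in root.iter():
        for name in names:
            # pop() with a default is a no-op when the attribute is absent.
            node.attrib.pop(name, None)


# Hypothetical page fragment, not taken from the PR's test data.
page = ET.fromstring(
    '<div id="page_1" dir="ltr" class="ocr_page">'
    '<span id="word_1" class="ocrx_word">text</span>'
    '</div>'
)
remove_attributes(page, ["id", "dir"])
print(ET.tostring(page, encoding="unicode"))
```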

TESTDATA="../testdata"
SIMPLEFILE="./tess.simple.hocr"

plan 5
Collaborator

@zuphilip zuphilip Jul 27, 2019

That is the number of test cases, i.e. it should be 2 here.

Author

Changed Plan 5 to Plan 3. I added two more test cases with the new char choice options.

Contributor

@kba kba left a comment

I would appreciate some documentation to understand the use cases better. Some more examples would make it easier to test more extensively and to catch edge cases like the ones @zuphilip lists.

But in general, it LGTM.

@zuphilip
Collaborator

One use case is to make the hOCR output of tesseract and ocropy look more alike. Then, in a complex workflow where you used ocropy before, you can also use tesseract + hocr-simplify instead.

Author

@JKamlah JKamlah left a comment

Thank you for the great review. I hope I have fixed all the problems mentioned. In the new version I added some new features:

  • remove attributes
  • remove empty contents
  • remove choices

The idea behind simple hOCR is, as @zuphilip said, to make the outputs look more alike, to reduce the file size to what is actually needed, and to allow deriving a new version without running the OCR again. For example, this could be handy for someone who works with tesseract outputs that contain char choices.
