it'd be nice if this could produce text-overlaid PDFs #10

Open
jbothma opened this issue Aug 31, 2016 · 7 comments

Comments

@jbothma

jbothma commented Aug 31, 2016

tesseract seems to be able to produce PDFs these days with text overlaid on the image. This is useful for searching in the PDF when viewing it that way.

It'd be nice if this could produce nice de-skewed PDFs

@jlsutherland
Owner

Definitely. I think it would be relatively straightforward to integrate. I would suggest building the text insertion into the Page class and then putting an export_to_pdf() method on the Document class.

@jlsutherland
Owner

Would you be interested in contributing, @jbothma?

@jbothma
Author

jbothma commented Sep 2, 2016

Yup - would love to. Won't get to it before next week but will start a PR when I can :)

It's part of the ocr command as an optional output format so not sure what the right place would be to integrate it with doc2text.

@jlsutherland
Owner

Awesome, thank you!

The method's location in the code would depend on the way tesseract embeds that data. Does tesseract insert the data into a PDF, or is it in a separate state that contains the text and placement information?

In the first case, we would need the method you mentioned that produces a nicely optimized pdf from the images first, then the embedding second. We need this method regardless, I think. In the second case, we could run the tesseract embed method at any time after we produce the fixed image crop.

Thoughts?

@jbothma
Author

jbothma commented Sep 15, 2016

So this is basically what I was talking about.

  • doc2text's existing functionality to straighten, flatten and normalise would run first,
  • produce a multipage tif or whatever,
  • then give it to tesseract to OCR with the pdf config file (for pdf output).

wget http://mfma.treasury.gov.za/MFMA/Urban%20Development%20Zones/Gazette%20No.%2026866.pdf
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif  Gazette\ No.\ 26866.pdf
tesseract test.tif outbase pdf

produces https://www.scribd.com/document/324084564/Out-Base

@jbothma
Author

jbothma commented Sep 15, 2016

Tesseract produces the PDF already, so you'd select that as the output format of the OCR step. There's no intermediate hOCR or anything.
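
A minimal sketch of how that OCR step might be driven from Python, assuming the tesseract executable is on the PATH and the image has already been straightened and cropped. The helper names `tesseract_pdf_command` and `ocr_to_pdf` are hypothetical and are not part of doc2text; the only tesseract behaviour relied on is the `tesseract <image> <outputbase> pdf` invocation shown in this thread.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, output_base):
    """Build the argv for `tesseract <image> <outputbase> pdf`.

    The trailing `pdf` selects tesseract's pdf config file, so the
    text-overlaid result is written to `<outputbase>.pdf`.
    """
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def ocr_to_pdf(image_path, output_base):
    """Run tesseract on an already-cleaned image and return the PDF path.

    Hypothetical helper for illustration only, not doc2text's API.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits non-zero.
    subprocess.run(tesseract_pdf_command(image_path, output_base), check=True)
    return Path(f"{output_base}.pdf")
```

Calling `ocr_to_pdf("test.tif", "outbase")` would run the same tesseract command as in the earlier comment and hand back the path `outbase.pdf`. Where such a call would live in doc2text (for example behind the suggested export_to_pdf() method) is still an open design question.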
