Support for non scanned documents (.doc, .docx, regular pdf) #15
Comments
We already have a bunch of code for that at my company that I'm going to refactor, so I can provide a PR. (We mainly use textract (http://textract.readthedocs.io/en/latest/) with a few tricks.)
Hi @rcatajar, thank you for the compliment, and thank you for your contributions! Yes, I think that would be very useful, and I would be interested in a PR. I have a few ideas on how we might fold in the code. For instance, it could be useful to check whether a document has any readable, extractable embedded text before doing the transformations. Do you think you could put something together?
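To make the "check for embedded text first" idea concrete, here is a minimal sketch of how it might look. Everything in it is an assumption for illustration, not doc2text's actual API: the names `looks_readable` and `extract_text`, the thresholds, and the two callables `extract_embedded` and `run_ocr` passed in by the caller are all hypothetical.

```python
def looks_readable(text, min_chars=50, min_ratio=0.8):
    """Heuristic (assumed thresholds): the text is long enough and is
    mostly letters, digits, and whitespace rather than binary debris."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    ok = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return ok / len(stripped) >= min_ratio


def extract_text(path, extract_embedded, run_ocr):
    """Try embedded text first and fall back to OCR.

    extract_embedded and run_ocr are hypothetical callables that take a
    file path and return a str; they are supplied by the caller.
    Returns a (text, source) tuple where source is "embedded" or "ocr".
    """
    try:
        embedded = extract_embedded(path)
    except Exception:
        # Treat any extraction failure as "no embedded text available".
        embedded = ""
    if looks_readable(embedded):
        return embedded, "embedded"
    return run_ocr(path), "ocr"
```

One possible choice for `extract_embedded` would be a thin wrapper such as `lambda p: textract.process(p).decode("utf-8")`, with the existing OCR pipeline wrapped as `run_ocr`. Passing both in as callables keeps the decision logic testable without either dependency installed.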
I have a busy week, but I'll take a look and submit a PR by the end of the week.
Hey @rcatajar, wanted to check in. How's it coming?
Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.
Would you be interested in a PR adding text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.