Support for non scanned documents (.doc, .docx, regular pdf) #15

Open
rcatajar opened this issue Sep 5, 2016 · 4 comments

Comments

@rcatajar
Contributor

rcatajar commented Sep 5, 2016

Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.

Would you be interested in a PR that adds text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

@rcatajar
Contributor Author

rcatajar commented Sep 5, 2016

We already have a fair amount of code for that at my company, which I'm going to refactor, so I can provide a PR. (We mainly use textract (http://textract.readthedocs.io/en/latest/) with a few tricks.)

@jlsutherland
Owner

Hi @rcatajar, thank you for the compliment, and thank you for your contributions!

Yes, I think that would be very useful, and I would be interested in a PR.

I have a few ideas on how we might fold in the code. For instance, it could be useful to see if a document has any (readable) extractable embedded text before doing the transformations.
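A rough sketch of that check, in Python, under stated assumptions. Nothing here is existing doc2text code: `looks_readable`, `extract_text`, the thresholds, and the two callables `read_embedded` and `run_ocr` are all hypothetical names chosen for illustration. `read_embedded` could be, for example, a thin wrapper around textract, and `run_ocr` would stand in for the current OCR pipeline.

```python
import string
from typing import Callable


def looks_readable(text: str, min_chars: int = 50, min_ratio: float = 0.8) -> bool:
    """Heuristic guess at whether extracted text is usable as-is.

    The thresholds are arbitrary assumptions, not tuned values: require at
    least `min_chars` non-whitespace characters, and require that at least
    `min_ratio` of them are alphanumeric or ASCII punctuation.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < min_chars:
        return False
    ordinary = sum(1 for c in chars if c.isalnum() or c in string.punctuation)
    return ordinary / len(chars) >= min_ratio


def extract_text(
    path: str,
    read_embedded: Callable[[str], str],
    run_ocr: Callable[[str], str],
) -> str:
    """Prefer embedded text when it looks readable; otherwise fall back to OCR.

    Both callables are placeholders supplied by the caller.
    """
    try:
        embedded = read_embedded(path)
    except Exception:
        # Assumption: any extraction failure simply means "no embedded text".
        embedded = ""
    if looks_readable(embedded):
        return embedded
    return run_ocr(path)
```

Passing the extractor and the OCR step in as callables keeps the check independent of any particular library, so the heuristic can be tested on its own and swapped out later.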

Do you think you could put something together?

@rcatajar
Contributor Author

rcatajar commented Sep 7, 2016

I have a busy week, but I'll take a look and submit a PR by the end of the week.

@jlsutherland
Owner

Hey @rcatajar, wanted to check in. How's it coming?
