Support for non scanned documents (.doc, .docx, regular pdf) #15

rcatajar · 2016-09-05T15:31:19Z

Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.

Would you be interested in a PR adding text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

rcatajar · 2016-09-05T15:32:20Z

We already have a bunch of code for that at my company that I'm going to refactor, so I can provide a PR. (We mainly use textract (http://textract.readthedocs.io/en/latest/) with a few tricks.)

jlsutherland · 2016-09-05T18:01:12Z

Hi @rcatajar, thank you for the compliment, and thank you for your contributions!

Yes, I think that would be very useful, and I would be interested in a PR.

I have a few ideas on how we might fold in the code. For instance, it could be useful to see if a document has any (readable) extractable embedded text before doing the transformations.

Do you think you could put something together?
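
A rough sketch of the "check for readable embedded text first" idea, in case it helps frame the PR. Everything here is an assumption for illustration, not doc2text's current API: `looks_readable`, `extract_text`, the thresholds, and the two callables (`extract_embedded`, which could wrap something like a textract-based extractor, and `run_ocr`, which would stand in for the existing OCR pipeline) are all hypothetical names.

```python
import string


def looks_readable(text, min_chars=50, min_ratio=0.8):
    """Heuristic check: is this text long enough and mostly 'normal' characters?

    The thresholds are arbitrary starting points, not tuned values.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count characters that are letters, digits, whitespace, or ASCII punctuation.
    ok = sum(
        ch.isalnum() or ch.isspace() or ch in string.punctuation
        for ch in stripped
    )
    return ok / len(stripped) >= min_ratio


def extract_text(path, extract_embedded, run_ocr):
    """Try embedded text first; fall back to OCR if it is missing or garbled.

    `extract_embedded` and `run_ocr` are hypothetical callables that take a
    file path and return a string.
    """
    try:
        text = extract_embedded(path)
    except Exception:
        # Any extraction failure is treated as "no embedded text".
        text = ""
    if looks_readable(text):
        return text
    return run_ocr(path)
```

Passing the two extractors in as callables keeps the fallback logic independent of whichever extraction library ends up being used.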

rcatajar · 2016-09-07T09:08:54Z

I have a busy week, but I'll take a look and submit a PR by the end of the week.

jlsutherland · 2016-09-09T12:56:47Z

Hey @rcatajar, wanted to check in. How's it coming?
