Textract dependency issue; Wagtail version dependency #22

Open
DanielSwain opened this issue Apr 9, 2019 · 5 comments

Comments

@DanielSwain
Contributor

DanielSwain commented Apr 9, 2019

I’m working to set up Wagtail Textract. I use pipenv and was getting package mismatch errors because Textract on PyPI has not been updated to match the latest code in the repo at https://github.com/deanmalmgren/textract (there was a chardet dependency error). However, @deanmalmgren’s repo DOES have an updated chardet dependency (3.0.4, the latest at this point), so I was able to get around all but one of the errors by installing directly from the repo:
pip install git+https://github.com/deanmalmgren/textract.git --upgrade

One remaining error (I’m at the latest Wagtail, 2.4):

wagtail-textract 1.0 has requirement wagtail<2.2,>=2, but you'll have wagtail 2.4 which is incompatible.
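To illustrate why pip rejects this combination, here is a minimal Python sketch of the range check implied by `wagtail<2.2,>=2`. It is a simplification for illustration only: the helper names (`parse_version`, `in_range`) are hypothetical, it handles plain dotted numeric versions only, and pip itself applies the full PEP 440 rules rather than this logic.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.4' into a tuple like (2, 4)."""
    return tuple(int(part) for part in text.split("."))


def _pad(version, length):
    """Right-pad with zeros so that (2,) and (2, 0) compare as equal."""
    return version + (0,) * (length - len(version))


def in_range(installed, minimum, below):
    """Return True if minimum <= installed < below (hypothetical helper)."""
    parts = [parse_version(v) for v in (installed, minimum, below)]
    width = max(len(p) for p in parts)
    inst, low, high = (_pad(p, width) for p in parts)
    return low <= inst < high


# The constraint from the error message: wagtail>=2 and wagtail<2.2
wagtail_24_ok = in_range("2.4", "2", "2.2")   # installed version in this report
wagtail_21_ok = in_range("2.1", "2", "2.2")   # a version inside the pinned range
```

Under this check Wagtail 2.4 falls outside the pinned range because of the `<2.2` upper bound, while 2.1 would pass, which is why dropping the upper bound resolves the conflict.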

Would you be willing to remove the wagtail<2.2 dependency? If not, I could do a little testing for you by forking and removing that dependency and installing from my fork, but my testing wouldn’t be extensive. I would have around a hundred documents that I could run the transcription command on, but none of them would require OCR.

I would be willing to propose a re-write of your installation instructions based on the above (you could likely get rid of having to mention the statements about incompatibility errors).

@khink
Contributor

khink commented Apr 10, 2019

@DanAtShenTech Yes, I'm completely okay with removing that restriction. I'm not sure why it's in there; it may have been a conservative move. But it looks like there's no reason for it now.

We'd have to update the build matrix as well.

I'd be happy to accept a PR.

@DanielSwain
Contributor Author

I've submitted a PR. Would you be willing to update the install script to install from git+https://github.com/deanmalmgren/textract.git rather than from PyPI? I imagine this is non-standard, but if done, then in the install instructions I could remove the notes about errors and add a note to mention that installation of textract is from @deanmalmgren's github repo due to the PyPI resource not being kept up-to-date.

@khink
Contributor

khink commented Apr 13, 2019

Hi Dan,

That does not seem like the proper solution. But maybe you could document the issues you have with textract itself in the README, and show how users can install it directly from VCS to solve these issues?
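As one possible shape for such a README note, the VCS install could be sketched in Python as below. This is an assumption-laden sketch, not part of wagtail-textract: the function names are hypothetical, and the only facts taken from this thread are the repository URL and the use of `pip install --upgrade`.

```python
import subprocess
import sys

# Repository URL quoted earlier in this thread.
TEXTRACT_REPO = "git+https://github.com/deanmalmgren/textract.git"


def build_install_command(source=TEXTRACT_REPO):
    """Build the pip command that installs textract from VCS (hypothetical helper)."""
    # Using `python -m pip` targets the interpreter that runs this script,
    # which matters inside a virtualenv or pipenv shell.
    return [sys.executable, "-m", "pip", "install", "--upgrade", source]


def install_textract_from_vcs():
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(build_install_command())
```

Calling `install_textract_from_vcs()` would run the same command as the one-line `pip install` shown in the first comment; the one-liner is likely all a README needs, and the function form is only useful if the install has to be scripted.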

@DanielSwain
Contributor Author

OK Kees. As soon as you post to PyPI, I'll go through the whole process of installing and then provide a PR for an update to the README.

@DanielSwain
Contributor Author

DanielSwain commented Apr 17, 2019

I wanted to bring some information that I just discovered to the attention of anyone reading this issue. Back in 2016 @deanmalmgren called for someone to take over the Textract repo. He tweeted about this need as recently as April 9, 2019. A review of his commit history shows his last commit to the Textract repo was in the summer of 2017. While I've been able to get document extraction to work somewhat well using wagtail_textract, it feels pretty brittle. I still haven't gotten OCR to work when uploading a file, though, and OCR'ed data is not saved with the PDF - see this issue. Also, I use pipenv and can't yet produce a Pipfile.lock to use in production because of dependency issues related to the repo not being kept up-to-date. I'm not at a point where I could take over maintenance of this repo, but I wanted to point this problem out to @khink in particular in case he is. One dependency that would be nice to update is Tesseract, moving from 3.x to the latest 4.x.
