I am trying to extract text from hundreds of thousands of PDFs using a computer cluster. I want to run commands like
textract cl-exec-201666USCOC.pdf -o test1.txt -m tesseract
where the example PDF is from https://www.ncua.gov/files/comment-letters/2016/cl-exec-201666USCOC.pdf.
When I run this command, I get the following message:
The command
tesseract /tmp/tmpwxgqr7lj/conv-4.ppm stdout
failed with exit code 1
------------- stdout -------------
b''
------------- stderr -------------
b'Error in findFileFormatStream: truncated file\nError during processing.\n'
To run the code, the cluster requires us to build a Docker image. Here's the Dockerfile I have:
FROM python:3.7
RUN echo 'deb http://ftp.us.debian.org/debian stretch main contrib non-free' >> /etc/apt/sources.list
RUN apt-get update -y && apt-get install -y vim python-dev pstotext \
    libxml2-dev libxslt1-dev antiword unrtf \
    poppler-utils tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
RUN pip install ipython textract pandas tqdm beautifulsoup4 joblib
Have you already fixed it?
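One way to narrow this down: the "truncated file" message comes from the image-reading library underneath tesseract, and it refers to the intermediate conv-4.ppm page image, not to the PDF itself. As far as I can tell, textract's tesseract method first renders each PDF page to a .ppm file in a temp directory and then runs tesseract on every page, so an incomplete .ppm (for example because /tmp in the container ran out of space, which is only a guess here) would produce exactly this error. Below is a minimal sketch for checking whether a page image is shorter than its own header says it should be. It assumes the files are binary PPM ("P6"); the function names are made up for illustration and are not part of textract or tesseract.

```python
def ppm_is_truncated(data):
    """Return True if `data` (bytes of a binary P6 PPM) is incomplete.

    Hypothetical helper: parses the header (magic, width, height, maxval)
    and compares the pixel bytes present with the size the header implies.
    """
    tokens = []
    pos = 0
    n = len(data)
    while len(tokens) < 4:
        # Skip whitespace and '#' comments between header fields.
        while pos < n and (data[pos:pos + 1].isspace() or data[pos:pos + 1] == b"#"):
            if data[pos:pos + 1] == b"#":
                while pos < n and data[pos:pos + 1] not in (b"\n", b"\r"):
                    pos += 1
            else:
                pos += 1
        start = pos
        while pos < n and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            return True  # ran out of bytes before the header was complete
        tokens.append(data[start:pos])

    if tokens[0] != b"P6":
        return True  # not the format this sketch understands
    try:
        width, height, maxval = (int(t) for t in tokens[1:])
    except ValueError:
        return True

    offset = pos + 1  # exactly one whitespace byte follows maxval
    bytes_per_sample = 1 if maxval < 256 else 2
    expected = width * height * 3 * bytes_per_sample
    return n - offset < expected


def ppm_file_is_truncated(path):
    """Same check for a file on disk (reads the whole file into memory)."""
    with open(path, "rb") as f:
        return ppm_is_truncated(f.read())
```

To use it, render the pages yourself into a directory you control (pdftoppm from poppler-utils is already in the image above) and call ppm_file_is_truncated on each resulting file. If page 4 comes back as truncated, the problem is in the rendering step or the disk, not in tesseract.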