Truncated File error #373

Open
libgober opened this issue Mar 19, 2021 · 1 comment

Comments

@libgober

I am trying to extract text from hundreds of thousands of PDFs using a computer cluster. I want to run commands like

textract cl-exec-201666USCOC.pdf -o test1.txt -m tesseract

where the example PDF is from https://www.ncua.gov/files/comment-letters/2016/cl-exec-201666USCOC.pdf.
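Since the goal is to process a very large number of PDFs, it may help to drive textract from Python so that one failing file is logged rather than stopping the whole job. The following is only a sketch: `extract_many` and `textract_ocr` are hypothetical helper names, and it assumes `textract.process(path, method="tesseract")` behaves like the command-line call above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def textract_ocr(path: str) -> bytes:
    """Run textract with the tesseract method on one file."""
    import textract  # imported lazily so extract_many can be tested without it

    return textract.process(path, method="tesseract")


def extract_many(
    paths: Iterable[str],
    out_dir: str,
    extract: Callable[[str], bytes] = textract_ocr,
) -> Tuple[int, Dict[str, str]]:
    """Write <out_dir>/<stem>.txt per input file; record failures instead of raising."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = 0
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            text = extract(path)
        except Exception as exc:  # e.g. the error textract raises when tesseract exits non-zero
            failures[path] = f"{type(exc).__name__}: {exc}"
            continue
        (out / (Path(path).stem + ".txt")).write_bytes(text)
        done += 1
    return done, failures
```

Calling `extract_many(["cl-exec-201666USCOC.pdf"], "out")` would return the number of files written and a dictionary mapping each failed path to its error message, which makes it easier to see how many documents hit the same problem.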

When I run this command I get the following message:

The command tesseract /tmp/tmpwxgqr7lj/conv-4.ppm stdout failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Error in findFileFormatStream: truncated file\nError during processing.\n'
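The stderr text suggests that tesseract was handed a page image (here `conv-4.ppm`) that is empty or cut short. One way to check that reading by hand is to render the pages into a directory yourself, for example with `pdftoppm cl-exec-201666USCOC.pdf pages/conv`, and then look for suspiciously small files. The sketch below does only that last step; the name `find_tiny_files` and the 12-byte default threshold are illustrative assumptions, not something taken from textract or tesseract.

```python
import os
from typing import List, Tuple


def find_tiny_files(directory: str, min_bytes: int = 12) -> List[Tuple[str, int]]:
    """Return (name, size) for regular files in `directory` smaller than `min_bytes`."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if os.path.isfile(full):
            size = os.path.getsize(full)
            if size < min_bytes:
                suspects.append((name, size))
    return suspects
```

If this reports a page such as `conv-4.ppm` with a size of 0, the problem would lie in the PDF-to-image step or the temporary storage rather than in the OCR itself.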

To run the code, the cluster requires us to build a Docker image. Here is the Dockerfile I have.

FROM python:3.7
RUN echo 'deb http://ftp.us.debian.org/debian stretch main contrib non-free' >> /etc/apt/sources.list
RUN apt-get update -y && apt-get install -y vim python-dev pstotext \
    libxml2-dev libxslt1-dev antiword unrtf \
    poppler-utils tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
RUN pip install ipython textract pandas tqdm beautifulsoup4 joblib

@BenjaminArmijo3

Have you already fixed it?
