Truncated File error #373

libgober · 2021-03-19T16:22:33Z

I am trying to extract text from hundreds of thousands of PDFs using a computer cluster. I want to run commands like

textract cl-exec-201666USCOC.pdf -o test1.txt -m tesseract

where the example PDF is from https://www.ncua.gov/files/comment-letters/2016/cl-exec-201666USCOC.pdf.

When I run this command I get the following message:

The command tesseract /tmp/tmpwxgqr7lj/conv-4.ppm stdout failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Error in findFileFormatStream: truncated file\nError during processing.\n'
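The "truncated file" message suggests that tesseract was handed a page image that was empty or incomplete. One way to check this is to regenerate the page images by hand (for example with pdftoppm from poppler-utils, which is installed in the image) and verify each one before OCR. The following is only a minimal sketch under those assumptions: the helper name ppm_is_complete is hypothetical and not part of textract, and it only handles binary "P6" PPM files.

```python
import os


def ppm_is_complete(path):
    """Return True if a binary PPM (P6) file is at least as large as its header promises."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(512)  # the PPM header is short; 512 bytes is plenty

    tokens = []
    pos = 0
    # Collect the four header tokens: magic number, width, height, maxval.
    while len(tokens) < 4:
        # Skip whitespace between tokens.
        while pos < len(header) and header[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(header):
            return False  # file ended before the header was complete
        if header[pos:pos + 1] == b"#":
            # Skip a comment up to the end of the line.
            while pos < len(header) and header[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(header) and not header[pos:pos + 1].isspace():
            pos += 1
        tokens.append(header[start:pos])

    if tokens[0] != b"P6":
        return False
    try:
        width, height, maxval = (int(t) for t in tokens[1:])
    except ValueError:
        return False
    if width <= 0 or height <= 0 or maxval <= 0:
        return False

    bytes_per_sample = 1 if maxval < 256 else 2
    expected = width * height * 3 * bytes_per_sample
    # Exactly one whitespace byte separates the header from the pixel data.
    return size - (pos + 1) >= expected
```

If this returns False for a page image, the problem is likely upstream of tesseract, for instance in the PDF-to-image step or in the space available in the temporary directory, rather than in the OCR step itself.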

To run code on the cluster, we have to build a Docker image. Here is the Dockerfile I have:

FROM python:3.7
RUN echo 'deb http://ftp.us.debian.org/debian stretch main contrib non-free' >> /etc/apt/sources.list
RUN apt-get update -y && apt-get install -y vim python-dev pstotext \
    libxml2-dev libxslt1-dev antiword unrtf \
    poppler-utils tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
RUN pip install ipython textract pandas tqdm beautifulsoup4 joblib

BenjaminArmijo3 · 2022-07-07T23:12:40Z

Have you already fixed it?
