Commit

bug: PDF file upload failed - Could not initialize tesseract
PDF uploads were failing with the following error: unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
azaylamba committed Dec 7, 2024
1 parent b9ba6e3 commit 0f1576a
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions lib/shared/file-import-dockerfile
@@ -10,6 +10,9 @@ RUN pip uninstall -y `pip freeze | grep torch` && pip uninstall -y `pip freeze |
# Torch is needed for image analysis in pdfs (using CPU version)
RUN pip install torch==2.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

# This is required to process the pdf files produced by 'Microsoft: Print to PDF'
RUN apk add --no-cache tesseract-eng

# Remove previous layers to create a smaller image
FROM scratch
COPY --from=source / /
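
The error above says Tesseract could not open `eng.traineddata`, so one way to confirm the fix inside the built image is to check that the file exists before running OCR. The following is a minimal sketch, not part of this commit. The helper name `find_traineddata` is hypothetical. It assumes the language data lives in the directory named by `TESSDATA_PREFIX`, falling back to `/usr/share/tessdata` (the path from the error message). Tesseract's real lookup may differ between versions.

```python
import os
from pathlib import Path


def find_traineddata(lang="eng", default_dir="/usr/share/tessdata", env=None):
    """Return the path to <lang>.traineddata if it exists, else None.

    Hypothetical helper: it only approximates Tesseract's own lookup by
    checking TESSDATA_PREFIX first and then an assumed default directory.
    """
    env = os.environ if env is None else env
    # An unset or empty TESSDATA_PREFIX falls back to the default directory.
    base = Path(env.get("TESSDATA_PREFIX") or default_dir)
    candidate = base / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None
```

Calling `find_traineddata()` in the container returns `None` when the English data is missing, which corresponds to the state before `tesseract-eng` was installed.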