Vincent Rasneur [email protected]
- pdftk
- ghostscript
- imagemagick
- tesseract
- aspell (optional)
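
Before running the script, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of ocr.sh: `check_deps` is a hypothetical helper, and the executable names are assumptions (ghostscript normally installs `gs`, and imagemagick installs `convert`, or `magick` in version 7). Which executables the script actually calls is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are missing.
# Prints one missing command name per line; returns 1 if any is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed executable names for the required dependencies listed above.
check_deps pdftk gs convert tesseract || echo "install the commands listed above" >&2
```

aspell is left out of the call because it is optional.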
By default, the script uses the French dictionaries of tesseract and aspell.
Use the -t argument to change the tesseract dictionary.
Use the -a argument to change the aspell dictionary.
By default, the script does not spell-check the output text. To enable spell-checking, add -s (or use the -a argument, which also turns it on).
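
The way these options interact can be sketched with `getopts`. This is an illustration only, not code taken from ocr.sh: the function and variable names are hypothetical, and the default values `fra` (tesseract's French language code) and `fr` (aspell's French language code) are assumed to be what the script uses for French.

```shell
#!/bin/sh
# Hypothetical option parsing; not taken from ocr.sh itself.
parse_args() {
    tess_lang=fra     # assumed default: tesseract language code for French
    aspell_lang=fr    # assumed default: aspell language code for French
    spellcheck=0      # spell-checking is off unless requested
    OPTIND=1          # reset so the function can be called more than once
    while getopts "t:a:s" opt; do
        case $opt in
            t) tess_lang=$OPTARG ;;
            a) aspell_lang=$OPTARG; spellcheck=1 ;;  # -a implies spell-checking
            s) spellcheck=1 ;;
            *) return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    pdf=${1:-}        # first remaining argument is the PDF file
}

parse_args -t eng -a en document.pdf
echo "tesseract=$tess_lang aspell=$aspell_lang spellcheck=$spellcheck file=$pdf"
# prints: tesseract=eng aspell=en spellcheck=1 file=document.pdf
```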
To OCR a PDF file
ocr.sh document.pdf
To OCR a PDF file and spell-check each page
ocr.sh -s document.pdf
To OCR an English PDF and spell-check it
ocr.sh -t eng -a en document.pdf
For a PDF file named doc1.pdf, the script:
- creates a directory named doc1
- for each PDF page, creates a file named pg_<number>.txt inside this directory
Or, if the -c argument is used, the script:
- creates a directory named doc1
- creates a single file named doc1/doc1.txt
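
If a document was already processed without -c, the per-page files can be combined by hand. The sketch below is not part of ocr.sh; `merge_pages` is a hypothetical helper that assumes only the pg_<number>.txt naming described above. It sorts on the page number so that pg_10.txt comes after pg_9.txt rather than after pg_1.txt.

```shell
#!/bin/sh
# Hypothetical helper: print the per-page text files of a directory
# in numeric page order. Files not named pg_<number>.txt are ignored.
merge_pages() {
    dir=$1
    ls "$dir" \
        | grep -E '^pg_[0-9]+\.txt$' \
        | sort -t_ -k2 -n \
        | while IFS= read -r f; do
              cat "$dir/$f"
          done
}
```

For example, `merge_pages doc1 > doc1_all.txt` would write the combined text to a file named doc1_all.txt (an arbitrary name chosen for this example).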