This Python script extracts text from multiple image files, combines it, and saves the result as a plain-text (.txt) file and an MS Word (.docx) file.
The following dependencies are required to run this script:
- Python 3
- Pillow (PIL) library
- pytesseract library
- Tesseract OCR engine
- python-docx library (imported as `docx`)
-
In the VS Code terminal:
Install Pillow, an image processing library for Python, using this command:
`pip install pillow`
Install the Tesseract OCR engine (tesseract-ocr) using this command (Debian/Ubuntu):
`sudo apt install tesseract-ocr`
Install pytesseract, a Python wrapper for tesseract-ocr, using this command:
`pip install pytesseract`
Install python-docx, a library for creating and updating MS Word (.docx) files, using this command:
`pip install python-docx`
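
To confirm that the installs above succeeded, a short check along these lines may help. It is only a sketch and not part of the repository: the function name `missing_dependencies` is hypothetical, and it assumes the Tesseract executable is on your `PATH` under the name `tesseract`.

```python
import importlib.util
import shutil

# Import names differ from the pip package names:
# Pillow is imported as PIL, python-docx is imported as docx.
REQUIRED_MODULES = ["PIL", "pytesseract", "docx"]


def missing_dependencies(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        missing.append(binary + " (executable)")
    return missing


print(missing_dependencies() or "All dependencies found")
```

If the list is not empty, rerun the matching install command for each item it names.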
-
Clone or download this repository to your local machine.
-
Place your image files in the imagetotxt folder.
-
List the names of the image files you want to convert in the ordered.txt file, in the order you want the images to be processed.
-
Run the script using the following command:
`/bin/python3 /your_directory/TextGenerator-Tesseract/image_example.py`
-
The output will be saved in output.txt and output.docx files.
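
For orientation, the steps above can be sketched roughly as follows. This is not the code in image_example.py, only a minimal illustration under stated assumptions: ordered.txt holds one file name per line, the texts are joined with a blank line between images, and the function names (`read_image_order`, `ocr_image`, `combine_texts`, `save_outputs`) are hypothetical.

```python
from pathlib import Path


def read_image_order(order_file):
    """Return the image names listed in the order file, skipping blank lines.

    Assumes one file name per line (an assumption, not confirmed by the repo).
    """
    lines = Path(order_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def ocr_image(image_path):
    """Extract text from one image with Tesseract via pytesseract."""
    # Imported here so the other helpers work without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def combine_texts(image_dir, names, ocr=ocr_image):
    """Run OCR on each image in the given order and join the results."""
    texts = [ocr(Path(image_dir) / name) for name in names]
    # strip() also drops the trailing form feed Tesseract may append,
    # which python-docx would otherwise reject as an invalid XML character.
    return "\n\n".join(text.strip() for text in texts)


def save_outputs(text, txt_path="output.txt", docx_path="output.docx"):
    """Write the combined text to a .txt file and a .docx file."""
    from docx import Document

    Path(txt_path).write_text(text, encoding="utf-8")
    document = Document()
    for paragraph in text.split("\n\n"):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

With these helpers, a run would look like `names = read_image_order("ordered.txt")` followed by `save_outputs(combine_texts("imagetotxt", names))`. The location of ordered.txt used here is a guess, so adjust the paths to match your checkout.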
- I have only included images in JPEG, JPG, and PNG formats.
- You can read more in the Tesseract documentation.