📝 This program processes every PDF and image file in a folder and extracts the tables they contain.
- ✅ Image/PDF preprocessing
- ✅ Hough line transformation
- ✅ Intersections processing
- ✅ Table formulation based on intersections
- ✅ Text recognition based on OCR
- ✅ Automatic page rotation
- ✅ Short column line detection problem
- ⬜️ Underline problem
- ⬜️ Thin line problem
- ⬜️ Multi-table problem
- ⬜️ Repeated headers on consecutive pages
- ⬜️ Truncated text across consecutive pages
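The intersection and table-formulation steps in the list above can be illustrated with a small, dependency-free sketch. It assumes lines in the `(rho, theta)` normal form that OpenCV's `cv2.HoughLines` returns, and a complete grid with no merged cells. The names `intersect`, `cluster`, and `grid_cells` and the `tol` parameter are hypothetical and are not taken from `table_extraction.py`, whose actual implementation may differ.

```python
import math
from itertools import product


def intersect(line_a, line_b):
    """Intersect two lines given in Hough normal form (rho, theta).

    Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    Returns (x, y), or None when the lines are (nearly) parallel.
    """
    (rho1, th1), (rho2, th2) = line_a, line_b
    det = math.cos(th1) * math.sin(th2) - math.sin(th1) * math.cos(th2)
    if abs(det) < 1e-9:
        return None
    x = (rho1 * math.sin(th2) - rho2 * math.sin(th1)) / det
    y = (rho2 * math.cos(th1) - rho1 * math.cos(th2)) / det
    return x, y


def cluster(values, tol=5.0):
    """Merge sorted coordinates that lie within `tol` of each other into their mean."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def grid_cells(points, tol=5.0):
    """Turn intersection points into rows of cell boxes (left, top, right, bottom).

    Assumption: the table is a full grid, so every pair of neighbouring
    column and row coordinates bounds one cell.
    """
    points = [p for p in points if p is not None]
    xs = cluster([p[0] for p in points], tol)
    ys = cluster([p[1] for p in points], tol)
    return [
        [(xs[j], ys[i], xs[j + 1], ys[i + 1]) for j in range(len(xs) - 1)]
        for i in range(len(ys) - 1)
    ]


# Three vertical lines (theta = 0) and two horizontal lines (theta = pi/2).
vertical = [(10, 0.0), (110, 0.0), (210, 0.0)]
horizontal = [(20, math.pi / 2), (60, math.pi / 2)]
points = [intersect(v, h) for v, h in product(vertical, horizontal)]
cells = grid_cells(points)
```

With the sample lines, `cells` holds one row of two boxes. Each box could then be cropped from the page image and passed to OCR.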
- Python 3.7.10
- numpy (1.21.6)
- pandas (1.3.5)
- tqdm (4.61.2)
- opencv-python (4.5.4.60)
- pdf2image (1.16.2)
  - Tutorial
  - On Windows, remember to add the Poppler bin folder to the PATH environment variable.
- PyTesseract (5.3.0)
  - macOS instructions (using MacPorts):

        pip3 install pytesseract
        sudo port install tesseract
        sudo port install tesseract-eng
        sudo port install tesseract-deu
        sudo port install tesseract-osd
        export TESSDATA_PREFIX=/opt/local/share/tessdata/
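Before running the program, it can help to confirm that the external tools are reachable. The following is a minimal sketch using only the standard library. It assumes the relevant executables are `pdftoppm` (a Poppler utility that pdf2image calls) and `tesseract`. The helper name `check_external_tools` is hypothetical and not part of this project.

```python
import os
import shutil


def check_external_tools(tools=("pdftoppm", "tesseract")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in check_external_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    # Tesseract looks up its language data via TESSDATA_PREFIX when it is set.
    print("TESSDATA_PREFIX:", os.environ.get("TESSDATA_PREFIX", "(not set)"))
```

If either tool is reported as missing, revisit the PATH setup for Poppler or the Tesseract installation above.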
- To save each page of the PDF file, add --SAVE_EACH_PAGE:

      python .\table_extraction.py --SAVE_EACH_PAGE

- To draw and visualize the Hough lines and intersection points on the image, add --DRAW:

      python .\table_extraction.py --DRAW

- To set the Hough line threshold, add --THRESHOLD followed by a number:

      python .\table_extraction.py --THRESHOLD 1300

- To set the DPI parameter, add --DPI followed by a number:

      python .\table_extraction.py --DPI 200

- To specify the folder containing the PDF files, add --FILES_DIR followed by a path:

      python .\table_extraction.py --FILES_DIR ./
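For readers who want to see how flags like these fit together, here is a hedged `argparse` sketch. Only the flag names come from the usage above. The `build_parser` helper, the help texts, and the defaults (`None` placeholders and `"./"` for the folder) are assumptions, and the parser in `table_extraction.py` may be defined differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (defaults are placeholders)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from PDF and image files."
    )
    parser.add_argument("--SAVE_EACH_PAGE", action="store_true",
                        help="save each page of the PDF file")
    parser.add_argument("--DRAW", action="store_true",
                        help="draw Hough lines and intersection points on the image")
    # Assumption: None means "use the program's built-in value".
    parser.add_argument("--THRESHOLD", type=int, default=None,
                        help="Hough line threshold")
    parser.add_argument("--DPI", type=int, default=None,
                        help="DPI used when rendering PDF pages")
    # Assumption: the current folder is the default input location.
    parser.add_argument("--FILES_DIR", default="./",
                        help="folder containing the input files")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["--DRAW", "--THRESHOLD", "1300", "--DPI", "200"])
```

The flags are independent, so under this sketch they can be combined in a single invocation.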