This program can process all the .pdf and image files in the folder and extract the table within them.

roderick1014/table-extraction

table-extraction

📝This program can process all the .pdf and image files in the folder and extract the table within them.

🔰Table Extraction Program by Roderick & Kevin, developed for an internship project at Capacura GmbH.


Progress (2023/03/08)

  • Image/PDF preprocessing
  • Hough line transformation
  • Intersections processing
  • Table formulation based on intersections
  • Text recognition based on OCR
  • Auto-rotation of pages
  • Short column line detection problem
  • ⬜️ Underline problem
  • ⬜️ Thin line problem
  • ⬜️ Multi-table problem
  • ⬜️ Repeated headers across consecutive pages
  • ⬜️ Truncated text across consecutive pages
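
The intersection and table-formulation steps listed above can be illustrated with a small sketch. It assumes lines in the (rho, theta) normal form that OpenCV's cv2.HoughLines returns, and it assumes a simple, fully ruled, axis-aligned table. The function names `intersect` and `table_cells` are hypothetical and are not taken from this repository, whose actual implementation may differ.

```python
import math
from itertools import product


def intersect(line_a, line_b, eps=1e-9):
    """Intersect two lines given in Hough normal form (rho, theta).

    Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    Returns (x, y), or None when the lines are (nearly) parallel.
    """
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    # Determinant of the 2x2 system [[cos t1, sin t1], [cos t2, sin t2]].
    det = math.cos(theta1) * math.sin(theta2) - math.sin(theta1) * math.cos(theta2)
    if abs(det) < eps:
        return None
    # Cramer's rule.
    x = (rho1 * math.sin(theta2) - rho2 * math.sin(theta1)) / det
    y = (rho2 * math.cos(theta1) - rho1 * math.cos(theta2)) / det
    return (x, y)


def table_cells(vertical, horizontal):
    """Build rows of cell boxes (x0, y0, x1, y1) from two sets of lines.

    Assumption: every vertical line crosses every horizontal line,
    so the intersections form a regular grid.
    """
    points = [intersect(v, h) for v, h in product(vertical, horizontal)]
    points = [p for p in points if p is not None]
    # Snap coordinates to whole pixels and deduplicate them.
    xs = sorted({round(x) for x, _ in points})
    ys = sorted({round(y) for _, y in points})
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
        for y0, y1 in zip(ys, ys[1:])
    ]


if __name__ == "__main__":
    vertical = [(0, 0.0), (100, 0.0), (200, 0.0)]        # x = 0, 100, 200
    horizontal = [(0, math.pi / 2), (50, math.pi / 2)]   # y = 0, 50
    for row in table_cells(vertical, horizontal):
        print(row)
```

Each cell box could then be cropped from the page image and passed to OCR. Merged cells, skewed pages, and short or missing lines (some of the open items above) are not handled by this sketch.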

Python Version

  • 3.7.10

Packages

  • numpy (1.21.6)

  • pandas (1.3.5)

  • tqdm (4.61.2)

  • opencv-python (4.5.4.60)

  • pdf2image (1.16.2)

    • Tutorial
    • On Windows, remember to add poppler's bin folder to the PATH environment variable.
  • PyTesseract (Tesseract 5.3.0)

    • Tutorial 1

    • Tutorial 2

    • macOS Instructions (MacPorts)

      1. Install the Python wrapper for Tesseract

      pip3 install pytesseract

      2. Install Tesseract

      sudo port install tesseract

      3. Install the Tesseract data packages required for OCR (English, German) and auto-rotation (orientation and script detection)

      sudo port install tesseract-eng
      sudo port install tesseract-deu
      sudo port install tesseract-osd

      4. Set the TESSDATA_PREFIX environment variable to the Tesseract data directory

      export TESSDATA_PREFIX=/opt/local/share/tessdata/
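
To confirm that the setup above worked, a small check along these lines may help. It is only a sketch: the helper name `missing_traineddata` is hypothetical, and it relies on Tesseract's convention of storing each language as a `<name>.traineddata` file inside the data directory.

```python
import os
from pathlib import Path

# Data sets installed in step 3: English, German, orientation/script detection.
REQUIRED = ("eng", "deu", "osd")


def missing_traineddata(tessdata_dir, required=REQUIRED):
    """Return the names in `required` that have no <name>.traineddata file."""
    directory = Path(tessdata_dir)
    return [
        name for name in required
        if not (directory / f"{name}.traineddata").is_file()
    ]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        missing = missing_traineddata(prefix)
        print("missing:", ", ".join(missing) if missing else "none")
```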

Instructions

  • To save each page in the PDF file, add --SAVE_EACH_PAGE:

    python .\table_extraction.py --SAVE_EACH_PAGE

  • To draw and visualize the Hough lines and intersection points on the image, add --DRAW:

    python .\table_extraction.py --DRAW

  • To set the Hough line threshold, add --THRESHOLD 'number':

    python .\table_extraction.py --THRESHOLD 1300

  • To set the DPI used when converting PDF pages to images, add --DPI 'number':

    python .\table_extraction.py --DPI 200

  • To specify the folder containing the input files, add --FILES_DIR 'path':

    python .\table_extraction.py --FILES_DIR ./
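
For readers who want to see how flags like these are typically wired up, here is a minimal argparse sketch. The flag names come from the instructions above; the default values, help texts, and the `build_parser` helper are assumptions for illustration and may not match table_extraction.py.

```python
import argparse


def build_parser():
    """Build a parser for the flags listed above (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from PDF and image files."
    )
    parser.add_argument("--SAVE_EACH_PAGE", action="store_true",
                        help="save each page of the PDF file")
    parser.add_argument("--DRAW", action="store_true",
                        help="draw Hough lines and intersection points")
    # Assumed defaults, borrowed from the example values in the instructions.
    parser.add_argument("--THRESHOLD", type=int, default=1300,
                        help="Hough line threshold")
    parser.add_argument("--DPI", type=int, default=200,
                        help="DPI used when rendering PDF pages")
    parser.add_argument("--FILES_DIR", default="./",
                        help="folder containing the input files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Boolean switches use `store_true`, so they are off unless passed, and several flags can be combined in a single invocation.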
