table-extraction

📝 This program processes every PDF and image file in a folder and extracts the tables they contain.

🔰 Table extraction program by Roderick & Kevin, written for an internship project at Capacura GmbH.

Progress (2023/03/08)

✅ Image/PDF preprocessing
✅ Hough line transformation
✅ Intersections processing
✅ Table formulation based on intersections
✅ Text recognition based on OCR
✅ Automatic page rotation
✅ Detection of short column lines
⬜️ Underline problem
⬜️ Thin line problem
⬜️ Multi-table problem
⬜️ Repeated headers on consecutive pages
⬜️ Text truncated across consecutive pages
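
The intersection and table-formulation steps listed above can be illustrated with a small sketch. This is not the code in table_extraction.py; it is a minimal example that assumes lines are given in the (rho, theta) normal form that OpenCV's Hough transform returns, and that the table is a clean grid of horizontal and vertical lines. The function names intersect and cells_from_intersections are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Line = Tuple[float, float]         # (rho, theta): x*cos(theta) + y*sin(theta) = rho
Point = Tuple[int, int]            # (x, y) in pixels
Cell = Tuple[int, int, int, int]   # (left, top, right, bottom)


def intersect(a: Line, b: Line) -> Optional[Point]:
    """Intersect two lines in normal form; return None if they are parallel."""
    (r1, t1), (r2, t2) = a, b
    det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
    if abs(det) < 1e-9:
        return None
    x = (r1 * math.sin(t2) - r2 * math.sin(t1)) / det
    y = (r2 * math.cos(t1) - r1 * math.cos(t2)) / det
    return round(x), round(y)


def cells_from_intersections(points: List[Point]) -> List[List[Cell]]:
    """Group grid-aligned intersection points into rows of rectangular cells."""
    present = set(points)
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    rows = []
    for top, bottom in zip(ys, ys[1:]):
        row = []
        for left, right in zip(xs, xs[1:]):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            # Keep a cell only when all four of its corners were detected.
            if corners <= present:
                row.append((left, top, right, bottom))
        rows.append(row)
    return rows


# Assumed example input: three vertical lines (theta = 0) and three
# horizontal lines (theta = pi/2), as a Hough transform might report them.
vertical = [(0.0, 0.0), (120.0, 0.0), (300.0, 0.0)]
horizontal = [(0.0, math.pi / 2), (50.0, math.pi / 2), (100.0, math.pi / 2)]

points = [p for v in vertical for h in horizontal if (p := intersect(v, h)) is not None]
table = cells_from_intersections(points)
```

Real scans rarely produce perfectly aligned points, so a practical version would first cluster nearby intersections before building cells. Note that the walrus operator used above requires Python 3.8 or newer.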

Python Version

3.7.10

Packages

numpy (1.21.6)
pandas (1.3.5)
tqdm (4.61.2)
opencv-python (4.5.4.60)
pdf2image (1.16.2)
- Tutorial
- On Windows, remember to add the poppler bin folder to the PATH environment variable.
PyTesseract (5.3.0)
- Tutorial 1
- Tutorial 2
- MacOS Instructions
  
  1. Install the Python wrapper package for Tesseract
```
pip3 install pytesseract
```
  2. Install Tesseract
```
sudo port install tesseract
```
  3. Install the Tesseract data packages required for OCR and auto-rotation
```
sudo port install tesseract-eng
sudo port install tesseract-deu
sudo port install tesseract-osd
```
  4. Set the TESSDATA_PREFIX environment variable to the Tesseract data directory
```
export TESSDATA_PREFIX=/opt/local/share/tessdata/
```
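
After these steps, a quick check can confirm that the setup is visible from Python. The sketch below is not part of table_extraction.py; it uses only the standard library and assumes the data files follow Tesseract's usual naming (eng.traineddata, deu.traineddata, osd.traineddata) inside the directory named by TESSDATA_PREFIX. The function name check_tesseract_setup is hypothetical.

```python
import os
import shutil
from pathlib import Path

# Data sets installed in step 3: English, German, and orientation/script detection.
REQUIRED_DATA = ("eng", "deu", "osd")


def check_tesseract_setup(env=None):
    """Return a list of human-readable problems; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []

    # The tesseract executable must be reachable through PATH.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX must point at a directory holding the .traineddata files.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    else:
        for name in REQUIRED_DATA:
            if not (Path(prefix) / f"{name}.traineddata").is_file():
                problems.append(f"missing {name}.traineddata in {prefix}")
    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("WARNING:", problem)
```

The env parameter exists only so the check can be exercised with a fake environment; calling the function without arguments inspects the real one.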

Instructions

To save each page in the PDF file, add --SAVE_EACH_PAGE:
```
python .\table_extraction.py --SAVE_EACH_PAGE
```
To draw the detected Hough lines and intersection points on the image, add --DRAW:
```
python .\table_extraction.py --DRAW
```
To set the Hough line threshold, add --THRESHOLD followed by a number:
```
python .\table_extraction.py --THRESHOLD 1300
```
To set the DPI, add --DPI followed by a number:
```
python .\table_extraction.py --DPI 200
```
To specify the folder containing the PDF files, add --FILES_DIR followed by a path:
```
python .\table_extraction.py --FILES_DIR ./
```
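
For readers who want to see how such flags are typically wired up, here is a minimal argparse sketch. The flag names come from the commands above; everything else (the help texts, the types, and especially the default values) is an assumption for illustration and may differ from what table_extraction.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags described in this README."""
    parser = argparse.ArgumentParser(description="Table extraction (illustrative CLI sketch)")
    parser.add_argument("--SAVE_EACH_PAGE", action="store_true",
                        help="save each page of the PDF file")
    parser.add_argument("--DRAW", action="store_true",
                        help="draw Hough lines and intersection points on the image")
    # Defaults below are placeholders, not the script's real defaults.
    parser.add_argument("--THRESHOLD", type=int, default=None,
                        help="Hough line threshold")
    parser.add_argument("--DPI", type=int, default=None,
                        help="DPI parameter")
    parser.add_argument("--FILES_DIR", default="./",
                        help="folder containing the input files")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--DRAW", "--THRESHOLD", "1300", "--DPI", "200", "--FILES_DIR", "./"]
)
```

With a parser like this one, several flags can be given in a single command, as the parsed argument list in the example shows.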
