PDFscraper

PDFscraper uses PDFMiner and Python Tesseract (pytesseract) to text-mine PDF files.
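As a rough illustration of how PDFMiner and pytesseract can be combined, the sketch below reads a PDF's embedded text layer first and falls back to OCR only when very little text is found. This fallback strategy, the function names `needs_ocr` and `extract_pdf_text`, and the `min_chars` threshold are assumptions made for illustration; they are not taken from pdfscraper.py.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): very little extracted text suggests a scanned PDF."""
    return len(text.strip()) < min_chars


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text of a PDF, trying the text layer first and OCR second."""
    # Third-party imports are done lazily so needs_ocr() works without them.
    from pdfminer.high_level import extract_text  # provided by pdfminer.six

    text = extract_text(pdf_path)
    if not needs_ocr(text):
        return text

    # Fallback: render each page to an image (requires Poppler) and OCR it
    # (requires the Tesseract binary).
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

Calling `extract_pdf_text("/path/to/file.pdf")` requires the packages and system tools listed under Requirements.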

Requirements

PDFscraper requires Python 3.x.

The following Python packages are prerequisites:

pdfminer.six
pytesseract
chardet
Python Imaging Library (PIL) or Pillow
pdf2image

Other requirements: working installations of Google Tesseract OCR and Poppler.
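Before a first run, it can help to confirm that the prerequisites are actually visible to Python. The sketch below is not part of PDFscraper. It assumes the import names `pdfminer`, `pytesseract`, `chardet`, `PIL` and `pdf2image` for the packages above, and it assumes that Tesseract and Poppler expose the executables `tesseract` and `pdftoppm` on the PATH.

```python
import importlib.util
import shutil

# Import names assumed for the packages listed above.
PYTHON_MODULES = ["pdfminer", "pytesseract", "chardet", "PIL", "pdf2image"]
# Executable names assumed for Tesseract OCR and Poppler.
SYSTEM_TOOLS = ["tesseract", "pdftoppm"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(names):
    """Return the executables that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Print any missing prerequisites, or a confirmation if none are missing."""
    modules = missing_modules(PYTHON_MODULES)
    tools = missing_tools(SYSTEM_TOOLS)
    if modules:
        print("Missing Python packages:", ", ".join(modules))
    if tools:
        print("Missing system tools:", ", ".join(tools))
    if not modules and not tools:
        print("All prerequisites found.")
```

Calling `report()` prints whichever prerequisites could not be located.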

Usage

usage: pdfscraper.py [-h] -i INPDF -o OUTTXT [-t]

optional arguments:
  -h, --help            show this help message and exit
  -i INPDF, --input-dir INPDF
                        Path to the input pdf files
  -o OUTTXT, --output-dir OUTTXT
                        Path for the output txt files
  -t, --token-gen       Use flag to generate tokenized output
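For readers who want to see how an interface like this is typically declared, the sketch below builds an `argparse` parser that accepts the same flags as the help text above. It is a reconstruction from the help output only; the function name `build_parser` and the `dest` names are assumptions, and the real pdfscraper.py may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the help text."""
    parser = argparse.ArgumentParser(prog="pdfscraper.py")
    # dest names are assumptions, chosen so the metavars read INPDF and OUTTXT.
    parser.add_argument("-i", "--input-dir", dest="inpdf", required=True,
                        help="Path to the input pdf files")
    parser.add_argument("-o", "--output-dir", dest="outtxt", required=True,
                        help="Path for the output txt files")
    parser.add_argument("-t", "--token-gen", action="store_true",
                        help="Use flag to generate tokenized output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.inpdf, args.outtxt, args.token_gen)
```

Both -i and -o are marked `required=True` because the usage line shows them without square brackets.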

For example, to run PDFscraper on a directory of PDFs:

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory

The optional -t flag produces tokenized text for use in natural language processing (NLP) tasks. For example, to produce tokenized output:

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory -t
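To give a sense of what tokenized output can look like, here is a minimal tokenizer sketch using only the standard library. The README does not say how PDFscraper tokenizes text, so the lowercasing, the regular expression, and the function name `tokenize` are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("Text mining, with PDFs!"))
```

A dedicated NLP library would usually handle punctuation, contractions and non-ASCII text more carefully than this regular expression does.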

Docker

Alternatively, the accompanying Dockerfile can be used to run the program in a Docker container.

For example, to run with the input directory mounted at /data, which is also used as the output directory:

docker run -v "/path/to/input/pdfs:/data" --rm pdfscraper:latest -i /data -o /data

Repository: annacprice/pdf-scraper