OCR-LLM

OCR-LLM extracts texts from PDFs using Document AI, fix bad structure extraction using GPT 4 and answer informations on text using GPT 4.

You can use this repository to extracts informations in a structured way. The config/prompt/pycan be modified to fit your use case

How it works

The app uses the Google Cloud Document AI to extract the text from the PDFs. Then, it uses GPT 4 to fix the structure of the text and answer questions about the text.

Prerequisites

Poetry
A GCP account with the Document AI API enabled and a Document AI processor created

poetry init # init poetry with the default settings
poetry lock --no-update # lock the dependencies
poetry shell # emulate the created environment
poetry run python -m ipykernel install --user --name .ocr # Add the environment to jupyter if you try the tutorials

Use

Try the Streamlit app:

poetry run streamlit run app.py

Improvements

Note that this implementation is very basic and need to be improved to be used in production:

Remove all the bad summaries from page that does not contains informations you want to extract
Choose the best table structure or output a json information summary instead of a markdown table (used for demonstration purpose)

Licence

This project is licensed under the terms of the Apache 2.0 license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

OCR-LLM

How it works

Prerequisites

Use

Improvements

Licence

Files

README.md

Latest commit

History

README.md

File metadata and controls

OCR-LLM

How it works

Prerequisites

Use

Improvements

Licence