OCR-LLM

OCR-LLM extracts text from PDFs using Document AI, fixes badly extracted structure using GPT-4, and answers questions about the text using GPT-4.

You can use this repository to extract information in a structured way. The config/prompt/py can be modified to fit your use case.

How it works

The app uses Google Cloud Document AI to extract the text from the PDFs. It then uses GPT-4 to fix the structure of the text and to answer questions about it.
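The two-stage flow described above can be sketched roughly as follows. This is an illustration only, not the repository's code: `run_pipeline`, `ocr`, `llm`, and the prompt wording are hypothetical, and the OCR and LLM steps are passed in as plain callables so the sketch runs without cloud credentials. In a real setup, `ocr` would wrap a Document AI request and `llm` a GPT-4 chat completion.

```python
from typing import Callable, Dict


def run_pipeline(
    pdf_bytes: bytes,
    question: str,
    ocr: Callable[[bytes], str],   # assumed: returns the raw text of the PDF
    llm: Callable[[str], str],     # assumed: returns the model's reply to a prompt
) -> Dict[str, str]:
    """Extract text, repair its structure, then answer a question about it."""
    # Step 1: OCR the document.
    raw_text = ocr(pdf_bytes)

    # Step 2: ask the LLM to repair the structure of the extracted text.
    fixed_text = llm(
        "Fix the structure of this OCR output without changing its content.\n\n"
        + raw_text
    )

    # Step 3: ask the LLM a question about the repaired text.
    answer = llm(
        "Answer the question using only the text below.\n\n"
        f"Question: {question}\n\n"
        f"Text:\n{fixed_text}"
    )
    return {"raw_text": raw_text, "fixed_text": fixed_text, "answer": answer}


# Stand-ins used only to exercise the flow locally (not real services).
def fake_ocr(pdf_bytes: bytes) -> str:
    return "Invoice   total :  42 EUR"


def fake_llm(prompt: str) -> str:
    # Echo the last line of the prompt with whitespace collapsed.
    return " ".join(prompt.splitlines()[-1].split())


result = run_pipeline(b"%PDF-", "What is the total?", fake_ocr, fake_llm)
print(result["fixed_text"])
```

Passing the two services in as callables keeps the orchestration testable and makes it straightforward to swap either step for another provider.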

Prerequisites

poetry init # init poetry with the default settings
poetry lock --no-update # lock the dependencies
poetry shell # activate the created environment
poetry run python -m ipykernel install --user --name .ocr # register the environment as a Jupyter kernel if you want to try the tutorials

Use

Try the Streamlit app:

poetry run streamlit run app.py

Improvements

Note that this implementation is very basic and needs to be improved before it is used in production:

  • Remove the bad summaries produced for pages that do not contain the information you want to extract
  • Choose the best table structure, or output a JSON summary of the information instead of a Markdown table (used here for demonstration purposes)
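Both improvements could be approached along these lines. This is a minimal sketch under stated assumptions: the per-page summary shape (`{"page": ..., "fields": {...}}`) and the function name `summarize_pages` are hypothetical and do not come from the repository.

```python
import json
from typing import Any, Dict, Iterable, List


def summarize_pages(
    page_summaries: Iterable[Dict[str, Any]],
    required_fields: Iterable[str],
) -> str:
    """Drop pages with none of the wanted fields and return a JSON summary."""
    wanted = set(required_fields)
    kept: List[Dict[str, Any]] = []
    for page in page_summaries:
        # Keep only the wanted fields that actually have a value.
        fields = {
            key: value
            for key, value in page["fields"].items()
            if key in wanted and value not in (None, "")
        }
        # A page with no usable field is treated as a bad summary and skipped.
        if fields:
            kept.append({"page": page["page"], "fields": fields})
    return json.dumps(kept, indent=2, ensure_ascii=False)


# Hypothetical per-page summaries, e.g. parsed from the LLM's answers.
pages = [
    {"page": 1, "fields": {"total": "42 EUR", "date": ""}},
    {"page": 2, "fields": {"total": None}},
]
print(summarize_pages(pages, ["total", "date"]))
```

A JSON summary like this is easier to validate and post-process than a Markdown table, at the cost of being less readable in a demo.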

Licence

This project is licensed under the terms of the Apache 2.0 license.
