
Layout-aware document preprocessing for RAG

RAG-based systems are a powerful approach to extending the context of LLM requests with additional relevant information. However, simply chunking document text into sentences or paragraphs is not enough to preserve the full context of that information.

Consider a research paper, for example: it makes a big difference whether a statement like "The neural network achieved an accuracy of 95% on the test set" appears under the "Related Work" or the "Results" section. In the Related Work section it means that this level of accuracy has already been achieved in earlier studies. When the same statement appears in the Results section, it carries a different significance: it reports a finding of the current study and suggests that the researchers have developed a new or improved method.

In this repo we show how you can extend RAG systems with additional semantics using layout-aware preprocessing. We utilize Amazon Textract's Layout feature, which extracts content from a document while preserving its layout and reading order. The Textract Layout feature can detect the following element types (a minimal API sketch follows the list):

  • Title – The main title of the document.
  • Header – Text located in the top margin of the document.
  • Footer – Text located in the bottom margin of the document.
  • Section Title – The titles below the main title that represent sections in the document.
  • Page Number – The page number of the document.
  • List – Any information grouped together in list form.
  • Figure – Indicates the location of an image in a document.
  • Table – Indicates the location of a table in the document.
  • Key Value – Indicates the location of form key-value pairs in a document.
  • Text – Text that is typically present as part of paragraphs in documents. It is a catch-all for text that is not covered by the other elements.
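
As a minimal sketch (not part of this repo's code), the Layout feature can be requested through the standard boto3 Textract client; the file name below is a placeholder:

```python
import boto3

textract = boto3.client("textract")

# analyze_document accepts document bytes for single-page images or PDFs;
# multi-page PDFs require the asynchronous start_document_analysis API.
with open("page.png", "rb") as f:  # placeholder file name
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["LAYOUT"],
    )

# Layout elements are returned as blocks whose BlockType starts with "LAYOUT_",
# e.g. LAYOUT_TITLE, LAYOUT_SECTION_HEADER, LAYOUT_TEXT, LAYOUT_TABLE.
for block in response["Blocks"]:
    if block["BlockType"].startswith("LAYOUT_"):
        print(block["BlockType"], block.get("Confidence"))
```

The text belonging to each layout element is not stored in the layout block itself; it is linked through CHILD relationships to the underlying LINE blocks, which need to be resolved to reconstruct the reading order.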

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

The following diagram gives a detailed overview of how RAG (Retrieval-Augmented Generation) works. The image is based on [1].

The semantic and hierarchical structure of the documents additionally needs to be considered during the chunking step to improve information retrieval results.
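
As a simplified, hypothetical sketch of the retrieval step, the snippet below ranks chunks by cosine similarity to the question; the embed() function is a placeholder for whatever embedding model is used:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return an embedding vector for the given text."""
    raise NotImplementedError

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        score = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        scored.append((score, chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:top_k]]

# The retrieved chunks are then prepended to the prompt that is sent to the LLM.
```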

What is layout-aware preprocessing for RAG?

Documents often contain various elements like headings, paragraphs, tables, and lists that convey semantic meaning. Traditional chunking methods, which typically break text into fixed-size segments, can lead to a loss of context and meaning. Layout-aware preprocessing seeks to preserve the relationships between these elements by chunking them based on their logical structure rather than arbitrary token counts.

Figure: semantic, layout-aware segmentation of a document.
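
As a rough sketch of this idea (assuming the Textract layout blocks have already been extracted and put into reading order; the block dictionaries below are simplified placeholders, not the raw Textract response format), text can be chunked per section and keep the section title it belongs to:

```python
def layout_aware_chunks(blocks: list[dict]) -> list[dict]:
    """Group LAYOUT_TEXT / LAYOUT_LIST content under the preceding section title."""
    chunks = []
    current_section = None
    current_text: list[str] = []
    for block in blocks:  # each block is assumed to look like {"type": ..., "text": ...}
        if block["type"] in ("LAYOUT_TITLE", "LAYOUT_SECTION_HEADER"):
            # A new section starts: flush the chunk collected so far.
            if current_text:
                chunks.append({"section": current_section, "text": " ".join(current_text)})
                current_text = []
            current_section = block["text"]
        elif block["type"] in ("LAYOUT_TEXT", "LAYOUT_LIST"):
            current_text.append(block["text"])
    if current_text:
        chunks.append({"section": current_section, "text": " ".join(current_text)})
    return chunks
```

Each resulting chunk carries its section title as metadata, so a retrieved statement can still be interpreted in the context of the section it came from.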

How to use layout-aware document preprocessing?

In this repo you will be able to explore the following approaches:

Additional Demos and Examples:

Interactive Textract Demo with Layout Visualization in AWS console

Try the interactive Textract Demo for layout analysis in the AWS Console.

Additional Resources
