RAG-based systems are a powerful approach to extending the context of LLM requests with additional relevant information. However, simply chunking document text into sentences or paragraphs is not enough to preserve the full context of the information.
Looking at a research paper, for example, it can make a huge difference whether a statement like "The neural network achieved an accuracy of 95% on the test set" appears under the "Related Work" or the "Results" section. In the Related Work section, it means that this level of accuracy has already been achieved in earlier studies. When the same statement appears in the Results section, it carries a different significance: it represents a finding of the current study and suggests that the researchers have developed a new or improved method.
In this repo we show how you can extend RAG systems with additional semantics by using layout-aware preprocessing. We utilize Amazon Textract's Layout feature, which extracts content from your document while maintaining its layout and reading order. The Textract Layout feature can detect the following elements (a boto3 sketch follows the list):
- Title – The main title of the document.
- Header – Text located in the top margin of the document.
- Footer – Text located in the bottom margin of the document.
- Section Title – The titles below the main title that represent sections in the document.
- Page Number – The page number of the document.
- List – Any information grouped together in list form.
- Figure – Indicates the location of an image in a document.
- Table – Indicates the location of a table in the document.
- Key Value – Indicates the location of form key-value pairs in a document.
- Text – Text that is typically present as part of paragraphs in documents. It is a catch-all for text that is not contained in the other elements.
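As a quick look at how these elements surface in the raw API response, here is a minimal boto3 sketch. It assumes configured AWS credentials and a local single-page image `document.png` (a hypothetical file name); layout elements come back as blocks whose `BlockType` starts with `LAYOUT_`.

```python
import boto3
from collections import Counter

textract = boto3.client("textract")

with open("document.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["LAYOUT"],  # enable layout analysis
    )

# Layout elements are returned as blocks such as LAYOUT_TITLE,
# LAYOUT_SECTION_HEADER, LAYOUT_TABLE, and LAYOUT_TEXT.
layout_counts = Counter(
    block["BlockType"]
    for block in response["Blocks"]
    if block["BlockType"].startswith("LAYOUT_")
)
print(layout_counts)
```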
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model so that it references an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to keeping LLM output relevant, accurate, and useful in various contexts.
The following diagram gives a detailed overview of how RAG (Retrieval-Augmented Generation) works. The image is based on [1].
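To make the flow concrete, here is a deliberately tiny, self-contained illustration of the retrieve-augment-generate loop. The keyword-overlap `retrieve()` and the stubbed `generate()` are placeholders for a real vector store and LLM call (e.g. via Amazon Bedrock), not part of any specific library.

```python
# Toy RAG loop: retrieve relevant chunks, build an augmented prompt, generate.
def retrieve(question, chunks, k=2):
    q_words = set(question.lower().split())
    # Rank chunks by how many query words they share (stand-in for vector search).
    return sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))[:k]

def generate(prompt):
    return f"<LLM answer based on a {len(prompt)}-character prompt>"  # stub LLM call

chunks = [
    "Results: The neural network achieved an accuracy of 95% on the test set.",
    "Related Work: Earlier studies reported 95% accuracy on similar benchmarks.",
]
question = "What accuracy did the neural network achieve?"
context = "\n".join(retrieve(question, chunks))
print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```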
To improve information retrieval results, the semantic and hierarchical structure of a document should also be taken into account during the chunking step.
Documents often contain various elements like headings, paragraphs, tables, and lists that convey semantic meaning. Traditional chunking methods, which typically break text into fixed-size segments, can lead to a loss of context and meaning. Layout-aware preprocessing seeks to preserve the relationships between these elements by chunking them based on their logical structure rather than arbitrary token counts.
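Below is a minimal sketch of this idea, assuming the `amazon-textract-textractor` package, configured AWS credentials, and a hypothetical input file `paper.pdf`: linearize the document with heading prefixes so section titles survive extraction, then split on those headings instead of on token counts.

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(region_name="us-east-1")
document = extractor.analyze_document(
    file_source="paper.pdf",  # hypothetical input document
    features=[TextractFeatures.LAYOUT],
    save_image=False,
)

# Keep titles and section headers visible in the linearized text.
config = TextLinearizationConfig(
    title_prefix="# ",
    section_header_prefix="## ",
    hide_figure_layout=True,
)
text = document.get_text(config=config)

# Chunk on section headers so every chunk keeps its own section title.
chunks, current = [], []
for line in text.splitlines():
    if line.startswith("## ") and current:
        chunks.append("\n".join(current))
        current = []
    current.append(line)
if current:
    chunks.append("\n".join(current))
```

Because every chunk now starts with its section title, a retriever can distinguish a 95% accuracy claim in "Related Work" from the same claim in "Results".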
In this repo you will be able to explore the following approaches:
- Try the interactive Textract Demo with layout visualization in the AWS Console
- Utilize the LangChain AmazonTextractPDFLoader (sample notebook); a minimal sketch follows this list
- Utilize the Amazon Textract Textractor library (sample notebook)
- Use the Textract API directly via the AWS SDK
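For the LangChain route, a minimal sketch, assuming the `langchain-community` and `amazon-textract-caller` packages, configured AWS credentials, and a hypothetical single-page `example.pdf`:

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

# Local single-page files work with the synchronous Textract API;
# multi-page PDFs need an s3:// URI instead.
loader = AmazonTextractPDFLoader("example.pdf")
documents = loader.load()
print(documents[0].page_content[:200])
```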
Additional Demos and Examples:
- Converting PDF to HTML, 02-textractor.ipynb
- Converting PDF to Markdown, 02-textractor.ipynb
- Layout-aware Chunking, incl. Figures, 02-textractor.ipynb
- Q&A on Tabular Data, 03-tabular-data-qna.ipynb, documentation (see the sketch after this list)
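For the tabular Q&A demo, the rough idea is sketched below: extract a table with Textract's TABLES feature, render it as CSV, and embed it in an LLM prompt. The file name and question are hypothetical, and the final LLM call is left out.

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(region_name="us-east-1")
document = extractor.analyze_document(
    file_source="report.png",  # hypothetical input with a revenue table
    features=[TextractFeatures.TABLES],
)

# Convert the first detected table to CSV (requires pandas) and embed it
# in a prompt for any LLM of your choice.
table_csv = document.tables[0].to_pandas().to_csv(index=False)
prompt = (
    "Using only the table below, answer the question.\n\n"
    f"{table_csv}\n"
    "Question: What was the total revenue in 2022?"
)
```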
Additional Resources