Running and deploying this application
This repo includes a multi-collection RAG pattern implementation using a C# minimal API and Semantic Kernel (SK). Through interfaces and configuration, SK supports different database connectors such as Azure Search, PostgreSQL, DuckDB, Redis, volatile memory (an in-RAM DB), and others. Because this implementation is multi-collection, each collection can represent a different entity such as a customer, a business unit, or an area.
There are areas of concern that need to be taken into consideration in RAG patterns, such as:

- Ingestion
  - Managing the sources (text, PDFs, images, etc.).
  - Extracting the text from the sources.
  - Optionally keeping track of the source locations (to quote references).
- Text chunking (or smart chunking)
  - Chunking large text sources into smaller pieces.
- Embedding and vector DB storage
  - Embedding the text chunks (that is, converting the text into a numerical vector representation).
  - Saving the chunks in a vector DB. In SK, the text and its embedding together are called a memory.
- Working with token limits (see the sketch after this list)
  - Token limits apply to both the LLM and the embedding models.
  - In OpenAI GPT models, the token limit covers both the prompt and the completion. In other words, if the token limit is 2048 and the prompt is 1800 tokens, the response can be at most 248 tokens.
- Processing prompts and completions
  - Turning the query into an embedding.
  - Comparing the query embedding against the vector DB embeddings, honoring the requested relevance threshold and result limit.
  - Using the text of the top relevant results to augment the prompt.
  - Sending the prompt for completion with the original query and the augmented context.
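The token-budget arithmetic above is worth making concrete. The sketch below estimates tokens with a rough 4-characters-per-token heuristic (an assumption, not an exact count; a real implementation would use a tokenizer matched to the target model) and computes the remaining completion budget:

```csharp
// A minimal sketch of budgeting tokens across the prompt and completion.
// The 4-characters-per-token ratio is a rough heuristic, not an exact count.
const int ModelTokenLimit = 2048; // assumption: the model's combined limit

int EstimateTokens(string text) => text.Length / 4;

int MaxCompletionTokens(string prompt) =>
    Math.Max(0, ModelTokenLimit - EstimateTokens(prompt));

// A prompt estimated at 1800 tokens leaves at most 248 tokens
// for the completion under a 2048-token limit.
Console.WriteLine(MaxCompletionTokens(new string('x', 7200))); // prints 248
```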
In SK, a memory is an object that includes, among other things, an ID (which could be a URL to the original document), text (generally a text chunk), and a text embedding. This API allows you to ingest, recall, query, and delete SK memories.
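As a rough illustration, here is how saving and recalling a memory might look with SK's `ISemanticTextMemory` abstraction. This is a sketch, assuming a memory instance already configured with a vector DB connector and an embedding service; the collection name, text, and ID are illustrative:

```csharp
using Microsoft.SemanticKernel.Memory;

// A minimal sketch of saving and recalling an SK memory.
async Task SaveAndRecallAsync(ISemanticTextMemory memory)
{
    // Save: SK embeds the text and stores it under the given ID.
    await memory.SaveInformationAsync(
        collection: "customers",
        text: "Contoso renewed their support contract in 2023.",
        id: "https://example.com/docs/contoso.txt");

    // Recall: SK embeds the query and returns the most relevant memories.
    await foreach (var result in memory.SearchAsync(
        collection: "customers",
        query: "When did Contoso renew their contract?",
        limit: 3,
        minRelevanceScore: 0.7))
    {
        Console.WriteLine($"{result.Relevance:F2} {result.Metadata.Text}");
    }
}
```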
```mermaid
graph LR;
Frontend<--Query<br/>Result-->Backend
Backend<--Query<br/>Limits<br/>Relevance<br/>Results-->VectorDB(Multi-Collection<br/>Vector DB)
Backend<--Augmented Prompt<br/>Embedding<br/>Completion-->OpenAI
Backend<--file name<br/>and chunk-->Ingest
Ingest<--Text-->Data[Files<br/>in folder]
```
```mermaid
graph LR;
subgraph Ingestion
A(Text Extraction)-->B(Text Chunking)--Tokenizer-->E(Text Embedding)--Text Chunk<br/>and Vector-->D(Store<br/>in VectorDB)
end;
subgraph Grounding
Query--Embed Query-->Search
Search(Search<br/>the VectorDB)--Results-->Prompt(Augment<br/>Prompt)
end;
Ingestion-->Grounding
subgraph Completion
AP(Augmented<br/>Prompt)--POST-->SCO(GPT<br/>Completion)
end;
Grounding-->Completion
```
Ingestion is the process of extracting the text from your source files, chunking it, vectorizing the chunks, and saving the chunks to a vector database (sometimes this is called memorization).
The `ingestion` project is a C# console application that reads the files in the `data/` folder and extracts the text from all the text files.
Note: I kept this stage simple by using only text files, but the process can be more complex, extracting text from different file types such as PDFs and Office files.
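The extraction step itself can be very small. A minimal sketch, assuming plain text files in the `data/` folder (the repo's console app may differ in the details):

```csharp
// Read every text file in the data/ folder and report its size.
foreach (var file in Directory.GetFiles("data", "*.txt"))
{
    var text = await File.ReadAllTextAsync(file);
    Console.WriteLine($"{Path.GetFileName(file)}: {text.Length} characters");
}
```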
Text chunking is a technique that divides a text into smaller units, such as sentences, phrases, or words. The `ingestion` console application extracts the text from the files, chunks the text based on the `chunk_size` constant using Semantic Kernel's chunker functions, and finally sends API requests to the `backend` to POST the file name, file URL, and chunk text.
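SK ships chunking helpers in `Microsoft.SemanticKernel.Text`. A minimal sketch using its `TextChunker`; the chunk size and line size below are illustrative, not the repo's actual constants:

```csharp
using Microsoft.SemanticKernel.Text;

const int ChunkSize = 1024; // illustrative max tokens per chunk

List<string> Chunk(string text)
{
    // Split into token-bounded lines first, then group the lines
    // into paragraphs no larger than ChunkSize tokens.
    var lines = TextChunker.SplitPlainTextLines(text, maxTokensPerLine: 128);
    return TextChunker.SplitPlainTextParagraphs(lines, maxTokensPerParagraph: ChunkSize);
}
```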
During the final stage, the `ingestion` application sends a POST request to the server's `/api/gpt/memory` endpoint. This endpoint takes the received text chunk and the file's URL, vectorizes the text using the OpenAI ADA embedding endpoint, and finally stores the text chunk, file name, file URL, and vector in a vector database.
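The client side of that request could look like the sketch below. The payload shape and base address are assumptions based on the description above, not the repo's exact contract:

```csharp
using System.Net.Http.Json;

// Post one chunk to the backend's memory endpoint.
var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

var payload = new
{
    collection = "customers",
    fileName = "contoso.txt",
    fileUrl = "https://example.com/docs/contoso.txt",
    text = "Contoso renewed their support contract in 2023."
};

var response = await http.PostAsJsonAsync("/api/gpt/memory", payload);
response.EnsureSuccessStatusCode();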
Important: the prompt is augmented with the text chunks, not the vectors. The vectors are only used to find the relevant chunks.
This application can work with multiple collections. A collection can be a logical boundary for RAG resources. In other words, you could have RAG resources per company (as in a SaaS application), per OU to separate content from different areas of a company, etc. Every API endpoint carries the `collection` parameter.
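On the server side, a minimal API handler might carry the collection like this. A sketch only: the repo's actual routes, request types, and DI registrations may differ, and `MemoryRequest` is a hypothetical shape:

```csharp
using Microsoft.SemanticKernel.Memory;

var builder = WebApplication.CreateBuilder(args);
// Assumes ISemanticTextMemory is registered with the DI container.
var app = builder.Build();

app.MapPost("/api/gpt/memory", async (MemoryRequest request, ISemanticTextMemory memory) =>
{
    // The collection scopes the memory to one logical boundary (e.g., a customer).
    await memory.SaveInformationAsync(request.Collection, request.Text, request.FileUrl);
    return Results.Ok();
});

app.Run();

record MemoryRequest(string Collection, string FileName, string FileUrl, string Text);
```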
During this stage, the user submits a query using the `frontend`. The `frontend`, in turn, sends a POST request to the server's `/api/gpt/query` endpoint. The API vectorizes the query and compares this vector against each chunk in the vector database.
Chunks with high relevance are returned and used to augment the prompt together with the initial query. The POST query payload carries a response limit and a minimum relevance score, which control how many chunks are returned and how relevant they must be.
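A sketch of that recall step, assuming an `ISemanticTextMemory` instance, with the limit and minimum relevance taken from the POST payload:

```csharp
using System.Text;
using Microsoft.SemanticKernel.Memory;

// Build the augmented context from the most relevant memories.
async Task<string> BuildAugmentedContextAsync(
    ISemanticTextMemory memory, string collection, string query,
    int limit, double minRelevance)
{
    var context = new StringBuilder();

    // SK embeds the query and returns only chunks at or above minRelevance.
    await foreach (var result in memory.SearchAsync(collection, query, limit, minRelevance))
    {
        context.AppendLine(result.Metadata.Text);
    }

    return context.ToString();
}
```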
What is Grounding? "Grounding is the process of using large language models (LLMs) with information that is use-case specific, relevant, and not available as part of the LLM's trained knowledge."
The augmented prompt is submitted to the OpenAI GPT endpoint for completion, and finally, the results of the completion are rendered to the user in the `frontend`.
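A sketch of the completion step using SK's chat abstraction; the service is assumed to be resolved from the kernel, and the prompt wording is illustrative:

```csharp
using Microsoft.SemanticKernel.ChatCompletion;

// Send the augmented context plus the original query for completion.
async Task<string> CompleteAsync(IChatCompletionService chat, string query, string context)
{
    var history = new ChatHistory();
    history.AddSystemMessage($"Answer the question using only this context:\n{context}");
    history.AddUserMessage(query);

    var reply = await chat.GetChatMessageContentAsync(history);
    return reply.Content ?? string.Empty;
}
```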
- Review the extracted text and apply cleanup and reformatting if necessary
- Use the playground to iterate over different prompts (prompt engineering)
- Take advantage of your model's token limits
  - Be mindful to include the completion tokens in this analysis
- Test different combinations of chunking logic and chunk sizes
  - A good starting point is 1024 tokens
- Involve SMEs in reviewing the quality and context of recalled memories against the queries
- Apply quality and RAI baselines to the expected results