The RAG module enables the creation of a modular RAG inference pipeline for your indexed multimodal documents, using two inference modes:
- API: Creates a server hosting the pipeline
- Local: Runs the inference locally (:warning: inference can be slow when running local models :warning:)

You can customize various parts of the pipeline by defining a RAG inference configuration file.
Here is a minimal example of creating a RAG pipeline hosted through a LangServe server.
- Create your RAG inference config file:

  ```yaml
  # RAG Config
  rag:
    # LLM Config
    llm:
      llm_name: "gpt-4o-mini" # Anything supported
      max_new_tokens: 100
      temperature: 0.8
    # Retriever Config
    retriever:
      db:
        uri: ./proc_demo.db
      hybrid_search_weight: 0.5
      k: 5
    # Prompt Args
    system_prompt: "Answer the question using the context.\n\nContext: {context}"

  # Mode Config
  mode: api
  mode_args:
    endpoint: '/rag'
    port: '8000'
    host: 'localhost'
  ```
- Start your RAG pipeline using the `run_rag.py` script and your config file:

  ```bash
  python src/mmore/rag/run_rag.py --config_file /path/to/config.yaml
  ```
- Query the server like any other LangServe server:

  ```bash
  curl --location --request POST http://localhost:8000/rag/invoke \
    -H 'Content-Type: application/json' \
    -d '{
      "input": {
        "input": "What is Meditron?",
        "collection_name": "med_docs"
      }
    }'
  ```

  ```bash
  curl --location --request GET http://localhost:8000/rag/input_schema \
    -H 'Content-Type: application/json'
  ```
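If you prefer querying the endpoint from Python, here is an equivalent of the `invoke` call above using the `requests` library (a minimal sketch, assuming the host, port, and endpoint from the config above):

```python
import requests

# Same payload as the curl example: the query and the target collection
payload = {
    "input": {
        "input": "What is Meditron?",
        "collection_name": "med_docs",
    }
}

# POST to the /invoke route exposed by the LangServe server
response = requests.post("http://localhost:8000/rag/invoke", json=payload)
response.raise_for_status()

# LangServe returns the pipeline result under the "output" key
print(response.json()["output"])
```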
See `examples/rag` for other use cases.
The RAG pipeline decomposes into two main modules:

- The `Retriever`, which retrieves multimodal documents from the database.
- The `LLM`, which wraps different types of multimodal-capable LLMs.
Here is an example of how to use the `Retriever` module on its own. Note that it assumes you have already created a database using the indexer module.
- Create a config file:

  ```yaml
  db:
    uri: ./demo.db
    name: db_name
  hybrid_search_weight: 0.5
  k: 5
  ```
- Retrieve from the vector store using the `Retriever` class:

  ```python
  from src.mmore.rag.retriever import Retriever

  # Create the Retriever
  retriever = Retriever.from_config('/path/to/your/retriever_config.yaml')

  # Retrieve the top 3 documents using a hybrid approach (i.e. dense + sparse embeddings)
  retriever.retrieve(
      'What is Meditron?',
      k=3,
      collection_name="my_docs",
      search_type="hybrid"  # Options: "dense", "sparse", "hybrid"
  )
  ```
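The results can then be inspected directly. A minimal sketch, assuming `retrieve` returns LangChain `Document` objects (consistent with the `BaseRetriever` note further below):

```python
docs = retriever.retrieve(
    'What is Meditron?',
    k=3,
    collection_name="my_docs",
    search_type="hybrid",
)

# Assumption: each result is a LangChain Document, with text in .page_content
# and source information in .metadata
for rank, doc in enumerate(docs, start=1):
    print(f"[{rank}] {doc.page_content[:200]}")
    print(f"    metadata: {doc.metadata}")
```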
Here is an example of how to use the `LLM` module on its own.
- Create a config file:

  ```yaml
  llm_name: gpt-4o-mini
  max_new_tokens: 150
  temperature: 0.7
  ```
- Query the LLM:

  ```python
  from src.mmore.rag.llm import LLM

  # Create the LLM
  llm = LLM.from_config('/path/to/your/llm_config.yaml')

  # Create your messages
  messages = [
      (
          "system",
          "You are a helpful assistant that translates English to French. Translate the user sentence.",
      ),
      ("human", "I love Meditron."),
  ]

  # Invoke the LLM on the messages
  llm.invoke(messages)
  ```
Our RAG pipeline is built to take full advantage of LangChain abstractions, providing compatibility with all the components LangChain offers.
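For instance, since the retriever and the LLM behave like standard LangChain components, they can be composed with the LangChain Expression Language (LCEL). The snippet below is a minimal sketch reusing the `Retriever` and `LLM` classes from the examples above; it is not the exact chain built by `run_rag.py`, and depending on your configuration the retriever may need extra arguments such as a collection name:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

from src.mmore.rag.retriever import Retriever
from src.mmore.rag.llm import LLM

retriever = Retriever.from_config('/path/to/your/retriever_config.yaml')
llm = LLM.from_config('/path/to/your/llm_config.yaml')

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using the context.\n\nContext: {context}"),
    ("human", "{input}"),
])

def format_docs(docs):
    # Join the retrieved documents into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

# retrieve -> format context -> fill prompt -> generate -> parse to string
chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is Meditron?"))
```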
Our retriever is a LangChain `BaseRetriever`. If you want to create a custom retriever (e.g. a GraphRetriever), you can simply make it inherit from this class and use it as described in our examples.
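As an illustration, a custom retriever could look like the following. This is a minimal sketch of the LangChain `BaseRetriever` interface; `MyGraphRetriever` and its toy matching logic are hypothetical, not part of this repository:

```python
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class MyGraphRetriever(BaseRetriever):
    """Hypothetical custom retriever backed by an in-memory list of documents."""

    documents: List[Document]
    k: int = 5

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # Toy relevance: keep documents that share at least one word with the query.
        # A real implementation would traverse a graph or query a vector store.
        query_words = set(query.lower().split())
        hits = [
            doc for doc in self.documents
            if query_words & set(doc.page_content.lower().split())
        ]
        return hits[: self.k]


# Usage: behaves like any other LangChain retriever
retriever = MyGraphRetriever(
    documents=[Document(page_content="Meditron is a suite of open medical LLMs.")]
)
print(retriever.invoke("What is Meditron?"))
```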
Our LLMs inherit from LangChain's `BaseChatModel` base class. If you want to create a custom LLM, you can simply make it inherit from this class and use it as described in our examples.

⚠️ Note that we support HuggingFace Hub models, so a simpler solution is to push your model to the Hub and use the class as defined.
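If you do need a fully custom chat model, the `BaseChatModel` interface looks roughly as follows. This is a minimal sketch with a hypothetical `EchoChatModel` that simply echoes the last message back; it is not part of this repository:

```python
from typing import Any, List, Optional

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.outputs import ChatGeneration, ChatResult


class EchoChatModel(BaseChatModel):
    """Hypothetical chat model that echoes the last message (for illustration only)."""

    @property
    def _llm_type(self) -> str:
        return "echo-chat-model"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        # A real model would call an API or run local inference here.
        reply = AIMessage(content=f"You said: {messages[-1].content}")
        return ChatResult(generations=[ChatGeneration(message=reply)])


# Usage: behaves like any other LangChain chat model
llm = EchoChatModel()
print(llm.invoke([("human", "I love Meditron.")]).content)
```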