RAG (retrieval-augmented generation) inserts relevant information into the prompt by retrieving it from a vector database.
Run a RAG spot check:

```bash
./scripts/rag_tuning.sh
```
The RAG index will be stored in `04_rag_tuning/rag_model`.
View the RAG results at `data/results/rag_spot_check_results.jsonl`.
Similar to prompt tuning, you can tune the RAG parameters and the surrounding prompt:
- `k`: number of nearest neighbors
- `n`: number of most diverse results
  - Set to be the same as `k` to return all `k` nearest neighbors (one way to select diverse results is sketched after this list)
- `batch_size` (needs to be changed before generating the index): length of each chunk returned
  - Smaller chunks tend to provide more accurate results but can increase computational overhead; larger chunks may improve efficiency but reduce accuracy.
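To make the relationship between `k` and `n` concrete, here is a minimal sketch of one common way to pick `n` diverse results out of the `k` nearest neighbors (greedy max-min selection). The function and the selection strategy are illustrative assumptions, not Lamini's actual logic:

```python
# Illustrative only: greedy max-min diversity selection over the k nearest
# neighbors. Lamini's actual selection logic may differ.
import numpy as np

def select_diverse(neighbors: np.ndarray, n: int) -> list[int]:
    """Pick n mutually diverse rows from a (k, dim) array of normalized embeddings."""
    chosen = [0]  # start with the single nearest neighbor
    while len(chosen) < min(n, len(neighbors)):
        # For each candidate, its similarity to the closest already-chosen vector.
        sims = (neighbors @ neighbors[chosen].T).max(axis=1)
        sims[chosen] = np.inf  # never re-pick an already-chosen row
        chosen.append(int(np.argmin(sims)))  # add the most diverse candidate
    return chosen
```

With `n` equal to `k`, the loop simply returns every nearest neighbor, matching the description above.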
The relevant code:

- `lamini-examples/04_rag_tuning/lamini_rag/lamini_rag_model_stage.py`, lines 38 to 51 (commit d01af0b)
- `lamini-examples/04_rag_tuning/lamini_rag/earnings_call_loader.py`, lines 5 to 11 (commit 6dc1564)
An embedding model converts text into a vector embedding (a list of floating-point numbers). Those numbers are coordinates in a vector space. In a good vector space, similar concepts sit near each other: for example, "King" will be close to "Queen".
Every LLM is an embedding model! For a leaderboard of common embedding models, see https://huggingface.co/spaces/mteb/leaderboard
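You can see this neighborhood effect in a few lines of Python. This sketch uses the sentence-transformers package and the `all-MiniLM-L6-v2` model as an arbitrary example, not a recommendation:

```python
# Illustrative check that related concepts land near each other in embedding space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one example model from the leaderboard
vecs = model.encode(["King", "Queen", "Spreadsheet"], normalize_embeddings=True)

# Cosine similarity is the dot product of normalized vectors. King-Queen should
# score noticeably higher than King-Spreadsheet.
print("King vs Queen:      ", vecs[0] @ vecs[1])
print("King vs Spreadsheet:", vecs[0] @ vecs[2])
```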
The basic RAG pipeline (a full sketch follows these steps):

1. Convert your data into chunks.
2. Run each chunk through an embedding model. (Note that this is expensive because it calls an LLM.)
3. Store the embedding vectors in an index (e.g. a list).
4. Compute the embedding of your query.
5. Look up the most relevant matches in the index.
6. Insert them into the prompt.
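Here is a minimal end-to-end sketch of those six steps. It assumes the sentence-transformers package and uses a plain in-memory array as the index; the chunk size, model, and prompt format are illustrative choices, not what the Lamini scripts use:

```python
# Illustrative end-to-end RAG loop; not the Lamini implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text, size=200):
    """Step 1: split text into fixed-size character chunks (akin to batch_size)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

documents = ["...earnings call transcript...", "...another transcript..."]
chunks = [c for doc in documents for c in chunk(doc)]

# Steps 2-3: embed every chunk and store the vectors in an index.
index = embedder.encode(chunks, normalize_embeddings=True)  # shape: (num_chunks, dim)

# Steps 4-5: embed the query and look up the k nearest neighbors.
query = "What was revenue growth last quarter?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]
k = 3
scores = index @ query_vec  # cosine similarity, since the vectors are normalized
top_k = [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Step 6: insert the retrieved chunks into the prompt.
prompt = "Answer using this context:\n" + "\n".join(top_k) + f"\n\nQuestion: {query}"
print(prompt)
```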
Consider more advanced optimizations described here: https://docs.google.com/presentation/d/118e4WWR4eWViJ_dTzQ5V3wwa_Eh95e5TQVklQz8hR1A/edit?usp=sharing