Add new RAG pipeline project (#97)
* first commit, new llm project

* add indexing functionality

* finish basic pipeline functionality

* update llm_utils and format

* refactored the url scraper + utils

* refactoring part 2

* fix DB update functionality

* add option to switch out the llm within the CLI

* use litellm and drop garbage logs

* formatting

* remove unused title + url

* rip out langchain completely

* error handling and debug statements

* add code inspo acknowledgements

* add and update docstrings

* remove unused code and use zenml urls

* use smaller embedding model

* update the dimensionality to match the new embedding model

* no cache for embeddings generation

* fix constant

* visualise embeddings

* tiny tweaks to params

* add images

* update pipeline code to abstract out DB creds

* add images

* final README updates

* add RAG pipeline image

* formatting

* add super simple RAG pipeline

* even more basic RAG

* add a third irrelevant question

* Refactor preprocess_text and answer_question functions
strickvl authored Apr 9, 2024
1 parent a7663b2 commit 14352da
Showing 22 changed files with 1,186 additions and 0 deletions.
Binary file added llm-complete-guide/.assets/tsne.png
Binary file added llm-complete-guide/.assets/umap.png
9 changes: 9 additions & 0 deletions llm-complete-guide/.dockerignore
@@ -0,0 +1,9 @@
*
!/pipelines/**
!/steps/**
!/materializers/**
!/evaluate/**
!/finetune/**
!/generate/**
!/lit_gpt/**
!/scripts/**
15 changes: 15 additions & 0 deletions llm-complete-guide/LICENSE
@@ -0,0 +1,15 @@
Apache Software License 2.0

Copyright (c) ZenML GmbH 2024. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
144 changes: 144 additions & 0 deletions llm-complete-guide/README.md
@@ -0,0 +1,144 @@
# 🦜 Production-ready RAG pipelines for chat applications

This project shows how you can work up from a simple RAG pipeline to a more complex setup that
involves finetuning embeddings, reranking retrieved documents, and even finetuning the
LLM itself. We'll do all of this for a use case relevant to ZenML: a
question-answering system that responds to common questions about ZenML. This
will help you understand how to apply the concepts covered in this guide to your
own projects.

![](.assets/rag-pipeline-zenml-cloud.png)

Contained within this project is all the code needed to run the full pipelines.
You can follow along [in our guide](https://docs.zenml.io/user-guide/llmops-guide/) to understand the decisions and tradeoffs
behind the pipeline and step code contained here. You'll build a solid understanding of how to leverage
LLMs in your MLOps workflows using ZenML, enabling you to build powerful,
scalable, and maintainable LLM-powered applications.

You'll need a PostgreSQL database to store the embeddings; full setup
instructions are provided below.

## 🙏🏻 Inspiration and Credit

The RAG pipeline relies on code from [this Timescale
blog](https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/)
that showcased using PostgreSQL as a vector database. We adapted it for our use
case and to work with Supabase.

## 🏃 How to run

This project showcases production-ready pipelines, so we use some cloud
infrastructure to manage the assets. You can run the pipelines locally using a
local PostgreSQL database, but we encourage you to use a cloud database for
production use cases.

### Connecting to ZenML Cloud

If you run the pipeline using ZenML Cloud, you'll have access to the managed
dashboard, which will help you get started quickly. We offer a free trial so
you can try out the platform without any cost. Visit the [ZenML Cloud
dashboard](https://cloud.zenml.io/) to get started.

### Setting up Supabase

[Supabase](https://supabase.com/) is a cloud service that offers a hosted PostgreSQL database. It's simple to
use and has a free tier that should be sufficient for this project. Once you've
created a Supabase account and organisation, you'll need to create a new
project.

![](.assets/supabase-create-project.png)

You'll then want to connect to this database instance by getting the connection
string from the Supabase dashboard.

![](.assets/supabase-connection-string.png)

You'll then use these details to populate some environment variables where the pipeline code expects them:

```shell
export ZENML_SUPABASE_USER=<your-supabase-user>
export ZENML_SUPABASE_HOST=<your-supabase-host>
export ZENML_SUPABASE_PORT=<your-supabase-port>
```

You'll want to save the Supabase database password as a ZenML secret so that it
isn't stored in plaintext. You can do this by running the following command:

```shell
zenml secret create supabase_postgres_db --password="YOUR_PASSWORD"
```
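
As a rough illustration of how these settings could come together, the sketch
below assembles PostgreSQL connection parameters from the three environment
variables above. It is not taken from the project code: the function name
`build_connection_params` is hypothetical, the database name `postgres` is an
assumption about the Supabase default, and the password is passed in as an
argument (it could be read from the `supabase_postgres_db` secret, e.g. with
`Client().get_secret("supabase_postgres_db").secret_values["password"]`; check
the ZenML docs for your version).

```python
import os


def build_connection_params(password: str) -> dict:
    """Collect PostgreSQL connection settings from the environment.

    Hypothetical helper: the real pipeline code may organise this differently.
    """
    return {
        "user": os.environ["ZENML_SUPABASE_USER"],
        "host": os.environ["ZENML_SUPABASE_HOST"],
        # Environment variables are strings, so convert the port explicitly.
        "port": int(os.environ["ZENML_SUPABASE_PORT"]),
        "password": password,
        # Assumption: the Supabase project uses the default "postgres" database.
        "dbname": "postgres",
    }
```

The resulting dictionary uses the keyword names accepted by
`psycopg2.connect`, so it could be passed along as
`psycopg2.connect(**build_connection_params(password))`.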

### Running the RAG pipeline

To run the pipeline, use the `run.py` script, which runs the pipelines in the
correct order. Run it with the following command:

```shell
python run.py --basic-rag
```

This will run the basic RAG pipeline, which scrapes the ZenML documentation and stores the embeddings in the Supabase database.

### Querying your RAG pipeline assets

Once the pipeline has run successfully, you can query the assets in the Supabase
database using the `--rag-query` flag, passing the model you'd like to use for
the LLM via `--model`.

In order to use the default LLM for this query, you'll need an account
and an API key from OpenAI specified as another environment variable:

```shell
export OPENAI_API_KEY=<your-openai-api-key>
```

When you're ready to make the query, run the following command:

```shell
python run.py --rag-query "how do I use a custom materializer inside my own zenml steps? i.e. how do I set it? inside the @step decorator?" --model=gpt4
```

Alternative options for LLMs to use include:

- `gpt4`
- `gpt35`
- `claude3`
- `claudehaiku`

Note that Claude will require a different API key from Anthropic. See [the
`litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set this up.
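
The short names above correspond to the `MODEL_NAME_MAP` entries in
`constants.py`. As a minimal sketch of how a short name might be resolved to
the full model identifier before it is handed to `litellm`, consider the
following. The helper `resolve_model_name` is hypothetical and only illustrates
the lookup; the project's own CLI handling may differ.

```python
# Copied from constants.py in this project.
MODEL_NAME_MAP = {
    "gpt4": "gpt-4-0125-preview",
    "gpt35": "gpt-3.5-turbo",
    "claude3": "claude-3-opus-20240229",
    "claudehaiku": "claude-3-haiku-20240307",
}


def resolve_model_name(short_name: str) -> str:
    """Map a CLI short name to the full model identifier (hypothetical helper)."""
    try:
        return MODEL_NAME_MAP[short_name]
    except KeyError:
        valid = ", ".join(sorted(MODEL_NAME_MAP))
        raise ValueError(
            f"Unknown model '{short_name}'. Choose one of: {valid}"
        ) from None
```

Failing early with the list of valid names keeps a typo in `--model` from
surfacing later as a less obvious provider error.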

## ☁️ Running with a remote stack

The basic RAG pipeline will run using a local stack, but if you want to improve
the speed of the embeddings step you might want to consider using a cloud
orchestrator. Please follow the instructions in [our basic cloud setup guides](https://docs.zenml.io/user-guide/cloud-guide)
(currently available for [AWS](https://docs.zenml.io/user-guide/cloud-guide/aws-guide) and [GCP](https://docs.zenml.io/user-guide/cloud-guide/gcp-guide)) to learn how you can run the pipelines on
a remote stack.

## 📜 Project Structure

The project loosely follows [the recommended ZenML project structure](https://docs.zenml.io/user-guide/starter-guide/follow-best-practices):

```
.
├── LICENSE # License file
├── README.md # This file
├── constants.py # Constants for the project
├── pipelines
│   ├── __init__.py
│   └── llm_basic_rag.py # Basic RAG pipeline
├── requirements.txt # Requirements file
├── run.py # Script to run the pipelines
├── steps
│   ├── __init__.py
│   ├── populate_index.py # Step to populate the index
│   ├── url_scraper.py # Step to scrape the URLs
│   ├── url_scraping_utils.py # Utilities for the URL scraper
│   └── web_url_loader.py # Step to load the URLs
└── utils
    ├── __init__.py
    └── llm_utils.py # Utilities related to the LLM
```
19 changes: 19 additions & 0 deletions llm-complete-guide/constants.py
@@ -0,0 +1,19 @@
# Vector Store constants
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBEDDING_DIMENSIONALITY = (
    384  # Update this to match the dimensionality of the new model
)

# Scraping constants
RATE_LIMIT = 5 # Maximum number of requests per second

# LLM Utils constants
OPENAI_MODEL = "gpt-3.5-turbo"
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L12-v2"
MODEL_NAME_MAP = {
    "gpt4": "gpt-4-0125-preview",
    "gpt35": "gpt-3.5-turbo",
    "claude3": "claude-3-opus-20240229",
    "claudehaiku": "claude-3-haiku-20240307",
}
16 changes: 16 additions & 0 deletions llm-complete-guide/materializers/__init__.py
@@ -0,0 +1,16 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
87 changes: 87 additions & 0 deletions llm-complete-guide/most_basic_rag_pipeline.py
@@ -0,0 +1,87 @@
import os
import re
import string

from openai import OpenAI


def preprocess_text(text):
    # Lowercase, strip punctuation, and collapse runs of whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return text


def tokenize(text):
    return preprocess_text(text).split()


def retrieve_relevant_chunks(query, corpus, top_n=2):
    query_tokens = set(tokenize(query))
    similarities = []
    for chunk in corpus:
        chunk_tokens = set(tokenize(chunk))
        # Jaccard similarity: shared tokens divided by all distinct tokens.
        similarity = len(query_tokens.intersection(chunk_tokens)) / len(
            query_tokens.union(chunk_tokens)
        )
        similarities.append((chunk, similarity))
    # Highest-scoring chunks first; keep only the top_n.
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in similarities[:top_n]]


def answer_question(query, corpus, top_n=2):
    relevant_chunks = retrieve_relevant_chunks(query, corpus, top_n)
    if not relevant_chunks:
        return "I don't have enough information to answer the question."

    # Join the retrieved chunks into a single context string for the prompt.
    context = "\n".join(relevant_chunks)
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": f"Based on the provided context, answer the following question: {query}\n\nContext:\n{context}",
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-3.5-turbo",
    )

    return chat_completion.choices[0].message.content.strip()


# Sci-fi themed corpus about "ZenML World"
corpus = [
    "The luminescent forests of ZenML World are inhabited by glowing Zenbots that emit a soft, pulsating light as they roam the enchanted landscape.",
    "In the neon skies of ZenML World, Cosmic Butterflies flutter gracefully, their iridescent wings leaving trails of stardust in their wake.",
    "Telepathic Treants, ancient sentient trees, communicate through the quantum neural network that spans the entire surface of ZenML World, sharing wisdom and knowledge.",
    "Deep within the melodic caverns of ZenML World, Fractal Fungi emit pulsating tones that resonate through the crystalline structures, creating a symphony of otherworldly sounds.",
    "Near the ethereal waterfalls of ZenML World, Holographic Hummingbirds hover effortlessly, their translucent wings refracting the prismatic light into mesmerizing patterns.",
    "Gravitational Geckos, masters of anti-gravity, traverse the inverted cliffs of ZenML World, defying the laws of physics with their extraordinary abilities.",
    "Plasma Phoenixes, majestic creatures of pure energy, soar above the chromatic canyons of ZenML World, their fiery trails painting the sky in a dazzling display of colors.",
    "Along the prismatic shores of ZenML World, Crystalline Crabs scuttle and burrow, their transparent exoskeletons refracting the light into a kaleidoscope of hues.",
]

# Preprocess the corpus
corpus = [preprocess_text(sentence) for sentence in corpus]

# Ask questions
question1 = "What are Plasma Phoenixes?"
answer1 = answer_question(question1, corpus)
print(f"Question: {question1}")
print(f"Answer: {answer1}")

question2 = (
    "What kinds of creatures live on the prismatic shores of ZenML World?"
)
answer2 = answer_question(question2, corpus)
print(f"Question: {question2}")
print(f"Answer: {answer2}")

irrelevant_question_3 = "What is the capital of Panglossia?"
answer3 = answer_question(irrelevant_question_3, corpus)
print(f"Question: {irrelevant_question_3}")
print(f"Answer: {answer3}")
17 changes: 17 additions & 0 deletions llm-complete-guide/pipelines/__init__.py
@@ -0,0 +1,17 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from pipelines.llm_basic_rag import llm_basic_rag
26 changes: 26 additions & 0 deletions llm-complete-guide/pipelines/llm_basic_rag.py
@@ -0,0 +1,26 @@
from steps.populate_index import (
    generate_embeddings,
    index_generator,
    preprocess_documents,
)
from steps.url_scraper import url_scraper
from steps.web_url_loader import web_url_loader
from zenml import pipeline


@pipeline
def llm_basic_rag() -> None:
    """Executes the pipeline to train a basic RAG model.

    This function performs the following steps:
    1. Scrapes URLs using the url_scraper function.
    2. Loads documents from the scraped URLs using the web_url_loader function.
    3. Preprocesses the loaded documents using the preprocess_documents function.
    4. Generates embeddings for the preprocessed documents using the generate_embeddings function.
    5. Generates an index for the embeddings and documents using the index_generator function.
    """
    urls = url_scraper()
    docs = web_url_loader(urls=urls)
    processed_docs = preprocess_documents(documents=docs)
    embeddings = generate_embeddings(split_documents=processed_docs)
    index_generator(embeddings=embeddings, documents=docs)
13 changes: 13 additions & 0 deletions llm-complete-guide/requirements.txt
@@ -0,0 +1,13 @@
zenml
langchain-community
ratelimit
langchain>=0.0.325
langchain-openai
pgvector
psycopg2-binary
beautifulsoup4
unstructured
pandas
numpy
sentence-transformers
litellm