Add new RAG pipeline project (#97)

* first commit, new llm project
* add indexing functionality
* finish basic pipeline functionality
* update llm_utils and format
* refactored the url scraper + utils
* refactoring part 2
* fix DB update functionality
* add option to switch out the llm within the CLI
* use litellm and drop garbage logs
* formatting
* remove unused title + url
* rip out langchain completely
* error handling and debug statements
* add code inspo acknowledgements
* add and update docstrings
* remove unused code and use zenml urls
* use smaller embedding model
* update the dimensionality to match the new embedding model
* no cache for embeddings generation
* fix constant
* visualise embeddings
* tiny tweaks to params
* add images
* update pipeline code to abstract out DB creds
* add images
* final README updates
* add RAG pipeline image
* formatting
* add super simple RAG pipeline
* even more basic RAG
* add a third irrelevant question
* Refactor preprocess_text and answer_question functions
zenml-io · Apr 9, 2024 · 14352da
1 parent a7663b2
commit 14352da
Showing 22 changed files with 1,186 additions and 0 deletions.
diff --git a/llm-complete-guide/.assets/rag-pipeline-zenml-cloud.png b/llm-complete-guide/.assets/rag-pipeline-zenml-cloud.png
diff --git a/llm-complete-guide/.assets/supabase-connection-string.png b/llm-complete-guide/.assets/supabase-connection-string.png
diff --git a/llm-complete-guide/.assets/supabase-create-project.png b/llm-complete-guide/.assets/supabase-create-project.png
diff --git a/llm-complete-guide/.assets/tsne.png b/llm-complete-guide/.assets/tsne.png
diff --git a/llm-complete-guide/.assets/umap.png b/llm-complete-guide/.assets/umap.png
diff --git a/llm-complete-guide/.dockerignore b/llm-complete-guide/.dockerignore
@@ -0,0 +1,9 @@
+*
+!/pipelines/**
+!/steps/**
+!/materializers/**
+!/evaluate/**
+!/finetune/**
+!/generate/**
+!/lit_gpt/**
+!/scripts/**
diff --git a/llm-complete-guide/LICENSE b/llm-complete-guide/LICENSE
@@ -0,0 +1,15 @@
+Apache Software License 2.0
+
+Copyright (c) ZenML GmbH 2024. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/llm-complete-guide/README.md b/llm-complete-guide/README.md
@@ -0,0 +1,144 @@
+# 🦜 Production-ready RAG pipelines for chat applications
+
+This project showcases how you can work up from a simple RAG pipeline to a more complex setup that
+involves finetuning embeddings, reranking retrieved documents, and even finetuning the
+LLM itself. We'll do this all for a use case relevant to ZenML: a question
+answering system that can provide answers to common questions about ZenML. This
+will help you understand how to apply the concepts covered in this guide to your
+own projects.
+
+![](.assets/rag-pipeline-zenml-cloud.png)
+
+This project contains all the code needed to run the full pipelines.
+You can follow along [in our guide](https://docs.zenml.io/user-guide/llmops-guide/) to understand the decisions and tradeoffs
+behind the pipeline and step code contained here. You'll build a solid understanding of how to leverage
+LLMs in your MLOps workflows using ZenML, enabling you to build powerful,
+scalable, and maintainable LLM-powered applications.
+
+To follow along with the guide, you'll need a PostgreSQL database to store the
+embeddings; full instructions for setting that up are provided
+below.
+
+## 🙏🏻 Inspiration and Credit
+
+The RAG pipeline relies on code from [this Timescale
+blog](https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/)
+that showcased using PostgreSQL as a vector database. We adapted it for our use
+case and modified it to work with Supabase.
+
+## 🏃 How to run
+
+This project showcases production-ready pipelines so we use some cloud
+infrastructure to manage the assets. You can run the pipelines locally using a
+local PostgreSQL database, but we encourage you to use a cloud database for
+production use cases.
+
+### Connecting to ZenML Cloud
+
+If you run the pipeline using ZenML Cloud you'll have access to the managed
+dashboard which will allow you to get started quickly. We offer a free trial so
+you can try out the platform without any cost. Visit the [ZenML Cloud
+dashboard](https://cloud.zenml.io/) to get started.
+
+### Setting up Supabase
+
+[Supabase](https://supabase.com/) is a cloud provider that offers a hosted PostgreSQL database. It's simple to
+use and has a free tier that should be sufficient for this project. Once you've
+created a Supabase account and organisation, you'll need to create a new
+project.
+
+![](.assets/supabase-create-project.png)
+
+You'll then want to connect to this database instance by getting the connection
+string from the Supabase dashboard.
+
+![](.assets/supabase-connection-string.png)
+
+Use these details to set the environment variables that the pipeline code expects:
+
+```shell
+export ZENML_SUPABASE_USER=<your-supabase-user>
+export ZENML_SUPABASE_HOST=<your-supabase-host>
+export ZENML_SUPABASE_PORT=<your-supabase-port>
+```
+
+You'll want to save the Supabase database password as a ZenML secret so that it
+isn't stored in plaintext. You can do this by running the following command:
+
+```shell
+zenml secret create supabase_postgres_db --password="YOUR_PASSWORD"
+```
+
+### Running the RAG pipeline
+
+To run the pipeline, you can use the `run.py` script. This script will allow you
+to run the pipelines in the correct order. You can run the script with the
+following command:
+
+```shell
+python run.py --basic-rag
+```
+
+This will run the basic RAG pipeline, which scrapes the ZenML documentation and stores the embeddings in the Supabase database.
+
+### Querying your RAG pipeline assets
+
+Once the pipeline has run successfully, you can query the assets in the Supabase
+database using the `--rag-query` flag, and choose which LLM answers the query
+with the `--model` flag.
+
+In order to use the default LLM for this query, you'll need an account
+and an API key from OpenAI specified as another environment variable:
+
+```shell
+export OPENAI_API_KEY=<your-openai-api-key>
+```
+
+When you're ready to make the query, run the following command:
+
+```shell
+python run.py --rag-query "how do I use a custom materializer inside my own zenml steps? i.e. how do I set it? inside the @step decorator?" --model=gpt4
+```
+
+Options for the `--model` flag include:
+
+- `gpt4`
+- `gpt35`
+- `claude3`
+- `claudehaiku`
+
+Note that Claude will require a different API key from Anthropic. See [the
+`litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set this up.
+
+## ☁️ Running with a remote stack
+
+The basic RAG pipeline will run using a local stack, but if you want to improve
+the speed of the embeddings step you might want to consider using a cloud
+orchestrator. Please follow the instructions in [our basic cloud setup guides](https://docs.zenml.io/user-guide/cloud-guide)
+(currently available for [AWS](https://docs.zenml.io/user-guide/cloud-guide/aws-guide) and [GCP](https://docs.zenml.io/user-guide/cloud-guide/gcp-guide)) to learn how you can run the pipelines on
+a remote stack.
+
+## 📜 Project Structure
+
+The project loosely follows [the recommended ZenML project structure](https://docs.zenml.io/user-guide/starter-guide/follow-best-practices):
+
+```
+.
+├── LICENSE                                             # License file
+├── README.md                                           # This file
+├── constants.py                                        # Constants for the project
+├── pipelines
+│   ├── __init__.py                                    
+│   └── llm_basic_rag.py                                # Basic RAG pipeline
+├── requirements.txt                                    # Requirements file
+├── run.py                                              # Script to run the pipelines
+├── steps
+│   ├── __init__.py                                     
+│   ├── populate_index.py                               # Step to populate the index
+│   ├── url_scraper.py                                  # Step to scrape the URLs
+│   ├── url_scraping_utils.py                           # Utilities for the URL scraper
+│   └── web_url_loader.py                               # Step to load the URLs
+└── utils                                              
+    ├── __init__.py
+    └── llm_utils.py                                    # Utilities related to the LLM
+```
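Review note on the README's Supabase setup: a minimal sketch of how the three `ZENML_SUPABASE_*` variables could be gathered into connection settings. This is an illustration only, not the project's code. The helper name `supabase_connection_kwargs`, the default database name `postgres`, and passing the password as a plain argument are all assumptions; the README stores the password as a ZenML secret, and that lookup is not reproduced here.

```python
import os

# Variable names taken from the README; everything else is hypothetical.
REQUIRED_VARS = (
    "ZENML_SUPABASE_USER",
    "ZENML_SUPABASE_HOST",
    "ZENML_SUPABASE_PORT",
)


def supabase_connection_kwargs(password: str, dbname: str = "postgres") -> dict:
    """Collect PostgreSQL connection settings from the README's env vars.

    Hypothetical helper: the `dbname` default and the shape of the returned
    dict are assumptions made for this sketch.
    """
    # Fail early with a readable message if any variable is unset or empty.
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    return {
        "user": os.environ["ZENML_SUPABASE_USER"],
        "host": os.environ["ZENML_SUPABASE_HOST"],
        "port": int(os.environ["ZENML_SUPABASE_PORT"]),
        "dbname": dbname,
        "password": password,
    }
```

The keys match keyword arguments that `psycopg2.connect` accepts, so the dict could be unpacked as `psycopg2.connect(**kwargs)` once the password has been fetched from wherever it is stored.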
diff --git a/llm-complete-guide/constants.py b/llm-complete-guide/constants.py
@@ -0,0 +1,19 @@
+# Vector Store constants
+CHUNK_SIZE = 500
+CHUNK_OVERLAP = 50
+EMBEDDING_DIMENSIONALITY = (
+    384  # Update this to match the dimensionality of the new model
+)
+
+# Scraping constants
+RATE_LIMIT = 5  # Maximum number of requests per second
+
+# LLM Utils constants
+OPENAI_MODEL = "gpt-3.5-turbo"
+EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L12-v2"
+MODEL_NAME_MAP = {
+    "gpt4": "gpt-4-0125-preview",
+    "gpt35": "gpt-3.5-turbo",
+    "claude3": "claude-3-opus-20240229",
+    "claudehaiku": "claude-3-haiku-20240307",
+}
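Review note on `CHUNK_SIZE` and `CHUNK_OVERLAP`: a minimal sketch of how a size/overlap pair like this typically drives text splitting. The step that consumes these constants is not part of this excerpt, so the function name `split_text` and the assumption that both values are measured in characters (rather than tokens) are hypothetical.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def split_text(
    text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP
) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in at least one neighbouring chunk when it
    is shorter than the overlap.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 1,000-character text yields windows starting at offsets 0, 450, and 900, and the last 50 characters of each window repeat as the first 50 of the next.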
diff --git a/llm-complete-guide/materializers/__init__.py b/llm-complete-guide/materializers/__init__.py
@@ -0,0 +1,16 @@
+# Apache Software License 2.0
+#
+# Copyright (c) ZenML GmbH 2024. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
diff --git a/llm-complete-guide/most_basic_rag_pipeline.py b/llm-complete-guide/most_basic_rag_pipeline.py
@@ -0,0 +1,87 @@
+import os
+import re
+import string
+
+from openai import OpenAI
+
+
+def preprocess_text(text):
+    text = text.lower()
+    text = text.translate(str.maketrans("", "", string.punctuation))
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+
+
+def tokenize(text):
+    return preprocess_text(text).split()
+
+
+def retrieve_relevant_chunks(query, corpus, top_n=2):
+    query_tokens = set(tokenize(query))
+    similarities = []
+    for chunk in corpus:
+        chunk_tokens = set(tokenize(chunk))
+        similarity = len(query_tokens.intersection(chunk_tokens)) / len(
+            query_tokens.union(chunk_tokens)
+        )
+        similarities.append((chunk, similarity))
+    similarities.sort(key=lambda x: x[1], reverse=True)
+    return [chunk for chunk, _ in similarities[:top_n]]
+
+
+def answer_question(query, corpus, top_n=2):
+    relevant_chunks = retrieve_relevant_chunks(query, corpus, top_n)
+    if not relevant_chunks:
+        return "I don't have enough information to answer the question."
+
+    context = "\n".join(relevant_chunks)
+    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
+    chat_completion = client.chat.completions.create(
+        messages=[
+            {
+                "role": "system",
+                "content": f"Based on the provided context, answer the following question: {query}\n\nContext:\n{context}",
+            },
+            {
+                "role": "user",
+                "content": query,
+            },
+        ],
+        model="gpt-3.5-turbo",
+    )
+
+    return chat_completion.choices[0].message.content.strip()
+
+
+# Sci-fi themed corpus about "ZenML World"
+corpus = [
+    "The luminescent forests of ZenML World are inhabited by glowing Zenbots that emit a soft, pulsating light as they roam the enchanted landscape.",
+    "In the neon skies of ZenML World, Cosmic Butterflies flutter gracefully, their iridescent wings leaving trails of stardust in their wake.",
+    "Telepathic Treants, ancient sentient trees, communicate through the quantum neural network that spans the entire surface of ZenML World, sharing wisdom and knowledge.",
+    "Deep within the melodic caverns of ZenML World, Fractal Fungi emit pulsating tones that resonate through the crystalline structures, creating a symphony of otherworldly sounds.",
+    "Near the ethereal waterfalls of ZenML World, Holographic Hummingbirds hover effortlessly, their translucent wings refracting the prismatic light into mesmerizing patterns.",
+    "Gravitational Geckos, masters of anti-gravity, traverse the inverted cliffs of ZenML World, defying the laws of physics with their extraordinary abilities.",
+    "Plasma Phoenixes, majestic creatures of pure energy, soar above the chromatic canyons of ZenML World, their fiery trails painting the sky in a dazzling display of colors.",
+    "Along the prismatic shores of ZenML World, Crystalline Crabs scuttle and burrow, their transparent exoskeletons refracting the light into a kaleidoscope of hues.",
+]
+
+# Preprocess the corpus
+corpus = [preprocess_text(sentence) for sentence in corpus]
+
+# Ask questions
+question1 = "What are Plasma Phoenixes?"
+answer1 = answer_question(question1, corpus)
+print(f"Question: {question1}")
+print(f"Answer: {answer1}")
+
+question2 = (
+    "What kinds of creatures live on the prismatic shores of ZenML World?"
+)
+answer2 = answer_question(question2, corpus)
+print(f"Question: {question2}")
+print(f"Answer: {answer2}")
+
+irrelevant_question_3 = "What is the capital of Panglossia?"
+answer3 = answer_question(irrelevant_question_3, corpus)
+print(f"Question: {irrelevant_question_3}")
+print(f"Answer: {answer3}")
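Review note on `retrieve_relevant_chunks`: the similarity it computes is the Jaccard index over token sets (shared tokens divided by all distinct tokens). The sketch below isolates that scoring so it can be run without an OpenAI key. The names `jaccard` and `rank_chunks`, the two-sentence corpus, and the empty-set guard are additions for illustration and are not in the script above.

```python
import re
import string


def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, collapse whitespace, and split into a set."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(re.sub(r"\s+", " ", text).strip().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_chunks(query: str, corpus: list[str]) -> list[tuple[float, str]]:
    """Score every chunk against the query, most similar first."""
    query_tokens = tokenize(query)
    scored = [(jaccard(query_tokens, tokenize(chunk)), chunk) for chunk in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Shortened, hypothetical corpus in the spirit of the one in the script.
corpus = [
    "Plasma Phoenixes soar above the chromatic canyons.",
    "Crystalline Crabs scuttle along the prismatic shores.",
]
for score, chunk in rank_chunks("What are Plasma Phoenixes?", corpus):
    print(f"{score:.2f}  {chunk}")
```

Here the first chunk shares 2 of 9 distinct tokens with the query (about 0.22) and the second shares none. Note that the script always returns the top `top_n` chunks whatever their score, so its `if not relevant_chunks` branch would only be reached with an empty corpus; filtering on a minimum score is one possible way to handle unrelated questions, though common words such as "the" still produce non-zero overlap.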
diff --git a/llm-complete-guide/pipelines/__init__.py b/llm-complete-guide/pipelines/__init__.py
@@ -0,0 +1,17 @@
+# Apache Software License 2.0
+#
+# Copyright (c) ZenML GmbH 2024. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from pipelines.llm_basic_rag import llm_basic_rag
diff --git a/llm-complete-guide/pipelines/llm_basic_rag.py b/llm-complete-guide/pipelines/llm_basic_rag.py
@@ -0,0 +1,26 @@
+from steps.populate_index import (
+    generate_embeddings,
+    index_generator,
+    preprocess_documents,
+)
+from steps.url_scraper import url_scraper
+from steps.web_url_loader import web_url_loader
+from zenml import pipeline
+
+
+@pipeline
+def llm_basic_rag() -> None:
+    """Executes the pipeline to build the index for a basic RAG setup.
+
+    This function performs the following steps:
+    1. Scrapes URLs using the url_scraper function.
+    2. Loads documents from the scraped URLs using the web_url_loader function.
+    3. Preprocesses the loaded documents using the preprocess_documents function.
+    4. Generates embeddings for the preprocessed documents using the generate_embeddings function.
+    5. Generates an index for the embeddings and documents using the index_generator function.
+    """
+    urls = url_scraper()
+    docs = web_url_loader(urls=urls)
+    processed_docs = preprocess_documents(documents=docs)
+    embeddings = generate_embeddings(split_documents=processed_docs)
+    index_generator(embeddings=embeddings, documents=docs)
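Review note on the data flow in `llm_basic_rag`: a plain-Python sketch of how outputs pass from one stage to the next. Every function body here is a toy stand-in, since the real step implementations live in `steps/` and are not shown in this excerpt; only the function names and keyword arguments mirror the pipeline above. The sketch pairs each vector with the chunk it was computed from, whereas the pipeline passes `docs` to `index_generator`; whether that is intended depends on the step code.

```python
def url_scraper() -> list[str]:
    # Stand-in: the real step crawls documentation pages.
    return ["https://docs.zenml.io/"]


def web_url_loader(urls: list[str]) -> list[str]:
    # Stand-in: pretend each URL yields one document of text.
    return [f"Contents of {url}" for url in urls]


def preprocess_documents(documents: list[str]) -> list[str]:
    # Stand-in: cut every document into 12-character pieces.
    size = 12
    return [doc[i : i + size] for doc in documents for i in range(0, len(doc), size)]


def generate_embeddings(split_documents: list[str]) -> list[list[float]]:
    # Stand-in: a 2-number "embedding" (length, vowel count), not a real model.
    return [
        [float(len(chunk)), float(sum(ch in "aeiou" for ch in chunk))]
        for chunk in split_documents
    ]


def index_generator(
    embeddings: list[list[float]], documents: list[str]
) -> list[tuple[str, list[float]]]:
    # Stand-in: pair text with its vector instead of writing to PostgreSQL.
    return list(zip(documents, embeddings))


urls = url_scraper()
docs = web_url_loader(urls=urls)
chunks = preprocess_documents(documents=docs)
vectors = generate_embeddings(split_documents=chunks)
index = index_generator(embeddings=vectors, documents=chunks)
print(f"{len(urls)} url -> {len(docs)} doc -> {len(chunks)} chunks -> {len(index)} index rows")
```

The point of the sketch is the shape of the hand-offs: one vector per chunk, and one index row per (chunk, vector) pair.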
diff --git a/llm-complete-guide/requirements.txt b/llm-complete-guide/requirements.txt
@@ -0,0 +1,13 @@
+zenml
+langchain-community
+ratelimit
+langchain>=0.0.325
+langchain-openai
+pgvector
+psycopg2-binary
+beautifulsoup4
+unstructured
+pandas
+numpy
+sentence-transformers
+litellm