demos/rag: Introduce GPU-based Question-Answering
Introduce a new GPU variant of the Question-Answering demo, mirroring its CPU counterpart but leveraging GPUs for both the Embeddings model and the LLM model Inference Services (ISVCs). Using KServe with the Triton Inference Server backend significantly boosts performance. Triton itself supports a range of backends, from simple Python scripts to TensorRT engines. The LLM ISVC runs the Llama 2 7B variant on the TensorRT-LLM backend. The Embeddings model ISVC runs BGE-M3, fine-tuned with data from the EzUA, EzDF, MLDE, and MLDM docs, ensuring optimized response accuracy and speed.

Signed-off-by: Dimitris Poulopoulos <[email protected]>
Dimitris Poulopoulos committed Feb 29, 2024
1 parent b7b9e64 · commit 6a77763
Showing 55 changed files with 77,368 additions and 0 deletions.
406 changes: 406 additions & 0 deletions
demos/rag-demos/question-answering-gpu/01.create-vectorstore.ipynb
Large diffs are not rendered by default.
486 changes: 486 additions & 0 deletions
demos/rag-demos/question-answering-gpu/02.serve-vectorstore.ipynb
Large diffs are not rendered by default.
225 changes: 225 additions & 0 deletions
demos/rag-demos/question-answering-gpu/03.document-prediction.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "972db2df-307f-4492-80c6-e84082d778f2",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Invoking and Testing the Vector Store Inference Service (Optional)\n",
    "\n",
    "Welcome to the third part of the tutorial series on building a question-answering application over a corpus of private\n",
    "documents using Large Language Models (LLMs). In the previous Notebooks, you've transformed unstructured text data into\n",
    "structured vector embeddings, stored them in a Vector Store, deployed an Inference Service (ISVC) to serve the Vector Store,\n",
    "and deployed the fine-tuned embeddings model using KServe and Triton.\n",
    "\n",
    "In this Notebook, you focus on invoking the Vector Store ISVC you've created and testing its performance. This\n",
    "is an essential step, as it allows you to verify the functionality of your service and observe how it performs in\n",
    "practice. Throughout this Notebook, you construct suitable requests, communicate with the service, and interpret the\n",
    "responses.\n",
    "\n",
    "By the end of this Notebook, you will have gained practical insights into the workings of the Vector Store ISVC and will be\n",
    "well-prepared to integrate it into a larger system, alongside the LLM ISVC that you create in the subsequent Notebook.\n",
    "\n",
    "## Table of Contents\n",
    "\n",
    "1. [Invoke the Inference Service](#invoke-the-inference-service)\n",
    "1. [Conclusion and Next Steps](#conclusion-and-next-steps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "428fd850-d35a-476f-ba05-b11763ddec68",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import getpass\n",
    "import requests\n",
    "import ipywidgets as widgets\n",
    "\n",
    "from IPython.display import display"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1f8bb43-af00-4dae-bf22-dec236bcafe7",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Invoke the Inference Service\n",
    "\n",
    "First, you need to construct the URL you use in the POST request. For this example, you use the V1 inference protocol,\n",
    "described below:\n",
    "\n",
    "| API | Verb | Path | Request Payload | Response Payload |\n",
    "|--------------|------|-------------------------------|-------------------|-----------------------------------|\n",
    "| List Models | GET | /v1/models | | {\"models\": [<model_name>]} |\n",
    "| Model Ready | GET | /v1/models/<model_name> | | {\"name\": <model_name>,\"ready\": $bool} |\n",
    "| Predict | POST | /v1/models/<model_name>:predict | {\"instances\": []}* | {\"predictions\": []} |\n",
    "| Explain | POST | /v1/models/<model_name>:explain | {\"instances\": []}* | {\"predictions\": [], \"explanations\": []} |\n",
    "\n",
    "\\* Payload is optional\n",
    "\n",
    "You want to invoke the `predict` API, so let's use a simple query to test the service. (You also try the `Model Ready`\n",
    "endpoint as a quick health check after constructing the service URL below.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e060904d-a7b8-4835-99d6-f9890b6afbf9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Render a heading for the credentials form\n",
    "heading = widgets.HTML(\"<h2>Credentials</h2>\")\n",
    "display(heading)\n",
    "\n",
    "domain_input = widgets.Text(description='Domain:', placeholder=\"i001ua.tryezmeral.com\")\n",
    "username_input = widgets.Text(description='Username:')\n",
    "password_input = widgets.Password(description='Password:')\n",
    "submit_button = widgets.Button(description='Submit')\n",
    "success_message = widgets.Output()\n",
    "\n",
    "domain = None\n",
    "username = None\n",
    "password = None\n",
    "\n",
    "def submit_button_clicked(b):\n",
    "    global domain, username, password\n",
    "    domain = domain_input.value\n",
    "    username = username_input.value\n",
    "    password = password_input.value\n",
    "    with success_message:\n",
    "        success_message.clear_output()\n",
    "        print(\"Credentials submitted successfully!\")\n",
    "    submit_button.disabled = True\n",
    "\n",
    "submit_button.on_click(submit_button_clicked)\n",
    "\n",
    "# Set margin on the submit button\n",
    "submit_button.layout.margin = '20px 0 20px 0'\n",
    "\n",
    "# Display inputs and button\n",
    "display(domain_input, username_input, password_input, submit_button, success_message)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c546186-ac65-457b-b4fd-bb86767cc3d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "token_url = f\"https://keycloak.{domain}/realms/UA/protocol/openid-connect/token\"\n",
    "\n",
    "data = {\n",
    "    \"username\": username,\n",
    "    \"password\": password,\n",
    "    \"grant_type\": \"password\",\n",
    "    \"client_id\": \"ua-grant\",\n",
    "}\n",
    "\n",
    "# verify=False disables TLS certificate verification, as in the rest of this notebook.\n",
    "token_response = requests.post(token_url, data=data, allow_redirects=True, verify=False)\n",
    "token_response.raise_for_status()  # fail fast if authentication was rejected\n",
    "\n",
    "token = token_response.json()[\"access_token\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "173e2ebd-5e3b-4289-8358-9406ba816921",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "DOMAIN_NAME = \"svc.cluster.local\"\n",
    "NAMESPACE = \"bob\"\n",
    "DEPLOYMENT_NAME = \"vectorstore-predictor\"\n",
    "MODEL_NAME = \"vectorstore\"\n",
    "SVC = f'{DEPLOYMENT_NAME}.{NAMESPACE}.{DOMAIN_NAME}'\n",
    "URL = f\"https://{SVC}/v1/models/{MODEL_NAME}:predict\"\n",
    "\n",
    "print(URL)"
   ]
  },
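  {
   "cell_type": "markdown",
   "id": "5b8c1d2e-7a4f-4e3b-9c60-1f2a3b4c5d6e",
   "metadata": {
    "tags": []
   },
   "source": [
    "Before querying the service, you can optionally confirm that the model reports ready. The next cell is a minimal\n",
    "sketch of the V1 `Model Ready` endpoint from the table above; it reuses the `SVC`, `MODEL_NAME`, and `token`\n",
    "variables defined in the previous cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c9d2e3f-8b50-4f4c-ad71-2a3b4c5d6e7f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Optional health check: GET /v1/models/<model_name> reports model readiness.\n",
    "ready_url = f\"https://{SVC}/v1/models/{MODEL_NAME}\"\n",
    "ready_response = requests.get(\n",
    "    ready_url, headers={\"Authorization\": f\"Bearer {token}\"}, verify=False\n",
    ")\n",
    "print(ready_response.json())  # e.g. {\"name\": \"vectorstore\", \"ready\": true}"
   ]
  },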
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78da091c-9fce-4f91-8382-e5c785bdf24f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = {\n",
    "    \"instances\": [{\n",
    "        \"input\": \"How can I get started with HPE Ezmeral Unified Analytics?\",\n",
    "        \"num_docs\": 4  # number of documents to retrieve\n",
    "    }]\n",
    "}\n",
    "\n",
    "headers = {\"Authorization\": f\"Bearer {token}\"}\n",
    "\n",
    "response = requests.post(URL, json=data, headers=headers, verify=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "461fdac2-cacb-40cc-bf2d-d1548072bb90",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "response.raise_for_status()  # surface HTTP errors before parsing\n",
    "result = json.loads(response.text)[\"predictions\"]\n",
    "result"
   ]
  },
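  {
   "cell_type": "markdown",
   "id": "7d0e3f40-9c61-4a5d-be82-3b4c5d6e7f80",
   "metadata": {
    "tags": []
   },
   "source": [
    "To eyeball the retrieved context, enumerate the predictions. This sketch makes no assumption about the exact\n",
    "schema of each entry, since that depends on the Vector Store predictor; it simply prints each item in order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e1f4051-ad72-4b6e-9f93-4c5d6e7f8091",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Print each retrieved item; the entry schema depends on the Vector Store\n",
    "# predictor, so this dumps whatever the service returned.\n",
    "for i, doc in enumerate(result):\n",
    "    print(f\"--- Document {i + 1} ---\")\n",
    "    print(doc)"
   ]
  },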
  {
   "cell_type": "markdown",
   "id": "0e6c9e0e-6d17-4d15-ba4e-e353cc1cd3c2",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Conclusion and Next Steps\n",
    "\n",
    "Well done! Through this Notebook, you've successfully interacted with and tested the Vector Store ISVC. You've learned\n",
    "how to construct and send requests to the service and how to interpret the responses. This hands-on experience is\n",
    "crucial as it provides a practical understanding of the service's operation, preparing you for real-world applications.\n",
    "\n",
    "In the next Notebook, you extend your question-answering system by creating an ISVC for the LLM. The LLM ISVC works in\n",
    "conjunction with the Vector Store ISVC to provide comprehensive and accurate answers to user queries."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}