diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 935556f75..79c6af51c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -26,6 +26,8 @@
       title: Sentence Transformers on AWS Inferentia
     - local: inference_tutorials/stable_diffusion
       title: Generate images with Stable Diffusion models on AWS Inferentia
+    - local: inference_tutorials/qwen2-5-7b-chatbot
+      title: Deploy Qwen 2.5 7B Instruct on AWS Inferentia
     title: Inference Tutorials
 - sections:
   - local: guides/setup_aws_instance
diff --git a/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx b/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx
new file mode 100644
index 000000000..81b7b7aa0
--- /dev/null
+++ b/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx
@@ -0,0 +1,226 @@
+
+# Deploy Qwen 2.5 7B Instruct on AWS Inferentia
+
+*There is a notebook version of this tutorial [here](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb)*.
+
+This guide will detail how to export, deploy and run a **Qwen 2.5 7B Instruct** model on AWS Inferentia.
+
+You will learn how to:
+- set up your AWS instance,
+- export the Qwen 2.5 model to the Neuron format,
+- push the exported model to the Hugging Face Hub,
+- deploy the model and use it in a chat application.
+
+Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.
+
+## 1. Export the Qwen 2.5 model to Neuron
+
+As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),
+models need to be compiled and exported to a serialized format before they can run on Neuron devices.
+
+Fortunately, 🤗 **optimum-neuron** offers an [API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)
+to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.
+
+When exporting the model, we will specify two sets of parameters:
+
+- using *compiler_args*, we specify how many cores the model should be deployed on (each Neuron device has two cores) and with which precision (here *bfloat16*),
+- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that
+*sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.
+
+Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.
+
+For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2.
+
+```python
+from optimum.neuron import NeuronModelForCausalLM
+
+
+compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
+input_shapes = {"batch_size": 32, "sequence_length": 4096}
+model = NeuronModelForCausalLM.from_pretrained(
+    "Qwen/Qwen2.5-7B-Instruct",
+    export=True,
+    **compiler_args,
+    **input_shapes)
+```
+
+This will probably take a while.
+
+Fortunately, you will need to do this only once, because you can save your model and reload it later.
+
+```python
+model.save_pretrained("qwen-2-5-7b-chat-neuron")
+```
+
+Even better, you can push it to the [Hugging Face hub](https://huggingface.co/models).
+
+For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).
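+
+The snippet that follows uses `notebook_login`, which is convenient when running this guide as a notebook. If you are instead working from a plain terminal session or a script, one option is to authenticate programmatically with `huggingface_hub` beforehand. The sketch below is only an illustration: the token string is a placeholder for an access token with write permission created in your Hugging Face account settings.
+
+```python
+from huggingface_hub import login
+
+# Placeholder value: replace with your own access token (write permission).
+login(token="hf_xxx")
+```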
+
+If you are not already logged in on your instance, you will now be prompted for an access token.
+
+```python
+from huggingface_hub import notebook_login
+
+
+notebook_login(new_session=False)
+```
+
+By default, the model will be uploaded to your account (organization equal to your user name).
+
+Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations).
+
+```python
+from huggingface_hub import whoami
+
+
+org = whoami()['name']
+
+repo_id = f"{org}/qwen-2-5-7b-chat-neuron"
+
+model.push_to_hub("qwen-2-5-7b-chat-neuron", repository_id=repo_id)
+```
+
+## 2. Generate text using Qwen 2.5 on AWS Inferentia2
+
+Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).
+
+If, as suggested, you skipped the first section, don't worry: we will use a precompiled model already present on the hub instead.
+
+```python
+from optimum.neuron import NeuronModelForCausalLM
+
+try:
+    model
+except NameError:
+    # Edit this to use another base model
+    model = NeuronModelForCausalLM.from_pretrained('aws-neuron/qwen2-5-7b-chat-neuron')
+```
+
+We will need a *Qwen 2.5* tokenizer to convert the prompt strings to text tokens.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+```
+
+The following generation strategies are supported:
+
+- greedy search,
+- multinomial sampling with top-k and top-p (with temperature).
+
+Most logits pre-processing/filters (such as repetition penalty) are supported.
+
+```python
+inputs = tokenizer("What is deep-learning ?", return_tensors="pt")
+outputs = model.generate(**inputs,
+                         max_new_tokens=128,
+                         do_sample=True,
+                         temperature=0.9,
+                         top_k=50,
+                         top_p=0.9)
+tokenizer.batch_decode(outputs, skip_special_tokens=True)
+```
+
+## 3. Create a chat application using Qwen on AWS Inferentia2
+
+The model expects the prompts to be formatted following a specific template corresponding to the interactions between a *user* role and an *assistant* role.
+
+Each chat model has its own convention for encoding such contents, and we will not go into too much detail in this guide, because we will directly use the [Hugging Face chat templates](https://huggingface.co/blog/chat-templates) corresponding to our model.
+
+The utility function below converts a list of exchanges between the user and the model into a well-formatted chat prompt.
+
+```python
+def format_chat_prompt(message, history, max_tokens):
+    """Convert a history of messages to a chat prompt.
+
+    Args:
+        message (str): the new user message.
+        history (List[str]): the list of user messages and assistant responses.
+        max_tokens (int): the maximum number of input tokens accepted by the model.
+
+    Returns:
+        a `str` prompt.
+    """
+    chat = []
+    # Convert all messages in history to chat interactions
+    for interaction in history:
+        chat.append({"role": "user", "content": interaction[0]})
+        chat.append({"role": "assistant", "content": interaction[1]})
+    # Add the new message
+    chat.append({"role": "user", "content": message})
+    # Generate the prompt, verifying that we don't go beyond the maximum number of tokens
+    for i in range(0, len(chat), 2):
+        # Generate candidate prompt with the last n-i entries
+        prompt = tokenizer.apply_chat_template(chat[i:], tokenize=False)
+        # Tokenize to check if we're over the limit
+        tokens = tokenizer(prompt)
+        if len(tokens.input_ids) <= max_tokens:
+            # We're good, stop here
+            return prompt
+    # We shall never reach this line
+    raise SystemError
+```
+
+We are now equipped to build a simplistic chat application.
+
+We simply store the interactions between the user and the assistant in a list that we use to generate
+the input prompt.
+
+```python
+history = []
+max_tokens = 1024
+
+def chat(message, history, max_tokens):
+    prompt = format_chat_prompt(message, history, max_tokens)
+    # Uncomment the line below to see what the formatted prompt looks like
+    #print(prompt)
+    inputs = tokenizer(prompt, return_tensors="pt")
+    outputs = model.generate(**inputs,
+                             max_length=2048,
+                             do_sample=True,
+                             temperature=0.9,
+                             top_k=50,
+                             repetition_penalty=1.2)
+    # Do not include the input tokens
+    outputs = outputs[0, inputs.input_ids.size(-1):]
+    response = tokenizer.decode(outputs, skip_special_tokens=True)
+    history.append([message, response])
+    return response
+```
+
+To test the chat application, you can for instance use the following sequence of prompts:
+
+```python
+print(chat("What is deep learning ?", history, max_tokens))
+print(chat("Is deep learning a subset of machine learning ?", history, max_tokens))
+print(chat("Is deep learning a subset of supervised learning ?", history, max_tokens))
+```
+
+While very powerful, large language models can sometimes *hallucinate*. We call *hallucinations* generated content that is irrelevant or made up, but presented by the model as if it were accurate. This is a flaw of LLMs and is not a side effect of using them on Trainium / Inferentia.
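+
+Finally, if you would rather interact with the model from a terminal, you can wrap the `chat` function in a minimal read-eval loop. This is only a sketch built on the objects already defined in this guide; type an empty line to stop.
+
+```python
+# Minimal interactive loop around the chat() function defined above.
+history = []  # start a fresh conversation
+while True:
+    message = input("user> ")
+    if not message:
+        break
+    print("assistant>", chat(message, history, max_tokens))
+```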
+
+
diff --git a/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb b/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb
new file mode 100644
index 000000000..d36e91852
--- /dev/null
+++ b/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb
@@ -0,0 +1,387 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "bae23f09",
+   "metadata": {},
+   "source": [
+    "# Deploy Qwen 2.5 7B Instruct on AWS Inferentia\n",
+    "\n",
+    "This guide will detail how to export, deploy and run a **Qwen 2.5 7B Instruct** model on AWS Inferentia.\n",
+    "\n",
+    "You will learn how to:\n",
+    "- set up your AWS instance,\n",
+    "- export the Qwen 2.5 model to the Neuron format,\n",
+    "- push the exported model to the Hugging Face Hub,\n",
+    "- deploy the model and use it in a chat application.\n",
+    "\n",
+    "Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.\n",
+    "\n",
+    "## Prerequisite: Set up the AWS environment\n",
+    "\n",
+    "*You can skip this section if you are already running this notebook on your instance.*\n",
+    "\n",
+    "In this example, we will use the *inf2.48xlarge* instance with 12 Neuron devices, corresponding to 24 Neuron Cores, and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).\n",
+    "\n",
+    "Refer to [this guide](https://huggingface.co/docs/optimum-neuron/en/guides/setup_aws_instance) to set up your instance and configure Jupyter Notebook on it. Make sure to select an inf2.48xlarge AWS EC2 instance.\n",
+    "\n",
+    "You can then browse to this notebook (`notebooks/text-generation/qwen2-5-7b-chatbot.ipynb`) to continue with the guide.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "44142062",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Special widgets are required for a nicer display\n",
+    "import sys\n",
+    "\n",
+    "!{sys.executable} -m pip install ipywidgets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc76e858",
+   "metadata": {},
+   "source": [
+    "## 1. Export the Qwen 2.5 model to Neuron\n",
+    "\n",
+    "As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),\n",
+    "models need to be compiled and exported to a serialized format before they can run on Neuron devices.\n",
+    "\n",
+    "Fortunately, 🤗 **optimum-neuron** offers an [API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)\n",
+    "to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.\n",
+    "\n",
+    "When exporting the model, we will specify two sets of parameters:\n",
+    "\n",
+    "- using *compiler_args*, we specify how many cores the model should be deployed on (each Neuron device has two cores) and with which precision (here *bfloat16*),\n",
+    "- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that\n",
+    "*sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.\n",
+    "\n",
+    "Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.\n",
+    "\n",
+    "For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "612e39ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from optimum.neuron import NeuronModelForCausalLM\n",
+    "\n",
+    "\n",
+    "compiler_args = {\"num_cores\": 24, \"auto_cast_type\": 'bf16'}\n",
+    "input_shapes = {\"batch_size\": 32, \"sequence_length\": 4096}\n",
+    "model = NeuronModelForCausalLM.from_pretrained(\n",
+    "    \"Qwen/Qwen2.5-7B-Instruct\",\n",
+    "    export=True,\n",
+    "    **compiler_args,\n",
+    "    **input_shapes)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "25440470",
+   "metadata": {},
+   "source": [
+    "This probably took a while.\n",
+    "\n",
+    "Fortunately, you will need to do this only once, because you can save your model and reload it later."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "63ddcd3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save_pretrained(\"qwen-2-5-7b-chat-neuron\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e221d9ad",
+   "metadata": {},
+   "source": [
+    "Even better, you can push it to the [Hugging Face hub](https://huggingface.co/models).\n",
+    "\n",
+    "For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).\n",
+    "\n",
+    "If you are not already logged in on your instance, you will now be prompted for an access token."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "762a9e7d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import notebook_login\n",
+    "\n",
+    "\n",
+    "notebook_login(new_session=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "856c4cc7",
+   "metadata": {},
+   "source": [
+    "By default, the model will be uploaded to your account (organization equal to your user name).\n",
+    "\n",
+    "Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f79155c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import whoami\n",
+    "\n",
+    "\n",
+    "org = whoami()['name']\n",
+    "\n",
+    "repo_id = f\"{org}/qwen-2-5-7b-chat-neuron\"\n",
+    "\n",
+    "model.push_to_hub(\"qwen-2-5-7b-chat-neuron\", repository_id=repo_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10d21867",
+   "metadata": {},
+   "source": [
+    "## 2. Generate text using Qwen 2.5 on AWS Inferentia2\n",
+    "\n",
+    "Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).\n",
+    "\n",
+    "If, as suggested, you skipped the first section, don't worry: we will use a precompiled model already present on the hub instead."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ac1a7c31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from optimum.neuron import NeuronModelForCausalLM\n",
+    "\n",
+    "\n",
+    "try:\n",
+    "    model\n",
+    "except NameError:\n",
+    "    # Edit this to use another base model\n",
+    "    model = NeuronModelForCausalLM.from_pretrained('aws-neuron/qwen2-5-7b-chat-neuron')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a034c58",
+   "metadata": {},
+   "source": [
+    "We will need a *Qwen 2.5* tokenizer to convert the prompt strings to text tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "832d93bc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76a048db",
+   "metadata": {},
+   "source": [
+    "The following generation strategies are supported:\n",
+    "\n",
+    "- greedy search,\n",
+    "- multinomial sampling with top-k and top-p (with temperature).\n",
+    "\n",
+    "Most logits pre-processing/filters (such as repetition penalty) are supported."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7947684c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inputs = tokenizer(\"What is deep-learning ?\", return_tensors=\"pt\")\n",
+    "outputs = model.generate(**inputs,\n",
+    "                         max_new_tokens=128,\n",
+    "                         do_sample=True,\n",
+    "                         temperature=0.9,\n",
+    "                         top_k=50,\n",
+    "                         top_p=0.9)\n",
+    "tokenizer.batch_decode(outputs, skip_special_tokens=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1df9e9bd",
+   "metadata": {},
+   "source": [
+    "## 3. Create a chat application using Qwen on AWS Inferentia2\n",
+    "\n",
+    "The model expects the prompts to be formatted following a specific template corresponding to the interactions between a *user* role and an *assistant* role.\n",
+    "\n",
+    "Each chat model has its own convention for encoding such contents, and we will not go into too much detail in this guide, because we will directly use the [Hugging Face chat templates](https://huggingface.co/blog/chat-templates) corresponding to our model.\n",
+    "\n",
+    "The utility function below converts a list of exchanges between the user and the model into a well-formatted chat prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db16c699",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def format_chat_prompt(message, history, max_tokens):\n",
+    "    \"\"\"Convert a history of messages to a chat prompt.\n",
+    "\n",
+    "    Args:\n",
+    "        message (str): the new user message.\n",
+    "        history (List[str]): the list of user messages and assistant responses.\n",
+    "        max_tokens (int): the maximum number of input tokens accepted by the model.\n",
+    "\n",
+    "    Returns:\n",
+    "        a `str` prompt.\n",
+    "    \"\"\"\n",
+    "    chat = []\n",
+    "    # Convert all messages in history to chat interactions\n",
+    "    for interaction in history:\n",
+    "        chat.append({\"role\": \"user\", \"content\": interaction[0]})\n",
+    "        chat.append({\"role\": \"assistant\", \"content\": interaction[1]})\n",
+    "    # Add the new message\n",
+    "    chat.append({\"role\": \"user\", \"content\": message})\n",
+    "    # Generate the prompt, verifying that we don't go beyond the maximum number of tokens\n",
+    "    for i in range(0, len(chat), 2):\n",
+    "        # Generate candidate prompt with the last n-i entries\n",
+    "        prompt = tokenizer.apply_chat_template(chat[i:], tokenize=False)\n",
+    "        # Tokenize to check if we're over the limit\n",
+    "        tokens = tokenizer(prompt)\n",
+    "        if len(tokens.input_ids) <= max_tokens:\n",
+    "            # We're good, stop here\n",
+    "            return prompt\n",
+    "    # We shall never reach this line\n",
+    "    raise SystemError"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92cac294",
+   "metadata": {},
+   "source": [
+    "We are now equipped to build a simplistic chat application.\n",
+    "\n",
+    "We simply store the interactions between the user and the assistant in a list that we use to generate\n",
+    "the input prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d0bf4952",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "history = []\n",
+    "max_tokens = 1024\n",
+    "\n",
+    "def chat(message, history, max_tokens):\n",
+    "    prompt = format_chat_prompt(message, history, max_tokens)\n",
+    "    # Uncomment the line below to see what the formatted prompt looks like\n",
+    "    #print(prompt)\n",
+    "    inputs = tokenizer(prompt, return_tensors=\"pt\")\n",
+    "    outputs = model.generate(**inputs,\n",
+    "                             max_length=2048,\n",
+    "                             do_sample=True,\n",
+    "                             temperature=0.9,\n",
+    "                             top_k=50,\n",
+    "                             repetition_penalty=1.2)\n",
+    "    # Do not include the input tokens\n",
+    "    outputs = outputs[0, inputs.input_ids.size(-1):]\n",
+    "    response = tokenizer.decode(outputs, skip_special_tokens=True)\n",
+    "    history.append([message, response])\n",
+    "    return response"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f70e487",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"What is deep learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9d344a6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"Is deep learning a subset of machine learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "33330967",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"Is deep learning a subset of supervised learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38df6da1",
+   "metadata": {},
+   "source": [
+    "**Warning**: While very powerful, large language models can sometimes *hallucinate*. We call *hallucinations* generated content that is irrelevant or made up, but presented by the model as if it were accurate. This is a flaw of LLMs and is not a side effect of using them on Trainium / Inferentia."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}