diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 935556f75..79c6af51c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -26,6 +26,8 @@
       title: Sentence Transformers on AWS Inferentia
     - local: inference_tutorials/stable_diffusion
       title: Generate images with Stable Diffusion models on AWS Inferentia
+    - local: inference_tutorials/qwen2-5-7b-chatbot
+      title: Deploy Qwen 2.5 7B Instruct on AWS Inferentia
     title: Inference Tutorials
 - sections:
   - local: guides/setup_aws_instance
diff --git a/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx b/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx
new file mode 100644
index 000000000..81b7b7aa0
--- /dev/null
+++ b/docs/source/inference_tutorials/qwen2-5-7b-chatbot.mdx
@@ -0,0 +1,226 @@
+
+# Deploy Qwen 2.5 7B Instruct on AWS Inferentia
+
+*There is a notebook version of this tutorial [here](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb)*.
+
+This guide will detail how to export, deploy and run a **Qwen 2.5 7B Instruct** model on AWS Inferentia.
+
+You will learn how to:
+- set up your AWS instance,
+- export the Qwen 2.5 model to the Neuron format,
+- push the exported model to the Hugging Face Hub,
+- deploy the model and use it in a chat application.
+
+Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.
+
+## 1. Export the Qwen 2.5 model to Neuron
+
+As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),
+models need to be compiled and exported to a serialized format before they can run on Neuron devices.
+
+Fortunately, 🤗 **optimum-neuron** offers an [API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)
+to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.
+
+When exporting the model, we will specify two sets of parameters:
+
+- using *compiler_args*, we specify how many cores the model should be deployed on (each Neuron device has two cores) and with which precision (here *bfloat16*),
+- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that
+*sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.
+
+Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.
+
+For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2.
+
+```python
+from optimum.neuron import NeuronModelForCausalLM
+
+
+compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
+input_shapes = {"batch_size": 32, "sequence_length": 4096}
+model = NeuronModelForCausalLM.from_pretrained(
+    "Qwen/Qwen2.5-7B-Instruct",
+    export=True,
+    **compiler_args,
+    **input_shapes)
+```
+
+This will probably take a while.
+
+Fortunately, you will need to do this only once, because you can save your model and reload it later.
+
+```python
+model.save_pretrained("qwen-2-5-7b-chat-neuron")
+```
+
+Even better, you can push it to the [Hugging Face hub](https://huggingface.co/models).
+
+For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).
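+
+The snippet that follows uses `notebook_login`, which is convenient when running this guide as a notebook. If you are instead working from a plain terminal session or a script, one option is to authenticate programmatically with `huggingface_hub` beforehand. The sketch below is only an illustration: the token string is a placeholder for an access token with write permission created in your Hugging Face account settings.
+
+```python
+from huggingface_hub import login
+
+# Placeholder value: replace with your own access token (write permission).
+login(token="hf_xxx")
+```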
+
+If you are not already logged in on your instance, you will now be prompted for an access token.
+
+```python
+from huggingface_hub import notebook_login
+
+
+notebook_login(new_session=False)
+```
+
+By default, the model will be uploaded to your account (organization equal to your user name).
+
+Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations).
+
+```python
+from huggingface_hub import whoami
+
+
+org = whoami()['name']
+
+repo_id = f"{org}/qwen-2-5-7b-chat-neuron"
+
+model.push_to_hub("qwen-2-5-7b-chat-neuron", repository_id=repo_id)
+```
+
+## 2. Generate text using Qwen 2.5 on AWS Inferentia2
+
+Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).
+
+If, as suggested, you skipped the first section, don't worry: we will use a precompiled model already present on the hub instead.
+
+```python
+from optimum.neuron import NeuronModelForCausalLM
+
+try:
+    model
+except NameError:
+    # Edit this to use another base model
+    model = NeuronModelForCausalLM.from_pretrained('aws-neuron/qwen2-5-7b-chat-neuron')
+```
+
+We will need a *Qwen 2.5* tokenizer to convert the prompt strings to text tokens.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+```
+
+The following generation strategies are supported:
+
+- greedy search,
+- multinomial sampling with top-k and top-p (with temperature).
+
+Most logits pre-processing/filters (such as repetition penalty) are supported.
+
+```python
+inputs = tokenizer("What is deep-learning ?", return_tensors="pt")
+outputs = model.generate(**inputs,
+                         max_new_tokens=128,
+                         do_sample=True,
+                         temperature=0.9,
+                         top_k=50,
+                         top_p=0.9)
+tokenizer.batch_decode(outputs, skip_special_tokens=True)
+```
+
+## 3. Create a chat application using Qwen on AWS Inferentia2
+
+The model expects the prompts to be formatted following a specific template corresponding to the interactions between a *user* role and an *assistant* role.
+
+Each chat model has its own convention for encoding such contents, and we will not go into too much detail in this guide, because we will directly use the [Hugging Face chat templates](https://huggingface.co/blog/chat-templates) corresponding to our model.
+
+The utility function below converts a list of exchanges between the user and the model into a well-formatted chat prompt.
+
+```python
+def format_chat_prompt(message, history, max_tokens):
+    """Convert a history of messages to a chat prompt.
+
+    Args:
+        message (str): the new user message.
+        history (List[str]): the list of user messages and assistant responses.
+        max_tokens (int): the maximum number of input tokens accepted by the model.
+
+    Returns:
+        a `str` prompt.
+    """
+    chat = []
+    # Convert all messages in history to chat interactions
+    for interaction in history:
+        chat.append({"role": "user", "content": interaction[0]})
+        chat.append({"role": "assistant", "content": interaction[1]})
+    # Add the new message
+    chat.append({"role": "user", "content": message})
+    # Generate the prompt, verifying that we don't go beyond the maximum number of tokens
+    for i in range(0, len(chat), 2):
+        # Generate candidate prompt with the last n-i entries
+        prompt = tokenizer.apply_chat_template(chat[i:], tokenize=False)
+        # Tokenize to check if we're over the limit
+        tokens = tokenizer(prompt)
+        if len(tokens.input_ids) <= max_tokens:
+            # We're good, stop here
+            return prompt
+    # We shall never reach this line
+    raise SystemError
+```
+
+We are now equipped to build a simplistic chat application.
+
+We simply store the interactions between the user and the assistant in a list that we use to generate
+the input prompt.
+
+```python
+history = []
+max_tokens = 1024
+
+def chat(message, history, max_tokens):
+    prompt = format_chat_prompt(message, history, max_tokens)
+    # Uncomment the line below to see what the formatted prompt looks like
+    #print(prompt)
+    inputs = tokenizer(prompt, return_tensors="pt")
+    outputs = model.generate(**inputs,
+                             max_length=2048,
+                             do_sample=True,
+                             temperature=0.9,
+                             top_k=50,
+                             repetition_penalty=1.2)
+    # Do not include the input tokens
+    outputs = outputs[0, inputs.input_ids.size(-1):]
+    response = tokenizer.decode(outputs, skip_special_tokens=True)
+    history.append([message, response])
+    return response
+```
+
+To test the chat application, you can for instance use the following sequence of prompts:
+
+```python
+print(chat("What is deep learning ?", history, max_tokens))
+print(chat("Is deep learning a subset of machine learning ?", history, max_tokens))
+print(chat("Is deep learning a subset of supervised learning ?", history, max_tokens))
+```
+
+While very powerful, large language models can sometimes *hallucinate*. We call *hallucinations* generated content that is irrelevant or made up, but presented by the model as if it were accurate. This is a flaw of LLMs and is not a side effect of using them on Trainium / Inferentia.
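+
+Finally, if you would rather interact with the model from a terminal, you can wrap the `chat` function in a minimal read-eval loop. This is only a sketch built on the objects already defined in this guide; type an empty line to stop.
+
+```python
+# Minimal interactive loop around the chat() function defined above.
+history = []  # start a fresh conversation
+while True:
+    message = input("user> ")
+    if not message:
+        break
+    print("assistant>", chat(message, history, max_tokens))
+```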
+
+
diff --git a/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb b/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb
new file mode 100644
index 000000000..d36e91852
--- /dev/null
+++ b/notebooks/text-generation/qwen2-5-7b-chatbot.ipynb
@@ -0,0 +1,387 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "bae23f09",
+   "metadata": {},
+   "source": [
+    "# Deploy Qwen 2.5 7B Instruct on AWS Inferentia\n",
+    "\n",
+    "This guide will detail how to export, deploy and run a **Qwen 2.5 7B Instruct** model on AWS Inferentia.\n",
+    "\n",
+    "You will learn how to:\n",
+    "- set up your AWS instance,\n",
+    "- export the Qwen 2.5 model to the Neuron format,\n",
+    "- push the exported model to the Hugging Face Hub,\n",
+    "- deploy the model and use it in a chat application.\n",
+    "\n",
+    "Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.\n",
+    "\n",
+    "## Prerequisite: Set up the AWS environment\n",
+    "\n",
+    "*You can skip this section if you are already running this notebook on your instance.*\n",
+    "\n",
+    "In this example, we will use the *inf2.48xlarge* instance with 12 Neuron devices, corresponding to 24 Neuron Cores, and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).\n",
+    "\n",
+    "Refer to [this guide](https://huggingface.co/docs/optimum-neuron/en/guides/setup_aws_instance) to set up your instance and configure Jupyter Notebook on it. Make sure to select an inf2.48xlarge AWS EC2 instance.\n",
+    "\n",
+    "You can then browse to this notebook (`notebooks/text-generation/qwen2-5-7b-chatbot.ipynb`) to continue with the guide.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "44142062",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Special widgets are required for a nicer display\n",
+    "import sys\n",
+    "\n",
+    "!{sys.executable} -m pip install ipywidgets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc76e858",
+   "metadata": {},
+   "source": [
+    "## 1. Export the Qwen 2.5 model to Neuron\n",
+    "\n",
+    "As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),\n",
+    "models need to be compiled and exported to a serialized format before they can run on Neuron devices.\n",
+    "\n",
+    "Fortunately, 🤗 **optimum-neuron** offers an [API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)\n",
+    "to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.\n",
+    "\n",
+    "When exporting the model, we will specify two sets of parameters:\n",
+    "\n",
+    "- using *compiler_args*, we specify how many cores the model should be deployed on (each Neuron device has two cores) and with which precision (here *bfloat16*),\n",
+    "- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that\n",
+    "*sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.\n",
+    "\n",
+    "Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.\n",
+    "\n",
+    "For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "612e39ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from optimum.neuron import NeuronModelForCausalLM\n",
+    "\n",
+    "\n",
+    "compiler_args = {\"num_cores\": 24, \"auto_cast_type\": 'bf16'}\n",
+    "input_shapes = {\"batch_size\": 32, \"sequence_length\": 4096}\n",
+    "model = NeuronModelForCausalLM.from_pretrained(\n",
+    "    \"Qwen/Qwen2.5-7B-Instruct\",\n",
+    "    export=True,\n",
+    "    **compiler_args,\n",
+    "    **input_shapes)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "25440470",
+   "metadata": {},
+   "source": [
+    "This probably took a while.\n",
+    "\n",
+    "Fortunately, you will need to do this only once, because you can save your model and reload it later."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "63ddcd3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save_pretrained(\"qwen-2-5-7b-chat-neuron\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e221d9ad",
+   "metadata": {},
+   "source": [
+    "Even better, you can push it to the [Hugging Face hub](https://huggingface.co/models).\n",
+    "\n",
+    "For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).\n",
+    "\n",
+    "If you are not already logged in on your instance, you will now be prompted for an access token."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "762a9e7d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import notebook_login\n",
+    "\n",
+    "\n",
+    "notebook_login(new_session=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "856c4cc7",
+   "metadata": {},
+   "source": [
+    "By default, the model will be uploaded to your account (organization equal to your user name).\n",
+    "\n",
+    "Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f79155c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import whoami\n",
+    "\n",
+    "\n",
+    "org = whoami()['name']\n",
+    "\n",
+    "repo_id = f\"{org}/qwen-2-5-7b-chat-neuron\"\n",
+    "\n",
+    "model.push_to_hub(\"qwen-2-5-7b-chat-neuron\", repository_id=repo_id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10d21867",
+   "metadata": {},
+   "source": [
+    "## 2. Generate text using Qwen 2.5 on AWS Inferentia2\n",
+    "\n",
+    "Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).\n",
+    "\n",
+    "If, as suggested, you skipped the first section, don't worry: we will use a precompiled model already present on the hub instead."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ac1a7c31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from optimum.neuron import NeuronModelForCausalLM\n",
+    "\n",
+    "\n",
+    "try:\n",
+    "    model\n",
+    "except NameError:\n",
+    "    # Edit this to use another base model\n",
+    "    model = NeuronModelForCausalLM.from_pretrained('aws-neuron/qwen2-5-7b-chat-neuron')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a034c58",
+   "metadata": {},
+   "source": [
+    "We will need a *Qwen 2.5* tokenizer to convert the prompt strings to text tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "832d93bc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer\n",
+    "\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76a048db",
+   "metadata": {},
+   "source": [
+    "The following generation strategies are supported:\n",
+    "\n",
+    "- greedy search,\n",
+    "- multinomial sampling with top-k and top-p (with temperature).\n",
+    "\n",
+    "Most logits pre-processing/filters (such as repetition penalty) are supported."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7947684c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inputs = tokenizer(\"What is deep-learning ?\", return_tensors=\"pt\")\n",
+    "outputs = model.generate(**inputs,\n",
+    "                         max_new_tokens=128,\n",
+    "                         do_sample=True,\n",
+    "                         temperature=0.9,\n",
+    "                         top_k=50,\n",
+    "                         top_p=0.9)\n",
+    "tokenizer.batch_decode(outputs, skip_special_tokens=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1df9e9bd",
+   "metadata": {},
+   "source": [
+    "## 3. Create a chat application using Qwen on AWS Inferentia2\n",
+    "\n",
+    "The model expects the prompts to be formatted following a specific template corresponding to the interactions between a *user* role and an *assistant* role.\n",
+    "\n",
+    "Each chat model has its own convention for encoding such contents, and we will not go into too much detail in this guide, because we will directly use the [Hugging Face chat templates](https://huggingface.co/blog/chat-templates) corresponding to our model.\n",
+    "\n",
+    "The utility function below converts a list of exchanges between the user and the model into a well-formatted chat prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db16c699",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def format_chat_prompt(message, history, max_tokens):\n",
+    "    \"\"\"Convert a history of messages to a chat prompt.\n",
+    "\n",
+    "    Args:\n",
+    "        message (str): the new user message.\n",
+    "        history (List[str]): the list of user messages and assistant responses.\n",
+    "        max_tokens (int): the maximum number of input tokens accepted by the model.\n",
+    "\n",
+    "    Returns:\n",
+    "        a `str` prompt.\n",
+    "    \"\"\"\n",
+    "    chat = []\n",
+    "    # Convert all messages in history to chat interactions\n",
+    "    for interaction in history:\n",
+    "        chat.append({\"role\": \"user\", \"content\": interaction[0]})\n",
+    "        chat.append({\"role\": \"assistant\", \"content\": interaction[1]})\n",
+    "    # Add the new message\n",
+    "    chat.append({\"role\": \"user\", \"content\": message})\n",
+    "    # Generate the prompt, verifying that we don't go beyond the maximum number of tokens\n",
+    "    for i in range(0, len(chat), 2):\n",
+    "        # Generate candidate prompt with the last n-i entries\n",
+    "        prompt = tokenizer.apply_chat_template(chat[i:], tokenize=False)\n",
+    "        # Tokenize to check if we're over the limit\n",
+    "        tokens = tokenizer(prompt)\n",
+    "        if len(tokens.input_ids) <= max_tokens:\n",
+    "            # We're good, stop here\n",
+    "            return prompt\n",
+    "    # We shall never reach this line\n",
+    "    raise SystemError"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92cac294",
+   "metadata": {},
+   "source": [
+    "We are now equipped to build a simplistic chat application.\n",
+    "\n",
+    "We simply store the interactions between the user and the assistant in a list that we use to generate\n",
+    "the input prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d0bf4952",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "history = []\n",
+    "max_tokens = 1024\n",
+    "\n",
+    "def chat(message, history, max_tokens):\n",
+    "    prompt = format_chat_prompt(message, history, max_tokens)\n",
+    "    # Uncomment the line below to see what the formatted prompt looks like\n",
+    "    #print(prompt)\n",
+    "    inputs = tokenizer(prompt, return_tensors=\"pt\")\n",
+    "    outputs = model.generate(**inputs,\n",
+    "                             max_length=2048,\n",
+    "                             do_sample=True,\n",
+    "                             temperature=0.9,\n",
+    "                             top_k=50,\n",
+    "                             repetition_penalty=1.2)\n",
+    "    # Do not include the input tokens\n",
+    "    outputs = outputs[0, inputs.input_ids.size(-1):]\n",
+    "    response = tokenizer.decode(outputs, skip_special_tokens=True)\n",
+    "    history.append([message, response])\n",
+    "    return response"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f70e487",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"What is deep learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9d344a6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"Is deep learning a subset of machine learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "33330967",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(chat(\"Is deep learning a subset of supervised learning ?\", history, max_tokens))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38df6da1",
+   "metadata": {},
+   "source": [
+    "**Warning**: While very powerful, large language models can sometimes *hallucinate*. We call *hallucinations* generated content that is irrelevant or made up, but presented by the model as if it were accurate. This is a flaw of LLMs and is not a side effect of using them on Trainium / Inferentia."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}