From 478333cc4d1dbd168734af082a0d15fd425130a3 Mon Sep 17 00:00:00 2001
From: Danny
Date: Wed, 17 Jul 2024 09:36:32 +0300
Subject: [PATCH 1/3] Added SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED envar to the readme, in AutoGPTQ

---
 examples/text-generation/README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 174a595934..3ea51e87fb 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -512,11 +512,14 @@ For more details see [documentation](https://docs.habana.ai/en/latest/PyTorch/Mo
 Llama2-7b in UINT4 weight only quantization is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
 Currently, the support is for UINT4 inference of pre-quantized models only.
 
-You can run a *UINT4 weight quantized* model using AutoGPTQ with the argument `--gptq`.
+You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the following environment variables:
+`SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
+and by adding the argument `--gptq`.
 
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
-python run_generation.py \
+SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false \
+ENABLE_EXPERIMENTAL_FLAGS=true python run_generation.py \
 --attn_softmax_bf16 \
 --model_name_or_path TheBloke/Llama-2-7b-Chat-GPTQ \
 --use_hpu_graphs \

From d681e1f69b4a56d03dc96acdfe4c501644c3b56e Mon Sep 17 00:00:00 2001
From: Danny Semiat
Date: Wed, 17 Jul 2024 15:09:03 +0300
Subject: [PATCH 2/3] Update README.md with temp solution remark

---
 examples/text-generation/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 3ea51e87fb..7fd75a0892 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -516,6 +516,9 @@ You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the follo
 `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
 and by adding the argument `--gptq`.
 
+***Note:***
+Setting the environment variables is a temporary requirement, and it's done to improve performance.
+
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
 SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false \

From 758a4960acead4ad7c6a03217ede44b1478ce272 Mon Sep 17 00:00:00 2001
From: Danny Semiat
Date: Wed, 17 Jul 2024 15:32:03 +0300
Subject: [PATCH 3/3] Update README.md

---
 examples/text-generation/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 7fd75a0892..451a91e082 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -517,7 +517,8 @@ You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the follo
 and by adding the argument `--gptq`.
 
 ***Note:***
-Setting the environment variables is a temporary requirement, and it's done to improve performance.
+Setting the above environment variables improves performance. These variables will be removed in future releases.
+
 
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
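
A minimal sketch of the workflow these patches document, for readers trying it out: it exports the two environment variables once per shell session instead of prefixing them on the command line, which is equivalent for a single run. The flags after `--use_hpu_graphs` are assumptions for illustration only, since the diff context cuts the original command short; the patched README only explicitly requires `--gptq` for UINT4 inference of pre-quantized models.

```bash
# Equivalent to the inline env-var prefix shown in the patched README;
# the variables only need to be set in the shell that launches the run.
export SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false
export ENABLE_EXPERIMENTAL_FLAGS=true

# Sketch of the invocation. Flags below --use_hpu_graphs are illustrative
# assumptions (not taken from the patch), except --gptq, which the patched
# README requires for running a pre-quantized UINT4 model.
python run_generation.py \
  --attn_softmax_bf16 \
  --model_name_or_path TheBloke/Llama-2-7b-Chat-GPTQ \
  --use_hpu_graphs \
  --max_new_tokens 100 \
  --gptq
```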