From 478333cc4d1dbd168734af082a0d15fd425130a3 Mon Sep 17 00:00:00 2001
From: Danny
Date: Wed, 17 Jul 2024 09:36:32 +0300
Subject: [PATCH 1/3] Added SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED envar to the readme, in AutoGPTQ

---
 examples/text-generation/README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 174a595934..3ea51e87fb 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -512,11 +512,14 @@ For more details see [documentation](https://docs.habana.ai/en/latest/PyTorch/Mo
 Llama2-7b in UINT4 weight only quantization is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
 Currently, the support is for UINT4 inference of pre-quantized models only.
 
-You can run a *UINT4 weight quantized* model using AutoGPTQ with the argument `--gptq`.
+You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the following environment variables:
+`SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
+and by adding the argument `--gptq`.
 
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
-python run_generation.py \
+SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false \
+ENABLE_EXPERIMENTAL_FLAGS=true python run_generation.py \
 --attn_softmax_bf16 \
 --model_name_or_path TheBloke/Llama-2-7b-Chat-GPTQ \
 --use_hpu_graphs \

From d681e1f69b4a56d03dc96acdfe4c501644c3b56e Mon Sep 17 00:00:00 2001
From: Danny Semiat
Date: Wed, 17 Jul 2024 15:09:03 +0300
Subject: [PATCH 2/3] Update README.md with temp solution remark

---
 examples/text-generation/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 3ea51e87fb..7fd75a0892 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -516,6 +516,9 @@ You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the follo
 `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
 and by adding the argument `--gptq`.
 
+***Note:***
+Setting the environment variables is a temporary requirement, and it's done to improve performance.
+
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
 SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false \

From 758a4960acead4ad7c6a03217ede44b1478ce272 Mon Sep 17 00:00:00 2001
From: Danny Semiat
Date: Wed, 17 Jul 2024 15:32:03 +0300
Subject: [PATCH 3/3] Update README.md

---
 examples/text-generation/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 7fd75a0892..451a91e082 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -517,7 +517,8 @@ You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the follo
 and by adding the argument `--gptq`.
 
 ***Note:***
-Setting the environment variables is a temporary requirement, and it's done to improve performance.
+Setting the above environment variables improves performance. These variables will be removed in future releases.
+
 
 Here is an example to run a quantized model on Llama2-7b `TheBloke/Llama-2-7b-Chat-GPTQ`:
 ```bash
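
A minimal sketch of the workflow these patches document, for readers trying it out: it exports the two environment variables once per shell session instead of prefixing them on the command line, which is equivalent for a single run. The flags after `--use_hpu_graphs` are assumptions for illustration only, since the diff context cuts the original command short; the patched README only explicitly requires `--gptq` for UINT4 inference of pre-quantized models.

```bash
# Equivalent to the inline env-var prefix shown in the patched README;
# the variables only need to be set in the shell that launches the run.
export SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false
export ENABLE_EXPERIMENTAL_FLAGS=true

# Sketch of the invocation. Flags below --use_hpu_graphs are illustrative
# assumptions (not taken from the patch), except --gptq, which the patched
# README requires for running a pre-quantized UINT4 model.
python run_generation.py \
  --attn_softmax_bf16 \
  --model_name_or_path TheBloke/Llama-2-7b-Chat-GPTQ \
  --use_hpu_graphs \
  --max_new_tokens 100 \
  --gptq
```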