Merge branch 'main' into xuehao/fix_install
yiliu30 authored Dec 27, 2024
2 parents f20d8c2 + 3dd8ae6 commit 801cd0a
Showing 3 changed files with 46 additions and 32 deletions.
README.md: 42 changes (23 additions, 19 deletions)
@@ -31,16 +31,26 @@ details and quantized models in several Hugging Face Spaces, e.g. [OPEA](https:/
the [README](./auto_round/mllm/README.md)
* [2024/11] We provide some tips and tricks for LLM & VLM quantization; please check
out [this blog](https://medium.com/@NeuralCompressor/10-tips-for-quantizing-llms-and-vlms-with-autoround-923e733879a7)
* [2024/10] AutoRound has been integrated to [torch/ao](https://github.com/pytorch/ao), check out
their [release note](https://github.com/pytorch/ao/releases/tag/v0.6.1)
* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default
configuration. This configuration is typically better or comparable to asymmetric quantization and significantly
outperforms other symmetric variants, especially at low bit-widths like 2-bit, check
out [some accuracy data](./docs/full_range_sym.md).
* [2024/08] AutoRound format supports Intel Gaudi2 devices. Please refer
to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).
* [2024/08] AutoRound introduces several experimental features, including fast tuning of norm/bias parameters (for 2-bit
and W4A4, check out [more details](./docs/tuning_norm_bias.md)), activation quantization, and the mx_fp data type.

[//]: # (* [2024/10] AutoRound has been integrated to [torch/ao](https://github.com/pytorch/ao), check out)

[//]: # ( their [release note](https://github.com/pytorch/ao/releases/tag/v0.6.1))

[//]: # (* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default)

[//]: # ( configuration. This configuration is typically better or comparable to asymmetric quantization and significantly)

[//]: # ( outperforms other symmetric variants, especially at low bit-widths like 2-bit, check)

[//]: # ( out [some accuracy data](./docs/full_range_sym.md).)

[//]: # (* [2024/08] AutoRound format supports Intel Gaudi2 devices. Please refer)

[//]: # ( to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).)

[//]: # (* [2024/08] AutoRound introduces several experimental features, including fast tuning of norm/bias parameters (for 2-bit)

[//]: # ( and W4A4, check out [more details](./docs/tuning_norm_bias.md)), activation quantization, and the mx_fp data type.)

## Installation

@@ -87,7 +97,7 @@ auto-round \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format "auto_round,auto_gptq" \
--format "auto_gptq,auto_round" \
--disable_eval \
--output_dir ./tmp_autoround
```
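For reference, the same quantization can also be driven from Python. A minimal sketch, assuming the AutoRound Python API described in the project README (the `AutoRound(model, tokenizer, ...)` constructor, `quantize()`, and `save_quantized()`); these names are not part of this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Mirror the CLI flags above: 4-bit weights, group size 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")
```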
@@ -98,26 +108,20 @@ We provide two recipes for best accuracy and fast running speed with low memory.

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round \
auto-round-best \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round \
auto-round-fast \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```
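The `auto-round-best` and `auto-round-fast` entry points bundle the hyperparameters that the earlier commands spelled out with explicit flags. A rough Python equivalent of the "best accuracy" recipe, assuming the `AutoRound` constructor accepts keyword arguments matching the CLI flag names (`nsamples`, `iters`, `low_gpu_mem_usage`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# "Best accuracy" settings: more calibration samples and tuning iterations;
# low_gpu_mem_usage trades ~30% extra runtime for a smaller GPU footprint.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=512,
    iters=1000,
    low_gpu_mem_usage=True,
)
autoround.quantize()
```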

auto_round/quantizer.py: 4 changes (2 additions, 2 deletions)
@@ -316,13 +316,13 @@ def __init__(self, orig_layer):
self.act_quant_func = self.orig_layer.act_quant_func

def forward(self, x):
tensor_max = self.orig_layer.tensor_max if hasattr(self.orig_layer, "tensor_max") else None
act_max = self.orig_layer.act_max if hasattr(self.orig_layer, "act_max") else None
x, _, _ = self.orig_layer.act_quant_func(x, bits=self.orig_layer.act_bits,
group_size=self.orig_layer.group_size,
scale_dtype=self.orig_layer.scale_dtype,
q_scale_thresh=self.orig_layer.q_scale_thresh,
data_type=self.orig_layer.act_data_type,
tensor_max=tensor_max)
tensor_max=act_max)
return self.orig_layer.forward(x)
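The change above swaps the statistic fed to the activation quantizer: the wrapper now reads the layer's recorded activation maximum (`act_max`) instead of `tensor_max`. A minimal, self-contained sketch of that pattern; `fake_quant` below is a stand-in, not the real `act_quant_func`, and only the attribute names mirror the diff:

```python
import torch

def fake_quant(x, bits=8, tensor_max=None):
    # Symmetric fake-quantization: when tensor_max is provided, use it as a
    # static clipping range; otherwise fall back to the batch's own max.
    qmax = 2 ** (bits - 1) - 1
    max_val = tensor_max if tensor_max is not None else x.abs().max()
    scale = torch.clamp(max_val / qmax, min=1e-8)
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

class WrapperLinear(torch.nn.Module):
    def __init__(self, orig_layer):
        super().__init__()
        self.orig_layer = orig_layer

    def forward(self, x):
        # Mirrors the fix: look up the recorded activation max (act_max),
        # not the weight statistic (tensor_max), guarding with hasattr.
        act_max = self.orig_layer.act_max if hasattr(self.orig_layer, "act_max") else None
        x = fake_quant(x, bits=8, tensor_max=act_max)
        return self.orig_layer(x)

layer = torch.nn.Linear(4, 4)
layer.act_max = torch.tensor(3.0)  # e.g. recorded during calibration
out = WrapperLinear(layer)(torch.randn(2, 4))
```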


docs/step_by_step.md: 32 changes (21 additions, 11 deletions)
@@ -105,17 +105,27 @@ Please use ',' to split datasets, ':' to split parameters of a dataset and '+' t
```

- **Enable marlin kernel:**
- We support inference repacking for auto_round sym quantized models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig
backend = "cuda_marlin" #supported in auto_round>0.3.1 and 'pip install -v gptqmodel --no-build-isolation')
quantization_config = AutoRoundConfig(backend=backend)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map=backend.split(':')[0], quantization_config=quantization_config)
```
- To leverage auto-gptq marlin kernel, you need to install auto-gptq from source

[//]: # ( - We support inference repacking for auto_round sym quantized models)

[//]: # ( ```python)

[//]: # ( from transformers import AutoModelForCausalLM, AutoTokenizer)

[//]: # ( from auto_round import AutoRoundConfig)

[//]: # ( backend = "cuda_marlin" #supported in auto_round>0.3.1 and 'pip install -v gptqmodel --no-build-isolation'))

[//]: # ( quantization_config = AutoRoundConfig(backend=backend))

[//]: # ( quantized_model_path = "./tmp_autoround")

[//]: # ( model = AutoModelForCausalLM.from_pretrained(quantized_model_path,)

[//]: # ( device_map=backend.split(':')[0], quantization_config=quantization_config))

[//]: # ( ```)
- To leverage the auto-gptq marlin kernel, you need to install auto-gptq from source and export the model without sharding.

```bash
auto-round --model facebook/opt-125m --sym --bits 4 --group_size 128 --format "gptq:marlin"
```
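Once exported, the checkpoint can typically be loaded back through Transformers like any GPTQ-format model (with auto-gptq installed for the marlin path). A minimal sketch; the output path below is an assumption, since the command above does not set `--output_dir`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"  # assumed output dir; adjust to the real one
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```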
