Merge branch 'main' into xuehao/fix_install
yiliu30 authored Dec 27, 2024
2 parents f20d8c2 + 3dd8ae6 commit 801cd0a
Showing 3 changed files with 46 additions and 32 deletions.
README.md: 42 changes (23 additions, 19 deletions)
@@ -31,16 +31,26 @@ details and quantized models in several Hugging Face Spaces, e.g. [OPEA](https:/
the [README](./auto_round/mllm/README.md)
* [2024/11] We provide some tips and tricks for LLM & VLM quantization; please check
out [this blog](https://medium.com/@NeuralCompressor/10-tips-for-quantizing-llms-and-vlms-with-autoround-923e733879a7)
* [2024/10] AutoRound has been integrated to [torch/ao](https://github.com/pytorch/ao), check out
their [release note](https://github.com/pytorch/ao/releases/tag/v0.6.1)
* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default
configuration. This configuration is typically better or comparable to asymmetric quantization and significantly
outperforms other symmetric variants, especially at low bit-widths like 2-bit, check
out [some accuracy data](./docs/full_range_sym.md).
* [2024/08] AutoRound format supports Intel Gaudi2 devices. Please refer
to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).
* [2024/08] AutoRound introduces several experimental features, including fast tuning of norm/bias parameters (for 2-bit
and W4A4, check out [more details](./docs/tuning_norm_bias.md)), activation quantization, and the mx_fp data type.

[//]: # (* [2024/10] AutoRound has been integrated to [torch/ao](https://github.com/pytorch/ao), check out)

[//]: # ( their [release note](https://github.com/pytorch/ao/releases/tag/v0.6.1))

[//]: # (* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default)

[//]: # ( configuration. This configuration is typically better or comparable to asymmetric quantization and significantly)

[//]: # ( outperforms other symmetric variants, especially at low bit-widths like 2-bit, check)

[//]: # ( out [some accuracy data](./docs/full_range_sym.md).)

[//]: # (* [2024/08] AutoRound format supports Intel Gaudi2 devices. Please refer)

[//]: # ( to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).)

[//]: # (* [2024/08] AutoRound introduces several experimental features, including fast tuning of norm/bias parameters (for 2-bit)

[//]: # ( and W4A4, check out [more details](./docs/tuning_norm_bias.md)), activation quantization, and the mx_fp data type.)

## Installation

@@ -87,7 +97,7 @@ auto-round \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format "auto_round,auto_gptq" \
--format "auto_gptq,auto_round" \
--disable_eval \
--output_dir ./tmp_autoround
```
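For reference, the same quantization can also be driven from Python. A minimal sketch, assuming the AutoRound Python API described in the project README (the `AutoRound(model, tokenizer, ...)` constructor, `quantize()`, and `save_quantized()`); these names are not part of this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Mirror the CLI flags above: 4-bit weights, group size 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")
```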
@@ -98,26 +108,20 @@ We provide two recipes for best accuracy and fast running speed with low memory.

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round \
auto-round-best \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round \
auto-round-fast \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```
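The `auto-round-best` and `auto-round-fast` entry points bundle the hyperparameters that the earlier commands spelled out with explicit flags. A rough Python equivalent of the "best accuracy" recipe, assuming the `AutoRound` constructor accepts keyword arguments matching the CLI flag names (`nsamples`, `iters`, `low_gpu_mem_usage`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# "Best accuracy" settings: more calibration samples and tuning iterations;
# low_gpu_mem_usage trades ~30% extra runtime for a smaller GPU footprint.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=512,
    iters=1000,
    low_gpu_mem_usage=True,
)
autoround.quantize()
```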

auto_round/quantizer.py: 4 changes (2 additions, 2 deletions)
@@ -316,13 +316,13 @@ def __init__(self, orig_layer):
self.act_quant_func = self.orig_layer.act_quant_func

def forward(self, x):
tensor_max = self.orig_layer.tensor_max if hasattr(self.orig_layer, "tensor_max") else None
act_max = self.orig_layer.act_max if hasattr(self.orig_layer, "act_max") else None
x, _, _ = self.orig_layer.act_quant_func(x, bits=self.orig_layer.act_bits,
group_size=self.orig_layer.group_size,
scale_dtype=self.orig_layer.scale_dtype,
q_scale_thresh=self.orig_layer.q_scale_thresh,
data_type=self.orig_layer.act_data_type,
tensor_max=tensor_max)
tensor_max=act_max)
return self.orig_layer.forward(x)
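The change above swaps the statistic fed to the activation quantizer: the wrapper now reads the layer's recorded activation maximum (`act_max`) instead of `tensor_max`. A minimal, self-contained sketch of that pattern; `fake_quant` below is a stand-in, not the real `act_quant_func`, and only the attribute names mirror the diff:

```python
import torch

def fake_quant(x, bits=8, tensor_max=None):
    # Symmetric fake-quantization: when tensor_max is provided, use it as a
    # static clipping range; otherwise fall back to the batch's own max.
    qmax = 2 ** (bits - 1) - 1
    max_val = tensor_max if tensor_max is not None else x.abs().max()
    scale = torch.clamp(max_val / qmax, min=1e-8)
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

class WrapperLinear(torch.nn.Module):
    def __init__(self, orig_layer):
        super().__init__()
        self.orig_layer = orig_layer

    def forward(self, x):
        # Mirrors the fix: look up the recorded activation max (act_max),
        # not the weight statistic (tensor_max), guarding with hasattr.
        act_max = self.orig_layer.act_max if hasattr(self.orig_layer, "act_max") else None
        x = fake_quant(x, bits=8, tensor_max=act_max)
        return self.orig_layer(x)

layer = torch.nn.Linear(4, 4)
layer.act_max = torch.tensor(3.0)  # e.g. recorded during calibration
out = WrapperLinear(layer)(torch.randn(2, 4))
```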


docs/step_by_step.md: 32 changes (21 additions, 11 deletions)
@@ -105,17 +105,27 @@ Please use ',' to split datasets, ':' to split parameters of a dataset and '+' t
```

- **Enable marlin kernel:**
- We support inference repacking for auto_round sym quantized models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig
backend = "cuda_marlin" #supported in auto_round>0.3.1 and 'pip install -v gptqmodel --no-build-isolation')
quantization_config = AutoRoundConfig(backend=backend)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map=backend.split(':')[0], quantization_config=quantization_config)
```
- To leverage auto-gptq marlin kernel, you need to install auto-gptq from source

[//]: # ( - We support inference repacking for auto_round sym quantized models)

[//]: # ( ```python)

[//]: # ( from transformers import AutoModelForCausalLM, AutoTokenizer)

[//]: # ( from auto_round import AutoRoundConfig)

[//]: # ( backend = "cuda_marlin" #supported in auto_round>0.3.1 and 'pip install -v gptqmodel --no-build-isolation'))

[//]: # ( quantization_config = AutoRoundConfig(backend=backend))

[//]: # ( quantized_model_path = "./tmp_autoround")

[//]: # ( model = AutoModelForCausalLM.from_pretrained(quantized_model_path,)

[//]: # ( device_map=backend.split(':')[0], quantization_config=quantization_config))

[//]: # ( ```)
- To leverage the auto-gptq marlin kernel, you need to install auto-gptq from source and export the model without sharding.

```bash
auto-round --model facebook/opt-125m --sym --bits 4 --group_size 128 --format "gptq:marlin"
```
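Once exported, the checkpoint can typically be loaded back through Transformers like any GPTQ-format model (with auto-gptq installed for the marlin path). A minimal sketch; the output path below is an assumption, since the command above does not set `--output_dir`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"  # assumed output dir; adjust to the real one
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```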
