
Commit

Run quantization.md document from docs/ (#718)
* improve updown parser, and use in README.md execution

* cut/paste errors

* typo: true -> false

* we scan each partial line, so need to suppress at partial line level :(

* make it twice as nice

* improved updown parsing

* special handling for lines w/o option

* enable run on quantization doc

* handle white space before triple backtick

* updates

* mps test

* updates

* Update run-readme-pr-macos.yml

Rename test to avoid name conflict

* Update run-readme-pr.yml

Y

* Update run-readme-pr-mps.yml

2

* typos

* add updown end command

* typo

* move broken mps

* Update parking_lot/run-readme-pr-mps.yml

Co-authored-by: Eli Uriegas <[email protected]>

---------

Co-authored-by: Eli Uriegas <[email protected]>
mikekgfb and seemethere authored May 8, 2024
1 parent 17fc0bb commit 155d484
Showing 6 changed files with 291 additions and 41 deletions.
30 changes: 23 additions & 7 deletions .github/workflows/run-readme-periodic.yml
@@ -34,19 +34,35 @@ jobs:
# )
# echo "::endgroup::"
echo "::group::Create script"
python3 scripts/updown.py --file README.md > ./we-run-this.sh
echo "::group::Create script to run README"
python3 scripts/updown.py --file README.md > ./run-readme.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./we-run-this.sh
echo "exit 1" >> ./run-readme.sh
echo "::endgroup::"
echo "::group::Run This"
echo "::group::Run README"
echo "*******************************************"
cat ./we-run-this.sh
cat ./run-readme.sh
echo "*******************************************"
bash -x ./we-run-this.sh
bash -x ./run-readme.sh
echo "::endgroup::"
echo "::group::Create script to run quantization"
python3 scripts/updown.py --file docs/quantization.md > ./run-quantization.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-quantization.sh
echo "::endgroup::"
echo "::group::Run quantization"
echo "*******************************************"
cat ./run-quantization.sh
echo "*******************************************"
bash -x ./run-quantization.sh
echo "::endgroup::"
echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
74 changes: 64 additions & 10 deletions .github/workflows/run-readme-pr-macos.yml
Expand Up @@ -7,7 +7,7 @@ on:
workflow_dispatch:
jobs:
test-readme-macos:
runs-on: macos-14-xlarge
runs-on: macos-14-xlarge
steps:
- name: Checkout code
uses: actions/checkout@v2
@@ -37,20 +37,74 @@ jobs:
# yum install -y devtoolset-10-binutils
# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
# echo "::endgroup::"
echo "::group::Create script"
python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./we-run-this.sh
# for good measure, if something happened to updown processor,
echo "::group::Create script to run README"
python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./run-readme.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./we-run-this.sh
echo "exit 1" >> ./run-readme.sh
echo "::endgroup::"
echo "::group::Run This"

echo "::group::Run README"
echo "*******************************************"
cat ./run-readme.sh
echo "*******************************************"
cat ./we-run-this.sh
bash -x ./run-readme.sh
echo "::endgroup::"

echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
bash -x ./we-run-this.sh
echo "::endgroup::"


test-quantization-macos:
runs-on: macos-14-xlarge
steps:
- name: Checkout code
uses: actions/checkout@v2
- uses: actions/setup-python@v4
with:
python-version: '3.10.11'
- name: Setup Xcode
if: runner.os == 'macOS'
uses: maxim-lobanov/setup-xcode@v1
with:
xcode-version: '15.3'
- name: Run script
run: |
set -x
# NS: Remove previous installation of torch first
# as this script does not install anything into conda env but rather as system dep
pip3 uninstall -y torch || true
set -eou pipefail
echo "::group::Print machine info"
uname -a
sysctl machdep.cpu.brand_string
sysctl machdep.cpu.core_count
echo "::endgroup::"
# echo "::group::Install newer objcopy that supports --set-section-alignment"
# yum install -y devtoolset-10-binutils
# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
# echo "::endgroup::"
echo "::group::Create script to run quantization"
python3 scripts/updown.py --file docs/quantization.md --replace llama3:stories15M --suppress huggingface-cli,HF_TOKEN > ./run-quantization.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-quantization.sh
echo "::endgroup::"

echo "::group::Run quantization"
echo "*******************************************"
cat ./run-quantization.sh
echo "*******************************************"
bash -x ./run-quantization.sh
echo "::endgroup::"

echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
50 changes: 44 additions & 6 deletions .github/workflows/run-readme-pr.yml
@@ -25,19 +25,57 @@ jobs:
# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
# echo "::endgroup::"
echo "::group::Create script"
python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./we-run-this.sh
echo "::group::Create script to run README"
python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./run-readme.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./we-run-this.sh
echo "exit 1" >> ./run-readme.sh
echo "::endgroup::"
echo "::group::Run This"
echo "::group::Run README"
echo "*******************************************"
cat ./we-run-this.sh
cat ./run-readme.sh
echo "*******************************************"
bash -x ./we-run-this.sh
bash -x ./run-readme.sh
echo "::endgroup::"
echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
test-quantization-any:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.g5.4xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.1"
timeout: 60
script: |
echo "::group::Print machine info"
uname -a
echo "::endgroup::"
# echo "::group::Install newer objcopy that supports --set-section-alignment"
# yum install -y devtoolset-10-binutils
# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
# echo "::endgroup::"
echo "::group::Create script to run quantization"
python3 scripts/updown.py --file docs/quantization.md --replace llama3:stories15M --suppress huggingface-cli,HF_TOKEN > ./run-quantization.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-quantization.sh
echo "::endgroup::"
echo "::group::Run quantization"
echo "*******************************************"
cat ./run-quantization.sh
echo "*******************************************"
bash -x ./run-quantization.sh
echo "::endgroup::"
echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
80 changes: 63 additions & 17 deletions docs/quantization.md
@@ -1,6 +1,9 @@

# Quantization

[shell default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}" huggingface-cli login
[shell default]: TORCHCHAT_ROOT=${PWD} ./scripts/install_et.sh

## Introduction
Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization also reduces the required memory bandwidth and takes advantage of the massive compute capabilities provided by today's server-based accelerators.
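
To make the float-to-integer mapping concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization — illustrative only, not one of torchchat's actual schemes or kernels:

```
# Minimal illustration of symmetric per-tensor int8 quantization:
# store int8 values plus one float scale, trading precision for memory.
# Not torchchat code.
import torch

w = torch.randn(4, 8)                     # float32 weights
scale = w.abs().max() / 127.0             # one scale for the whole tensor
w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
w_hat = w_q.float() * scale               # dequantized approximation
print("max abs error:", (w - w_hat).abs().max().item())
```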

@@ -13,36 +13,68 @@ While quantization can potentially degrade the model's performance, the methods
| linear (asymmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
| linear with GPTQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
| linear with HQQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256] | a8w4dq | 🚧 |🚧 ||
| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||

### Embedding Quantization
Due to the larger vocabulary size of llama3, we also recommend quantizing the embeddings to further reduce the model size for on-device use cases.

Due to the larger vocabulary size of llama3, we also recommend
quantizing the embeddings to further reduce the model size for
on-device use cases.

| compression | FP Precision | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
|--|--|--|--|--|--|--|--|
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||||
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [ any > 1 ] | ||||

^a8w4dq quantization scheme requires the model to be converted to fp32, due to lack of support for fp16 and bf16 in the kernels provided with ExecuTorch.
^ a8w4dq quantization scheme requires the model to be converted to fp32,
due to lack of support for fp16 and bf16 in the kernels provided with
ExecuTorch.

* These are the only valid bitwidth options.

** There are many valid group size options, including 512, 1024, etc. Note that smaller groupsize tends to be better for preserving model quality and accuracy, and larger groupsize for further improving performance. Set 0 for channelwise quantization.
** There are many valid group size options, including 512, 1024,
etc. Note that smaller groupsize tends to be better for preserving
model quality and accuracy, and larger groupsize for further
improving performance. Set 0 for channelwise quantization.

*** [GPTQ](https://arxiv.org/abs/2210.17323) and [HQQ](https://mobiusml.github.io/hqq_blog/) are two different algorithms to address accuracy loss when using lower bit quantization. Because HQQ relies on data-free/calibration-free quantization, it tends to take less time to quantize a model.
*** [GPTQ](https://arxiv.org/abs/2210.17323) and
[HQQ](https://mobiusml.github.io/hqq_blog/) are two different
algorithms to address accuracy loss when using lower bit
quantization. Because HQQ relies on data-free/calibration-free
quantization, it tends to take less time to quantize a model.
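
To make the group-size trade-off described above concrete, here is a hedged sketch of group-wise symmetric weight quantization — one scale per group of consecutive weights; illustrative only, not torchchat's kernels:

```
# Illustrative group-wise symmetric quantization: one scale per group of
# `groupsize` consecutive weights in each row. Smaller groups track the
# weights more closely (better accuracy); larger groups store fewer
# scales and favor speed. Not torchchat code.
import torch

def quantize_groupwise(w: torch.Tensor, groupsize: int, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    rows, cols = w.shape                             # cols must be divisible by groupsize
    g = w.reshape(rows, cols // groupsize, groupsize)
    scale = g.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
    return q.reshape(rows, cols), scale.squeeze(-1)  # ints + per-group scales

w = torch.randn(2, 256)
q, scales = quantize_groupwise(w, groupsize=32)      # scales.shape == (2, 8)
```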

## Quantization Profiles
Torchchat quantization supports profiles with multiple settings such as accelerator, dtype, and quantization specified in a JSON file. Four sample profiles are included with the torchchat distribution in config/data: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`, with profiles optimizing for execution on CUDA, desktop, mobile, and Raspberry Pi devices.

In addition to quantization recipes described below, the profiles also enable developers to specify the accelerator and dtype to be used.

At present torchchat supports the fast, cuda, mps, and cpu devices. The default device in torchchat is "fast". The "fast" device is a virtual device that defaults to the fastest executor available in the system, selecting cuda, mps, and cpu in this order.

At present torchchat supports the fast16, fast, bf16, fp16 and fp32 data types. The default data type for models is "fast16". The "fast16" data type is a virtual data type that defaults to the best 16-bit floating point data type available on the selected device. The "fast" data type is a virtual data type that defaults to the best floating point data type available on the selected device. ("Best" tangibly representing a combination of speed and accuracy.)
Torchchat quantization supports profiles with multiple settings such
as accelerator, dtype, and quantization specified in a JSON file.
Four sample profiles are included with the torchchat distribution in
config/data: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`,
with profiles optimizing for execution on CUDA, desktop, mobile, and
Raspberry Pi devices.

In addition to quantization recipes described below, the profiles also
enable developers to specify the accelerator and dtype to be used.

At present torchchat supports the fast, cuda, mps, and cpu devices.
The default device in torchchat is "fast". The "fast" device is a
virtual device that defaults to the fastest executor available in the
system, selecting cuda, mps, and cpu in this order.
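
A sketch of how such a virtual device could resolve, assuming the order stated above (illustrative; not torchchat's actual resolution code):

```
# Assumed resolution logic for the "fast" virtual device described
# above: prefer cuda, then mps, then cpu. Not torchchat code.
import torch

def resolve_fast_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(resolve_fast_device())
```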

At present torchchat supports the fast16, fast, bf16, fp16 and fp32
data types. The default data type for models is "fast16". The
"fast16" data type is a virtual data type that defaults to the best
16-bit floating point data type available on the selected device. The
"fast" data type is a virtual data type that defaults to the best
floating point data type available on the selected device. ("Best"
here meaning a combination of speed and accuracy.)
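
In the same spirit, a sketch of how "fast16" could resolve (assumed logic, not torchchat's actual code):

```
# Assumed resolution logic for the "fast16" virtual dtype described
# above: pick the best-supported 16-bit float for the device. Not
# torchchat code.
import torch

def resolve_fast16(device: str) -> torch.dtype:
    if device == "cuda" and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    # bfloat16 on mps depends on the macOS version; float16 is the
    # conservative 16-bit default elsewhere.
    return torch.float16
```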

## Quantization API
Quantization options are passed in json format either as a config file (see [cuda.json](../config/data/cuda.json) and [mobile.json](../config/data/mobile.json)) or a JSON string.

The expected JSON format is described below. Refer to the tables above for valid `bitwidth` and `groupsize` values.
Quantization options are passed in json format either as a config file
(see [cuda.json](../config/data/cuda.json) and
[mobile.json](../config/data/mobile.json)) or a JSON string.

The expected JSON format is described below. Refer to the tables above
for valid `bitwidth` and `groupsize` values.

| compression | JSON string |
|--|--|
@@ -57,6 +92,7 @@ See the available quantization schemes [here](https://github.com/pytorch/torchch
## Examples
We can mix and match weight quantization with embedding quantization.

[skip default]: begin
* Config file
```
--quantize quant_config.json
@@ -69,16 +105,22 @@ We can mix and match weight quantization with embedding quantization.
```
--quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}'
```
Quantization recipes can be applied in conjunction with any of the `chat`, `generate`, `browser` and `export` commands. Below are examples showcasing eager mode with `generate` and AOTI and ExecuTorch with `export`.
[skip default]: end
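
The `--quantize` payload shown above is plain JSON: each top-level key names a quantization scheme and its value carries that scheme's settings. A minimal parse, as a hedged sketch (not torchchat's own option handling):

```
# The --quantize argument is plain JSON: top-level keys name a scheme,
# values hold its settings. Hypothetical parsing helper, not torchchat code.
import json

spec = '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:a8w4dq": {"groupsize": 256}}'
for scheme, options in json.loads(spec).items():
    print(scheme, "->", options)   # e.g. embedding -> {'bitwidth': 4, 'groupsize': 32}
```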

Quantization recipes can be applied in conjunction with any of the
`chat`, `generate`, `browser` and `export` commands. Below are
examples showcasing eager mode with `generate` and AOTI and ExecuTorch
with `export`.

### Eager mode
```
python3 generate.py [--compile] llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```
### AOTI
```
python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int4": {"groupsize" : 256}}' --output-dso-path llama3.dso
python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int4": {"groupsize" : 256}}' --output-dso-path llama3.so
python3 generate.py llama3 --dso-path llama3.dso --prompt "Hello my name is"
python3 generate.py llama3 --dso-path llama3.so --prompt "Hello my name is"
```
### ExecuTorch
```
@@ -90,10 +132,12 @@ python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
## Model precision (dtype precision setting)
On top of quantizing models with integer quantization schemes mentioned above, models can be converted to lower bit floating point precision to reduce the memory bandwidth requirement and take advantage of higher density compute available. For example, many GPUs and some of the CPUs have good support for BFloat16 and Float16. This can be taken advantage of via `--dtype` arg as shown below.

[skip default]: begin
```
python3 generate.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
python3 export.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
```
[skip default]: end
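
Conceptually this is the familiar PyTorch dtype cast applied to the whole model; a small illustration (not torchchat's loader):

```
# What --dtype amounts to conceptually: casting parameters to a lower
# precision floating point type. Illustrative only, not torchchat code.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                  # float32 by default
model = model.to(torch.bfloat16)           # halves parameter memory
print(next(model.parameters()).dtype)      # torch.bfloat16
```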

Unlike gpt-fast, which uses bfloat16 as the default, torchchat uses the dtype "fast16" as the default. Torchchat will pick the appropriate 16-bit floating point type that is available and offers the best performance (for execution with ExecuTorch, and on macOS/ARM and Linux/x86 platforms). For macOS, support depends on the OS version: versions 14.0 and later support bfloat16, while earlier versions use float16, based on system support for these data types.

@@ -109,3 +153,5 @@ We invite contributors to submit established quantization schemes, with accuracy
- Quantization reference, describe options for --quantize parameter
- Show a table with performance/accuracy metrics
- Quantization support matrix? torchchat Quantization Support Matrix

[end default]: end