Add documentation and align project name #30

Merged · 8 commits · Jul 30, 2024
6 changes: 3 additions & 3 deletions docs/CONTRIBUTING.md
@@ -9,7 +9,7 @@ Contribution Guidelines
7. [Contributor Covenant Code of Conduct](#contributor-covenant-code-of-conduct)

## Create Pull Request
If you have improvements to ONNX Neural Compressor, send your pull requests for
If you have improvements to Neural Compressor, send your pull requests for
[review](https://github.com/onnx/neural-compressor/pulls).
If you are new to GitHub, view the pull request [How To](https://help.github.com/articles/using-pull-requests/).
### Step-by-Step guidelines
@@ -27,7 +27,7 @@ If you are new to GitHub, view the pull request [How To](https://help.github.com
Before sending your pull requests, follow the information below:

- Add unit tests in [Unit Tests](https://github.com/onnx/neural-compressor/tree/main/test) to cover the code you would like to contribute.
- ONNX Neural Compressor has adopted the [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin), you must agree to the terms of Developer Certificate of Origin by signing off each of your commits with `-s`, e.g. `git commit -s -m 'This is my commit message'`.
- Neural Compressor has adopted the [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin); you must agree to its terms by signing off each of your commits with `-s`, e.g. `git commit -s -m 'This is my commit message'`.

## Pull Request Template

@@ -43,7 +43,7 @@ See [PR template](/.github/pull_request_template.md)
- Third-party dependency license compatible

## Pull Request Status Checks Overview
ONNX Neural Compressor use [Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/pipelines/?view=azure-devops) for CI test.
Neural Compressor uses [Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/pipelines/?view=azure-devops) for CI testing.
Pipelines are generally deployed on an [Azure Cloud Instance](https://azure.microsoft.com/en-us/pricing/purchase-options/pay-as-you-go), e.g. Standard E16s v5.
| Test Name | Test Scope | Test Pass Criteria |
|-------------------------------|-----------------------------------------------|---------------------------|
75 changes: 75 additions & 0 deletions docs/autotune.md
@@ -0,0 +1,75 @@
AutoTune
========================================

1. [Overview](#overview)
2. [How it Works](#how-it-works)
3. [Working with Autotune](#working-with-autotune)
4. [Get Started](#get-started)


## Overview

Neural Compressor aims to help users quickly deploy low-precision models by leveraging popular compression techniques, such as post-training quantization and weight-only quantization algorithms. Despite the variety of these algorithms, finding the appropriate configuration for a model can be difficult and time-consuming. To address this, we built the `autotune` module, which identifies the best algorithm configuration for a model to achieve optimal performance under the given accuracy criteria. This module allows users to easily use predefined tuning recipes and customize the tuning space as needed.

## How it Works

The autotune module constructs the tuning space according to the pre-defined tuning set or the user's tuning set. It iterates over the tuning space, applies each configuration to the given float model, then records the evaluation result and compares it with the baseline. The tuning process stops when the exit policy is met.
The workflow is as below:

<a target="_blank" href="imgs/workflow.png">
<img src="imgs/workflow.png" alt="Workflow">
</a>
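
In pseudocode, the loop above can be sketched roughly as follows. This is an illustrative sketch only, not the library's implementation; `quantize_with` is a placeholder for applying one candidate configuration to the float model.

```python
# Illustrative sketch of the tuning loop (not the library's actual code).
def autotune_sketch(float_model, config_set, eval_fn, tolerable_loss=0.01, max_trials=100):
    baseline = eval_fn(float_model)  # evaluate the float model to get the baseline
    for trial, cfg in enumerate(config_set):
        if trial >= max_trials:  # exit policy: trial budget exhausted
            return None
        candidate = quantize_with(float_model, cfg)  # apply one candidate configuration
        if eval_fn(candidate) >= baseline * (1 - tolerable_loss):  # exit policy: accuracy goal met
            return candidate
    return None
```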


## Working with Autotune

The `autotune` API can be used across all algorithms supported by Neural Compressor. It accepts three primary arguments: `model_input`, `tune_config`, and `eval_fn`.

The `TuningConfig` class defines the tuning process, including the tuning space, order, and exit policy.

- Define the tuning space

Users can define the tuning space by setting `config_set` with an algorithm configuration or a set of configurations.
```python
# Use the default tuning space
config_set = config.get_woq_tuning_config()

# Customize the tuning space with one algorithm configuration
config_set = config.RTNConfig(weight_sym=False, weight_group_size=[32, 64])

# Customize the tuning space with two algorithm configurations
config_set = [
    config.RTNConfig(weight_sym=False, weight_group_size=32),
    config.GPTQConfig(weight_group_size=128, weight_sym=False),
]
```

- Define the tuning order

The tuning order determines how the process traverses the tuning space and samples configurations. Users can customize it by configuring the `sampler`. Currently, we provide the [`default_sampler`](https://github.com/onnx/neural-compressor/blob/main/onnx_neural_compressor/quantization/tuning.py#L210), which samples configurations sequentially, always in the same order.

- Define the exit policy

The exit policy includes two components: accuracy goal (`tolerable_loss`) and the allowed number of trials (`max_trials`). The tuning process will stop when either condition is met.

## Get Started
The example below demonstrates how to autotune an ONNX model over four `RTNConfig` configurations.

```python
from onnx_neural_compressor.quantization import config, tuning


def eval_fn(model) -> float:
    return ...


# `model` is the ONNX model to tune: a model path or an onnx.ModelProto object
tune_config = tuning.TuningConfig(
    config_set=config.RTNConfig(
        weight_sym=[False, True],
        weight_group_size=[32, 128],
    ),
    tolerable_loss=0.2,
    max_trials=10,
)
q_model = tuning.autotune(model, tune_config=tune_config, eval_fn=eval_fn)
```
4 changes: 2 additions & 2 deletions docs/calibration.md
Original file line number Diff line number Diff line change
@@ -10,15 +10,15 @@ Quantization proves beneficial in terms of reducing the memory and computational

## Calibration Algorithms

Currently, ONNX Neural Compressor supports three popular calibration algorithms:
Currently, Neural Compressor supports three popular calibration algorithms:

- MinMax: This method gets the maximum and minimum of input values as $α$ and $β$ [^1]. It preserves the entire range and is the simplest approach.

- Entropy: This method minimizes the KL divergence to reduce the information loss between full-precision and quantized data [^2]. Its primary focus is on preserving essential information.

- Percentile: This method only considers a specific percentage of values for calculating the range, ignoring the remainder which may contain outliers [^3]. It enhances resolution by excluding extreme values but still retaining noteworthy data.

> `kl` is used to represent the Entropy calibration algorithm in ONNX Neural Compressor.
> `kl` is used to represent the Entropy calibration algorithm in Neural Compressor.
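
As an illustration (not the library's implementation), the following NumPy sketch shows how MinMax and Percentile pick the range $α$/$β$ from calibration data; the Entropy/KL search is more involved and is omitted here.

```python
import numpy as np


def minmax_range(x: np.ndarray):
    # MinMax: keep the entire observed range
    return x.min(), x.max()


def percentile_range(x: np.ndarray, percentile: float = 99.99):
    # Percentile: ignore the extreme tails so outliers do not stretch the range
    lo = np.percentile(x, 100.0 - percentile)
    hi = np.percentile(x, percentile)
    return lo, hi


calibration_samples = np.random.randn(10000).astype(np.float32)
print(minmax_range(calibration_samples))      # widest range, sensitive to outliers
print(percentile_range(calibration_samples))  # slightly narrower, better resolution inside the range
```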

## Reference

2 changes: 1 addition & 1 deletion docs/design.md
@@ -1,6 +1,6 @@
Design
=====
ONNX Neural Compressor features an architecture and workflow that aids in increasing performance and faster deployments across infrastructures.
Neural Compressor features an architecture and workflow that help increase performance and speed up deployments across infrastructures.

## Architecture

Binary file modified docs/imgs/workflow.png
2 changes: 1 addition & 1 deletion docs/installation_guide.md
@@ -40,7 +40,7 @@ The following prerequisites and requirements must be satisfied for a successful
## System Requirements

### Validated Hardware Environment
#### ONNX Neural Compressor supports CPUs based on [Intel 64 architecture or compatible processors](https://en.wikipedia.org/wiki/X86-64):
#### Neural Compressor supports CPUs based on [Intel 64 architecture or compatible processors](https://en.wikipedia.org/wiki/X86-64):

* Intel Xeon Scalable processor (formerly Skylake, Cascade Lake, Cooper Lake, Ice Lake, and Sapphire Rapids)
* Intel Xeon CPU Max Series (formerly Sapphire Rapids HBM)
118 changes: 31 additions & 87 deletions docs/quantization.md
@@ -3,12 +3,14 @@ Quantization

1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
4. [Get Started](#get-started)
4.1 [Post Training Quantization](#post-training-quantization)
4.2 [Specify Quantization Rules](#specify-quantization-rules)
4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
5. [Examples](#examples)
3. [Get Started](#get-started)

3.1 [Post Training Quantization](#post-training-quantization)

3.2 [Specify Quantization Rules](#specify-quantization-rules)

3.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
4. [Examples](#examples)

## Quantization Introduction

@@ -18,19 +20,19 @@ Quantization is a very popular deep learning model optimization technique invent

`Affine quantization` and `Scale quantization` are two common range mapping techniques used in tensor conversion between different data types.

The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$.
The math equation is: $X_{int8} = round(X_{fp32} / Scale + ZeroPoint)$.

**Affine Quantization**

This is so-called `asymmetric quantization`, in which we map the min/max range in the float tensor to the integer range. Here int8 range is [-128, 127], uint8 range is [0, 255].
This is the so-called `Asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

where:

If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$.
If INT8 is specified, $Scale = (|X_{max} - X_{min}|) / 127$ and $ZeroPoint = -128 - X_{min} / Scale$.

or

If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$.
If UINT8 is specified, $Scale = (|X_{max} - X_{min}|) / 255$ and $ZeroPoint = - X_{min} / Scale$.
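
As a worked example of the affine formulas above, assuming a float range of [-1.5, 2.5] (illustrative only):

```python
# Affine (asymmetric) UINT8 quantization for the assumed range [-1.5, 2.5].
x_min, x_max = -1.5, 2.5
scale = abs(x_max - x_min) / 255        # Scale = |X_max - X_min| / 255 ≈ 0.0157
zero_point = round(-x_min / scale)      # ZeroPoint = -X_min / Scale ≈ 96


def quantize_uint8(x):
    q = round(x / scale + zero_point)
    return min(max(q, 0), 255)          # clamp to the uint8 range


print(quantize_uint8(x_min))  # -> 0
print(quantize_uint8(0.0))    # -> 96 (the zero point)
print(quantize_uint8(x_max))  # -> 255
```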

**Scale Quantization**

@@ -40,11 +42,11 @@ The math equation is like:

where:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.
If INT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 127$ and $ZeroPoint = 0$.

or

If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$.
If UINT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 255$ and $ZeroPoint = 128$.
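
A corresponding worked example of the symmetric INT8 formulas, using the same assumed range of [-1.5, 2.5] (illustrative only):

```python
# Scale (symmetric) INT8 quantization for the assumed range [-1.5, 2.5].
x_min, x_max = -1.5, 2.5
scale = max(abs(x_max), abs(x_min)) / 127   # Scale = max(|X_max|, |X_min|) / 127 ≈ 0.0197
zero_point = 0                              # symmetric INT8 keeps the zero point at 0


def quantize_int8(x):
    q = round(x / scale + zero_point)
    return min(max(q, -128), 127)           # clamp to the int8 range


print(quantize_int8(x_max))  # -> 127
print(quantize_int8(x_min))  # -> -76 (part of the negative range stays unused)
print(quantize_int8(0.0))    # -> 0
```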

*NOTE*

@@ -54,29 +56,25 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
| :-------------- |:---------------:| ---------------:|---------------:|
| ONNX Runtime | [MLAS](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/mlas) | Weight (int8) | Activation (uint8) |
| ONNX Runtime | [MLAS](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/mlas) | Activation (int8/uint8), Weight (int8/uint8) | Activation (int8/uint8), Weight (int8/uint8) |

> ***Note***
>
> Activation (uint8) + Weight (int8) is recommended for performance on x86-64 machines with AVX2 and AVX512 extensions.

#### Quantization Scheme
+ Symmetric Quantization
    + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
+ Asymmetric Quantization
    + uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)

### Quantization Approaches

Quantization has three different approaches:
Quantization has two approaches, both of which are inference-time optimizations:
1) post training dynamic quantization
2) post training static quantization

The first two approaches belong to optimization on inference. The last belongs to optimization during training. Currently. ONNX Runtime doesn't support the last one.

#### Post Training Dynamic Quantization

The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network is quantized as well with the min/max range collected during inference runtime.
The weights of the neural network are quantized offline from float32 into 8-bit format. The activations of the neural network are quantized as well, with the min/max range collected during inference runtime.

This approach is widely used in dynamic-length neural networks, such as NLP models.
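
As a minimal sketch, assuming the `DynamicQuantConfig` and `quantize` APIs shown in the Get Started section later in this document, a dynamic-quantization call could look like the snippet below; the model paths are placeholders.

```python
from onnx_neural_compressor.quantization import config, quantize

# Placeholder model paths; dynamic quantization needs no calibration data reader.
qconfig = config.DynamicQuantConfig()
quantize("model_fp32.onnx", "model_int8_dynamic.onnx", qconfig)
```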

@@ -86,42 +84,20 @@ Compared with `post training dynamic quantization`, the min/max range in weights

This is the main quantization approach users should try first, because it usually provides better performance than `post training dynamic quantization`.

## With or Without Accuracy Aware Tuning

Accuracy aware tuning is one of unique features provided by Neural Compressor, compared with other 3rd party model compression tools. This feature can be used to solve accuracy loss pain points brought by applying low precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates quantized graph, and evaluates the accuracy of this quantized graph. The optimal model will be yielded if the pre-defined accuracy goal is met.

Neural compressor also support to quantize all quantizable ops without accuracy tuning, user can decide whether to tune the model accuracy or not. Please refer to "Get Start" below.

### Working Flow

Currently `accuracy aware tuning` only supports `post training quantization`.

User could refer to below chart to understand the whole tuning flow.

<img src="./imgs/workflow.png" width=600 height=280 alt="accuracy aware tuning working flow">


## Get Started

The design philosophy of the quantization interface of ONNX Neural Compressor is easy-of-use. It requests user to provide `model`, `calibration dataloader`, and `evaluation function`. Those parameters would be used to quantize and tune the model.
The design philosophy of the quantization interface of Neural Compressor is ease of use. It asks the user to provide `model_input`, `model_output` and `quant_config`. These parameters are used to quantize and save the model.

`model` is the framework model location or the framework model object.
`model_input` is the ONNX model location or the ONNX model object.

`calibration dataloader` is used to load the data samples for calibration phase. In most cases, it could be the partial samples of the evaluation dataset.
`model_output` is the path to save the quantized ONNX model.

If a user needs to tune the model accuracy, the user should provide `evaluation function`.
`quant_config` is the configuration to do quantization.

`evaluation function` is a function used to evaluate model accuracy. It is a optional. This function should be same with how user makes evaluation on fp32 model, just taking `model` as input and returning a scalar value represented the evaluation accuracy.
Users can leverage Neural Compressor to directly generate a fully quantized model without accuracy validation. Currently, Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.

User could execute:
### Post Training Quantization

1. Without Accuracy Aware Tuning

This means user could leverage ONNX Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It's user responsibility to ensure the accuracy of the quantized model meets expectation. ONNX Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.

``` python
from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader
@@ -138,47 +114,15 @@ qconfig = config.StaticQuantConfig(calibration_data_reader) # or qconfig = Dyna
quantize(model, q_model_path, qconfig)
```
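
A fuller sketch of the same flow, with an assumed toy data reader (the input name, shapes, and model paths are placeholders, not the exact code from the file), could look like:

```python
import numpy as np

from onnx_neural_compressor import data_reader
from onnx_neural_compressor.quantization import config, quantize


class DataReader(data_reader.CalibrationDataReader):
    """Feeds a few random calibration samples; the input name and shape are placeholders."""

    def __init__(self):
        self.samples = [
            {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(10)
        ]
        self.iterator = iter(self.samples)

    def get_next(self):
        return next(self.iterator, None)

    def rewind(self):
        self.iterator = iter(self.samples)


calibration_data_reader = DataReader()
qconfig = config.StaticQuantConfig(calibration_data_reader)  # or config.DynamicQuantConfig()
quantize("model_fp32.onnx", "model_int8_static.onnx", qconfig)
```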

2. With Accuracy Aware Tuning

This means user could leverage the advance feature of ONNX Neural Compressor to tune out a best quantized model which has best accuracy and good performance. User should provide `eval_fn`.

``` python
from onnx_neural_compressor import data_reader
from onnx_neural_compressor.quantization import tuning, config

class DataReader(data_reader.CalibrationDataReader):
    def get_next(self): ...

    def rewind(self): ...


data_reader = DataReader()

# TuningConfig can accept:
# 1) a set of candidate configs like tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
# 2) one config with a set of candidate parameters like tuning.TuningConfig(config_set=[config.GPTQConfig(weight_group_size=[32, 64])])
# 3) our pre-defined config set like tuning.TuningConfig(config_set=config.get_woq_tuning_config())
custom_tune_config = tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
best_model = tuning.autotune(
    model_input=model,
    tune_config=custom_tune_config,
    eval_fn=eval_fn,
    calibration_data_reader=data_reader,
)
```

### Specify Quantization Rules
ONNX Neural Compressor support specify quantization rules by operator name. Users can use `set_local` API of configs to achieve the above purpose by below code:
Neural Compressor supports specifying quantization rules by operator name. Users can use the `set_local` API of configs to achieve this, as shown in the code below:

```python
fp32_config = config.GPTQConfig(weight_dtype="fp32")
quant_config = config.GPTQConfig(
    weight_bits=4,
    weight_dtype="int",
    weight_sym=False,
    weight_group_size=32,
op_config = config.StaticQuantConfig(per_channel=False)
quant_config = config.StaticQuantConfig(
    per_channel=True,
)
quant_config.set_local("/h.4/mlp/fc_out/MatMul", fp32_config)
quant_config.set_local("/h.4/mlp/fc_out/MatMul", op_config)
```


@@ -235,4 +179,4 @@ Neural-Compressor will quantized models with user-specified backend or detecting

## Examples

User could refer to [examples](../../examples/onnxrt) on how to quantize a new model.
Users can refer to [examples](../../examples) for how to quantize a new model.
2 changes: 1 addition & 1 deletion onnx_neural_compressor/__init__.py
@@ -11,4 +11,4 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ONNX Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNXRuntime Framework."""
"""Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX models."""
2 changes: 1 addition & 1 deletion onnx_neural_compressor/version.py
@@ -11,5 +11,5 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ONNX Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX."""
"""Neural Compressor: An open-source Python library supporting popular model compression techniques for ONNX models."""
__version__ = "1.0"