Q/DQ docs readability + 4bit info in onnx.proto (onnx#5937)
* Added line breaks in QuantizeLinear and DequantizeLinear operators
documentation for readability in markdown format.

* Added additional information about UINT4/INT4 in onnx.proto

---------

Signed-off-by: Gal Hubara Agam <[email protected]>
Co-authored-by: G. Ramalingam <[email protected]>
galagam and gramalingam authored Feb 16, 2024
1 parent 00c2f02 commit 6417cb0
Showing 8 changed files with 76 additions and 45 deletions.
27 changes: 17 additions & 10 deletions docs/Changelog.md
@@ -24770,11 +24770,12 @@ This version of the operator has been available since version 21 of the default
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.

`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.

#### Version

@@ -25374,14 +25375,20 @@ This version of the operator has been available since version 21 of the default
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]

For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.

There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
27 changes: 17 additions & 10 deletions docs/Operators.md
@@ -7338,11 +7338,12 @@ expect(node, inputs=[x], outputs=[y], name="test_depthtospace_example")
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.

`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.

#### Version

@@ -20238,14 +20239,20 @@ for quant_type_name in ["uint8", "int8"]:
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]

For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.

There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
27 changes: 17 additions & 10 deletions onnx/defs/quantization/defs.cc
@@ -11,14 +11,20 @@ static const char* QuantizeLinear_ver21_doc = R"DOC(
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]
For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
@@ -109,11 +115,12 @@ The linear dequantization operator. It consumes a quantized tensor, a scale, and
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.
`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.
)DOC";

ONNX_OPERATOR_SET_SCHEMA(
8 changes: 5 additions & 3 deletions onnx/onnx-ml.proto
@@ -531,8 +531,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -570,7 +570,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -602,6 +603,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
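The 4-bit packing convention these proto comments describe (first element in the 4 LSB, second in the 4 MSB, two's complement for INT4) can be sketched as follows (hypothetical helpers for illustration, not ONNX APIs):

```python
def pack_int4_pair(first, second):
    # Mask each value to 4 bits; first element goes in the low nibble (LSB),
    # second element in the high nibble (MSB). Works for int4 and uint4 inputs.
    return (first & 0xF) | ((second & 0xF) << 4)

def unpack_int4_pair(byte):
    # Decode back to signed int4 values in [-8, 7] via sign extension.
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

b = pack_int4_pair(-8, 7)  # -8 -> 0x8 in the low nibble, 7 -> 0x7 in the high nibble
# b == 0x78
assert unpack_int4_pair(b) == (-8, 7)
```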
8 changes: 5 additions & 3 deletions onnx/onnx-ml.proto3
@@ -531,8 +531,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -570,7 +570,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -602,6 +603,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.in.proto
@@ -528,8 +528,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -567,7 +567,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -599,6 +600,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.proto
@@ -529,8 +529,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -568,7 +568,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -600,6 +601,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.proto3
@@ -529,8 +529,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -568,7 +568,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -600,6 +601,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
