Q/DQ docs readability + 4bit info in onnx.proto (onnx#5937)
* Added line breaks in QuantizeLinear and DequantizeLinear operators
documentation for readability in markdown format.

* Added additional information about UINT4/INT4 in onnx.proto

---------

Signed-off-by: Gal Hubara Agam <[email protected]>
Co-authored-by: G. Ramalingam <[email protected]>
galagam and gramalingam authored Feb 16, 2024
1 parent 00c2f02 commit 6417cb0
Showing 8 changed files with 76 additions and 45 deletions.
27 changes: 17 additions & 10 deletions docs/Changelog.md
@@ -24770,11 +24770,12 @@ This version of the operator has been available since version 21 of the default
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.

`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.

#### Version

@@ -25374,14 +25375,20 @@ This version of the operator has been available since version 21 of the default
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]

For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.

There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
27 changes: 17 additions & 10 deletions docs/Operators.md
@@ -7338,11 +7338,12 @@ expect(node, inputs=[x], outputs=[y], name="test_depthtospace_example")
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.

`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.

#### Version

@@ -20238,14 +20239,20 @@ for quant_type_name in ["uint8", "int8"]:
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]

For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.

There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
27 changes: 17 additions & 10 deletions onnx/defs/quantization/defs.cc
@@ -11,14 +11,20 @@ static const char* QuantizeLinear_ver21_doc = R"DOC(
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is `y = saturate((x / y_scale) + y_zero_point)`.
-For saturation, it saturates according to:
-`uint8`: `[0, 255]`, `int8`: `[-128, 127]`, `uint16`: `[0, 65535]`, `int16`: `[-32768, 32767]`, `uint4`: `[0, 15]`,
-`int4`: `[-8, 7]`.
+Saturation is done according to:
+- uint16: [0, 65535]
+- int16: [-32768, 32767]
+- uint8: [0, 255]
+- int8: [-128, 127]
+- uint4: [0, 15]
+- int4: [-8, 7]
For `(x / y_scale)`, it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
-`y_zero_point` and `y` must have the same type.
-`y_zero_point` is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz, but
-the quantization formula remains the same for consistency, and the type of the attribute `y_zero_point` still
-determines the quantization type.
+`y_zero_point` and `y` must have the same type. `y_zero_point` is usually not used for quantization to float8 types, but the quantization
+formula remains the same for consistency, and the type of the attribute `y_zero_point` still determines the quantization type.
There are three supported quantization granularities, determined by the shape of `y_scale`.
In all cases, `y_zero_point` must have the same shape as `y_scale`.
- Per-tensor (per-layer) quantization: `y_scale` is a scalar.
@@ -109,11 +115,12 @@ The linear dequantization operator. It consumes a quantized tensor, a scale, and
full-precision tensor. The dequantization formula is `y = (x - x_zero_point) * x_scale`. `x_scale` and `x_zero_point`
must have the same shape, determining the quantization's granularity: a scalar for per-tensor/per-layer quantization,
a 1-D tensor for per-axis quantization, or have a rank identical to the input for blocked quantization.
-See QuantizeLinear for details on quantization granularity."
+See QuantizeLinear for details on quantization granularity.
`x_zero_point` and `x` must have the same type. `x` and `y` must have the same shape. In the case of dequantizing
`int32`, there's no zero point (zero point is supposed to be 0).
-`zero-point` is usually not used in the case of float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz quantization,
-but the dequantization formula remains the same for consistency, and `x_scale` still determines the output type.
+`zero-point` is usually not used in the case of float8 types quantization, but the dequantization formula remains the same
+for consistency, and `x_scale` still determines the output type.
)DOC";

ONNX_OPERATOR_SET_SCHEMA(
8 changes: 5 additions & 3 deletions onnx/onnx-ml.proto
@@ -531,8 +531,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -570,7 +570,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -602,6 +603,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
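The 4-bit packing convention these proto comments describe (first element in the 4 LSB, second in the 4 MSB, two's complement for INT4) can be sketched as follows (hypothetical helpers for illustration, not ONNX APIs):

```python
def pack_int4_pair(first, second):
    # Mask each value to 4 bits; first element goes in the low nibble (LSB),
    # second element in the high nibble (MSB). Works for int4 and uint4 inputs.
    return (first & 0xF) | ((second & 0xF) << 4)

def unpack_int4_pair(byte):
    # Decode back to signed int4 values in [-8, 7] via sign extension.
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

b = pack_int4_pair(-8, 7)  # -8 -> 0x8 in the low nibble, 7 -> 0x7 in the high nibble
# b == 0x78
assert unpack_int4_pair(b) == (-8, 7)
```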
8 changes: 5 additions & 3 deletions onnx/onnx-ml.proto3
@@ -531,8 +531,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -570,7 +570,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -602,6 +603,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.in.proto
@@ -528,8 +528,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -567,7 +567,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -599,6 +600,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.proto
@@ -529,8 +529,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -568,7 +568,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -600,6 +601,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
8 changes: 5 additions & 3 deletions onnx/onnx.proto3
@@ -529,8 +529,8 @@ message TensorProto {
FLOAT8E5M2FNUZ = 20; // follows IEEE 754, supports nan, not inf, mostly used for gradients, no negative zero

// 4-bit data-types
-UINT4 = 21;
-INT4 = 22;
+UINT4 = 21; // Unsigned integer in range [0, 15]
+INT4 = 22; // Signed integer in range [-8, 7], using two's-complement representation

// Future extensions go here.
}
@@ -568,7 +568,8 @@ message TensorProto {
// For int32, uint8, int8, uint16, int16, uint4, int4, bool, float8 and float16 values
// float16 and float8 values must be bit-wise converted to an uint16_t prior
// to writing to the buffer.
-// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer.
+// uint4 and int4 values must be packed to 4bitx2 prior to writing to the buffer, the first element is stored in
+// the 4 LSB and the second element is stored in the 4 MSB.
// When this field is present, the data_type field MUST be
// INT32, INT16, INT8, INT4, UINT16, UINT8, UINT4, BOOL, FLOAT16, BFLOAT16, FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ
repeated int32 int32_data = 5 [packed = true];
@@ -600,6 +601,7 @@ message TensorProto {
// Complex64 elements must be written as two consecutive FLOAT values, real component first.
// Complex128 elements must be written as two consecutive DOUBLE values, real component first.
// Boolean type MUST be written one byte per tensor element (00000001 for true, 00000000 for false).
+// uint4 and int4 values must be packed to 4bitx2, the first element is stored in the 4 LSB and the second element is stored in the 4 MSB.
//
// Note: the advantage of specific field rather than the raw_data field is
// that in some cases (e.g. int data), protobuf does a better packing via
