[RFC] Microscaling (MX) types in XLA #18085

sergey-kozub · 2024-10-09T11:02:57Z

sergey-kozub
Oct 9, 2024
Collaborator

RFC: Microscaling (MX) types in XLA

Overview

Open Compute Project (OCP) proposed Microscaling Formats (MX) Specification v1.0 in September 2023. It defines floating-point formats such as MXFP8, MXFP6 and MXFP4.

This RFC proposes to add new primitive types that will allow implementing MX formats in XLA (using tuple types).

Summary

MX floating point formats

The MX specification defines a way to represent block scaled data using three components: private elements, scaling factors and block size.

Concrete MX-compliant formats:

Format name	Element type	Scale type	Block size
MXFP8	E5M2 or E4M3	E8M0	32
MXFP6	E3M2 or E2M3	E8M0	32
MXFP4	E2M1	E8M0	32

The type names used in the MX specification correspond to the following XLA primitive types:

E5M2 -> F8E5M2
E4M3 -> F8E4M3FN
E3M2 -> F6E3M2FN (proposed)
E2M3 -> F6E2M3FN (proposed)
E2M1 -> F4E2M1FN (proposed)
E8M0 -> F8E8M0FNU (proposed)

The FN type suffix denotes types that can represent finite values only (F) and have a special NaN encoding (N).

Important to note: XLA has both F8E4M3 and F8E4M3FN primitive types, which have different semantics. The MX spec uses the latter for the MXFP8 format.

New XLA primitive types

The primitive types necessary to implement MX floating-point formats in XLA were added to LLVM APFloat¹ ² ³, to MLIR⁴ ⁵ ⁶ ⁷ and to JAX-ML⁸ ⁹, the latter also makes them available in NumPy. StableHLO RFC¹⁰ is in review.

F8E8M0FNU

8-bit floating point type with no sign bit, 8 bits exponent and no mantissa.

Maximum value: 2^127
Minimum value: 2^(-127)
Doesn't have infinity
Doesn't have zeros

This type cannot encode negative values or zeros, but it's only intended to be used for scaling factors where such values are not needed.

F6E3M2FN

6-bit floating point type with 1 sign bit, 3 bits exponent and 2 bits mantissa.

Maximum absolute value: 28
Minimum absolute value: 0.0625
Doesn't have infinity or NaN
Has positive and negative zeros

F6E2M3FN

6-bit floating point type with 1 sign bit, 2 bits exponent and 3 bits mantissa.

Maximum absolute value: 7.5
Minimum absolute value: 0.125
Doesn't have infinity or NaN
Has positive and negative zeros

F4E2M1FN

4-bit floating point type with 1 sign bit, 2 bits exponent and 1 bit mantissa.

Maximum absolute value: 6
Minimum absolute value: 0.5
Doesn't have infinity or NaN
Has positive and negative zeros

Composite types

MXFP8 data could be conceptually represented in XLA as a tuple type, e.g. (f8e4m3fn[…,N], f8e8m0fnu[…,N/32]), once the proposed primitive types are added. Similarly for MXFP6 and MXFP4.

A possible alternative is to add MXFP8 to the list of primitive types, but it's not primitive. This would add a lot of tech debt for no good reason.

Memory layout

F4E2M1FN tensors could be packed similarly to U4 tensors, where every byte stores two values. We can piggyback on the existing implementation for loads and stores.

6-bit types (F6E2M3FN and F6E3M2FN) could be packed in a way where every three bytes store four values. An alternative memory layout for sub-byte types is described in eXmY paper¹¹, which could be used for 6-bit types, but this is out of scope of this RFC.

F8E8M0FNU tensors do not require special memory layout, and could be implemented similarly to the other FP8 types.

Type conversion

HLO convert op

HLO convert op will be updated to support the new primitive types. The conversion will be done using RN (round to nearest even) rounding mode, similarly to the other XLA floating point type conversions.

XLA currently has two implementations of type conversion lowering: ElementalIrEmitter (used on CPU) and ExpandFloatOpsPass (used on GPU) - both will need to be updated.

During the conversion from IEEE-754 types, infinities and exponent overflows will be clamped to the maximum absolute value (preserving the sign).

The MX specification doesn't define how NaN values should be encoded for the types that don't support NaN (F4E2M1FN, F6E2M3FN, F6E3M2FN). Two possible options are either to use negative zero value, or to use maximum absolute value (preserving the sign).

When converting a signed type to the F8E8M0FNU type, the sign will be ignored. There's a shortcut for converting from FP32 or BF16 types in RZ (round to zero) rounding mode - right shift to keep the exponent only.

Dequantization

In order to convert an MX format (a tuple of element and scaling tensors) to a wider type (e.g. FP16), one should upcast and multiply the element tensor by the broadcasted scaling tensor.

MLIR example of converting MXFP8 format to FP16:

func.func @dequantize(%x : tensor<128xf8E5M2>, %x_scale : tensor<4xf8E8M0FNU>) -> tensor<128xf16> {
  // Step 1: Convert both tensors to FP16.
  %x_f16 = stablehlo.convert %x : (tensor<128xf8E5M2>) -> tensor<128xf16>
  %x_scale_f16 = stablehlo.convert %x : (tensor<4xf8E8M0FNU>) -> tensor<4xf16>

  // Step 2: Broadcast and reshape scale tensor.
  %x_scale_f16_broadcast = stablehlo.broadcast_in_dim %x_scale_f16, dims = [0] : (tensor<4xf16>) -> tensor<4x32xf16>
  %x_scale_f16_reshape = stablehlo.reshape %x_scale_f16_broadcast : (tensor<4x32xf16>) -> tensor<128xf16>

  // Step 3: Multiply the tensors.
  %result = stablehlo.multiply %x_f16, %x_scale_f16_reshape : tensor<128xf16>
  return %result : tensor<128xf16>
}

Quantization

Conversion of a floating point tensor to an MX format is described in the MX specification (section 6.3).

MLIR example of converting FP32 to MXFP8 format:

func.func @quantize(%x : tensor<128xf32>) -> (tensor<128xf8E4M3FN>, tensor<4xf8E8M0FNU>) {
  // Step 1: Reshape input into blocks.
  %x_blocks = stablehlo.reshape %x : (tensor<128xf32>) -> tensor<4x32xf32>

  // Step 2: Find absolute maximum (amax) for each block.
  %init = stablehlo.constant dense<0.0> : tensor<f32>
  %amax = "stablehlo.reduce"(%x_blocks, %init) ({
    ^bb0(%arg0: tensor<f32>, %arg1: tensor<f32>):
      %0 = stablehlo.abs %arg1 : tensor<f32>
      %1 = stablehlo.maximum %arg0, %0 : tensor<f32>
      stablehlo.return %1 : tensor<f32>
  }) {
    dimensions = array<i64: 1>
  } : (tensor<4x32xf32>, tensor<f32>) -> tensor<4xf32>

  // Step 3: Calculate largest power-of-two less than or equal to amax.
  // This can be done by zeroing out the mantissa bits.
  %amax_i32 = stablehlo.bitcast_convert %amax : (tensor<4xf32>) -> tensor<4xi32>
  %exponent_mask = stablehlo.constant dense<0x7F800000> : tensor<i32>
  %mask_broadcast = stablehlo.broadcast_in_dim %exponent_mask, dims = [] : (tensor<i32>) -> tensor<4xi32>
  %exponent_bits = stablehlo.and %amax_i32, %mask_broadcast : tensor<4xi32>
  %amax_rz = stablehlo.bitcast_convert %amax : (tensor<4xi32>) -> tensor<4xf32>

  // Step 4: Divide by the largest power-of-two representable by the type.
  // This can also be done by multiplying by the reciprocal.
  %e4m3fn_emax = stablehlo.constant dense<256.0> : tensor<f32>
  %emax_broadcast = stablehlo.broadcast_in_dim %e4m3fn_emax, dims = [] : (tensor<f32>) -> tensor<4xf32>
  %scale = stablehlo.divide %amax_rz, %emax_broadcast : tensor<4xf32>

  // Step 5: Divide the block elements by the calculated scale.
  %scale_broadcast = stablehlo.broadcast_in_dim %scale, dims = [0] : (tensor<4xf32>) -> tensor<4x32xf32>
  %x_blocks_scaled = stablehlo.divide %x_blocks, %scale_broadcast : tensor<4x32xf32>

  // Step 6: Reshape and convert to the quantization type.
  %x_scaled = stablehlo.reshape %x_blocks_scaled : (tensor<4x32xf32>) -> tensor<128xf32>
  %x_f8 = stablehlo.convert %x_scaled : (tensor<128xf32>) -> tensor<128xf8E4M3FN>

  // Step 7: Convert scale to the scaling type.
  // This can also be done by bit shift, as the mantissa bits are zero.
  %scale_f8 = stablehlo.convert %scale : (tensor<4xf32>) -> tensor<4xf8E8M0FNU>
  return (%x_f8, %scale_f8) : (tensor<128xf8E4M3FN>, tensor<4xf8E8M0FNU>)
}

The quantized tensor may contain non-finite values due to conversion overflow in MXFP8 - these should be replaced by the absolute maximum value (as saturating conversion is not available in StableHLO). This doesn't happen with MXFP6 and MXFP4, as their element types are finite-only.

HLO ops

Arithmetic ops

We can support the arithmetic ops on new primitive types in a way similar to FP8 by using the FloatNormalization compiler pass to upcast the smaller type to FP16, perform the operation and downcast back. Adjacent convert ops should be eliminated by the SimplifyFPConversions compiler pass.

As the MX formats will be represented by tuple types in XLA, doing any arithmetics on such composite types would require explicit quantization and dequantization around the arithmetic ops.

Scaled dot op

A scaled dot op (doesn't exist in HLO as of today) could accept a block scaled format for either LHS or RHS input (or both). This means it would have three or four tensor parameters instead of two.

I propose to use a custom call for representing a scaled dot op, until it is no longer experimental - at that point we could introduce a new HLO op.

Analysis

Microscaling formats

For MXFP8, a block of data takes 33 bytes of memory, a 48% reduction in size compared to FP16, and a 3% overhead compared to FP8. Q/DQ mean relative error is ~4.7% with F8E5M2 and ~2.4% with F8E4M3FN types.

For MXFP6, a block of data takes 25 bytes of memory, a 61% reduction in size compared to FP16, and a 22% reduction compared to FP8. Q/DQ mean relative error is ~5.0% with both F6E3M2FN and F6E2M3FN types.

For MXFP4, a block of data takes 17 bytes of memory, a 73% reduction in size compared to FP16, and a 47% reduction compared to FP8. Q/DQ mean relative error is ~16%.

The Q/DQ mean relative error was calculated by converting an uniform distribution of FP32 values to the MX format and back (quantize followed by dequantize, as specified above) and aggregating the absolute delta divided by the absolute value. Different value distributions could yield different results.

Comparison to FP8

FP8 tensors in XLA have an accompanying scaling factor scalar¹², which is used to dequantize the data (implicitly in the case of dot operation). This has a few implications:

If the input data has outliers (e.g. a normal distribution), then the majority of the values will have their quantized accuracy reduced compared to a block scaled format, where such outliers would only affect the accuracy of their block.
To compute the tensor scaling factor, a tensor-wide reduction is necessary - this results in an extra collective operation, which could be slow in multi-host setups. With block scaled formats this could be avoided.

StableHLO

StableHLO RFC¹⁰ proposes adding the MX floating point primitive types, this is a requirement for adding these primitive types to XLA.

The MX floating point formats will be represented similarly in StableHLO - using the tuple composite types. The existing quantization types in StableHLO are integer and cannot represent MX block formats.

LLVM PR#95392 [APFloat] Add APFloat support for FP4 data type ↩
LLVM PR#94735 [APFloat] Add APFloat support for FP6 data types ↩
LLVM PR#107127 [APFloat] Add APFloat support for E8M0 type ↩
LLVM PR#108877 [MLIR] Add f4E2M1FN type ↩
LLVM PR#107999 [MLIR] Add f6E2M3FN type ↩
LLVM PR#105573 [MLIR] Add f6E3M2FN type ↩
LLVM PR#111028 [MLIR] Add f8E8M0FNU type ↩
JAX-ML PR#181 Add sub-byte data types: float4_e2m1fn, float6_e2m3fn, float6_e3m2fn ↩
JAX-ML PR#166 Add float8_e8m0_fnu (E8M0) OCP MX scale format ↩
StableHLO PR#2581 [RFC] Microscaling data types (f4E2M1FN, f6E2M3FN, f6E3M2FN, f8E8M0FNU) ↩ ↩²
eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization ↩
RFC: FP8 in XLA ↩

GleasonK · 2024-10-09T14:48:36Z

GleasonK
Oct 9, 2024
Maintainer

As the MX formats will be represented by tuple types in XLA, doing any arithmetics on such composite types would require explicit quantization and dequantization around the arithmetic ops.

Does this just mean that there will be no MX datatypes in HLO since they'll be represented by the underlying math? Or would this proposal include specifying arithmetic ops on tuple types? I.e. at what point is the tuple needed?

The MX floating point formats will be represented similarly in StableHLO - using the tuple composite types.

Would like to hear more on this / see an example of an MX datatype program. Tuples have very little use in StableHLO today and are mostly deprecated as a relic of HLO.

Also cc @sdasgup3 to see if this numeric format should have any ties to the quantized type. Seems plausible that they could be. If XLA will require only the underlying math, we could handle this similarly to existing quantization where StableHLO has a more static representation and a pass to decompose to the underlying Q/DQ math.

4 replies

sergey-kozub Oct 9, 2024
Collaborator Author

Does this just mean that there will be no MX datatypes?

No, it doesn't mean that. We could introduce the "MXFP8" datatype at a later stage. However, my point here is that we shouldn't add MX formats to primitive types, we could have a concept of composite types instead.

Would this proposal include specifying arithmetic ops on tuple types?

No, I don't intend to do that. This RFC is about adding primitive types - this would enable representing MX formats using tuples.

The main point is to enable experimenting with MX formats in JAX and build models based on e.g. MXFP8 or MXFP4. If this proves to be useful, we could build the StableHLO infrastructure around it - such as adding composite MX types (instead of tuples), adding block scaled dot op (instead of custom call), adding block scaling support to the quantization abstraction, etc.

Tuples have very little use in StableHLO today.

This is actually a good example how tuples can be useful in representing non-uniform data - an MX format is essentially a tuple of (elements, scale) tensors with different shapes. However, tuples lack validations, so a specific type could be a better fit.

Would like to hear more on this / see an example of an MX datatype program.

One example would be a LLM where some (not all) layers could have weights encoded as MXFP4. Some research shows that the model quality degradation could be tolerable (see the eXmY paper, table 3). The only operations needed for such a model are quantize, dequantize and block scaled dot (which can be decomposed into dequantize followed by a generic dot in the same fusion).

GleasonK Oct 11, 2024
Maintainer

Most of this makes sense! And agree that these would be better represented as explicit MLIR types for validation.

Would like to hear more on this / see an example of an MX datatype program.

I guess I meant I'd like to see what a program would look like at the IR level, but I think that you've done that and I misunderstood:

return (%x_f8, %scale_f8) : (tensor<128xf8E4M3FN>, tensor<4xf8E8M0FNU>)

IIUC these two values together indicate a pseudo-MXFP8 type, and we're pushing it into the framework to use these values to do "MX style computations" properly. This makes a lot of sense, especially if the goal is enabling the exploration. Thanks!

What I was trying to clarify is if this RFC was additionally proposing IR expressivity for:

add %mxfp8_A, %mxfp8_B : tuple<tensor<128xf8E4M3FN>, tensor<4xf8E8M0FNU>>

And it sounds like that's not the case, at least not yet, and that SGTM!

balancap Oct 17, 2024

Please correct me if I am misunderstanding: mu understanding is this RFC would only extend what is already existing for FP8 support (even though not fully formalized as a scaled_dot_general custom op).

i.e. more explicitely, at the moment, when writing this piece of FP8 code in JAX

def matmul_fn_with_scale(a_fp8, b_fp8, a_scale, b_scale, d_scale):
    # Dequantize x and y
    a_dqt = a_fp8.astype(dqt_dtype) * a_scale.astype(dqt_dtype)
    b_dqt = b_fp8.astype(dqt_dtype) * b_scale.astype(dqt_dtype)
    
    # Do the matmul (NOTE: adding transpose to reduce on last axis).
    d_dqt = jax.lax.dot(a_dqt, b_dqt.T)
    
    # Rescale & clamp to -max/+max FP8 E4M3 values.
    d_dqt = d_dqt * d_scale.astype(dqt_dtype)
    # NOTE: clamping is NOT optional for proper pattern matching!
    d_dqt = jax.lax.clamp(dqt_dtype(-e4m3_max), d_dqt, dqt_dtype(e4m3_max))
    # (Re)Quantize the scaled matmul output.
    return d_dqt.astype(jnp.float8_e4m3fn)

it generates the following fused HLO custom op call:

ENTRY %main.25 (Arg_0.1.0: f8e4m3fn[32,64], Arg_1.2.0: f8e4m3fn[128,64], Arg_2.3.0: f32[], Arg_3.4.0: f32[], Arg_4.5.0: f32[]) -> f8e4m3fn[32,128] {
  %constant_1 = f32[] constant(1)
  %Arg_4.5.0 = f32[] parameter(4)
  %Arg_3.4.0 = f32[] parameter(3)
  %Arg_2.3.0 = f32[] parameter(2)
  %Arg_1.2.0 = f8e4m3fn[128,64]{1,0} parameter(1)
  %Arg_0.1.0 = f8e4m3fn[32,64]{1,0} parameter(0)
  %cublas-gemm.clone.1.0 = (f8e4m3fn[32,128]{1,0}, s8[33554432]{0}) custom-call(f8e4m3fn[32,64]{1,0} %Arg_0.1.0, f8e4m3fn[128,64]{1,0} %Arg_1.2.0, f32[] %Arg_2.3.0, f32[] %Arg_3.4.0, f32[] %constant_1, /*index=5*/f32[] %Arg_4.5.0), custom_call_target="__cublas$lt$matmul$f8"
  ROOT %get-tuple-element.1 = f8e4m3fn[32,128]{1,0} get-tuple-element((f8e4m3fn[32,128]{1,0}, s8[33554432]{0}) %cublas-gemm.clone.1.0), index=0
}

which despite being called __cublas$lt$matmul$f8, is actually in practice a scaled dot product operation.

In an MX version of the example above, a_scale/b_scale/c_scale would be replaced by E8M0 tensors (with proper shapes), a_fp8/b_fp8 would FP4/FP6/FP8 tensors and the HLO custom call would be to the appropriate MX matmul function (platform specific).

Note: FP8 documentation for reproduction https://github.com/graphcore-research/jax-scalify/blob/main/docs/JAX%20FP8%20matmul%20tutorial.ipynb

sergey-kozub Oct 17, 2024
Collaborator Author

In an MX version of the example above, a_scale/b_scale/c_scale would be replaced by E8M0 tensors (with proper shapes), a_fp8/b_fp8 would FP4/FP6/FP8 tensors and the HLO custom call would be to the appropriate MX matmul function (platform specific).

Yes, this is how it is expected to work. Note that multiple backends may support block scaled matmuls in the future (e.g. cudnn, triton), that could allow fusing it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Microscaling (MX) types in XLA #18085

{{title}}

Replies: 1 comment 4 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

[RFC] Microscaling (MX) types in XLA #18085

sergey-kozub Oct 9, 2024 Collaborator

RFC: Microscaling (MX) types in XLA

Overview

Summary

MX floating point formats

New XLA primitive types

F8E8M0FNU

F6E3M2FN

F6E2M3FN

F4E2M1FN

Composite types

Memory layout

Type conversion

HLO convert op

Dequantization

Quantization

HLO ops

Arithmetic ops

Scaled dot op

Analysis

Microscaling formats

Comparison to FP8

StableHLO

Footnotes

Replies: 1 comment · 4 replies

GleasonK Oct 9, 2024 Maintainer

sergey-kozub Oct 9, 2024 Collaborator Author

GleasonK Oct 11, 2024 Maintainer

balancap Oct 17, 2024

sergey-kozub Oct 17, 2024 Collaborator Author

sergey-kozub
Oct 9, 2024
Collaborator

Replies: 1 comment 4 replies

GleasonK
Oct 9, 2024
Maintainer

sergey-kozub Oct 9, 2024
Collaborator Author

GleasonK Oct 11, 2024
Maintainer

sergey-kozub Oct 17, 2024
Collaborator Author