I've been researching through the MIGraphX code base and am using this discussion as a proposal.
Relevant high level ops
MIGraphX::Round + Clip + Convert, or Tosa::Cast
RoundOp implementation, here
ConvertOp implementation, here
ClipOp implementation, here
The similarity between the two abstractions is that MIGraphX's ClipOp and TOSA's ClampOp are equivalent.
The difference between the two abstractions is that Tosa::Cast assumes the clipping range to be the natural range of the type; for int8, it clips to the range [-128, 127]. Therefore, the gap between TOSA and MIGraphX is a new representation of:
MIGraphX::ClipOp -> mlir::tosa::ClampOp
MIGraphX::RoundOp -> new op or linalg
MIGraphX::ConvertOp -> new op or linalg
mlir::tosa::CastOp would not be usable because of the lack of similar mapping ops. In addition, we may need to double check whether Clip + Round + Convert get rewritten to a tosa cast when the sequence fits.
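To make the difference concrete, here is a minimal numpy sketch (assuming round-to-nearest-even; this is not the TOSA or MIGraphX implementation) of how a natural-range saturating cast compares with an explicit clip + round + convert sequence:

```python
import numpy as np

def clip_round_convert(x, lo, hi):
    # Explicit MIGraphX-style sequence: clip to [lo, hi], round, convert to int8.
    return np.rint(np.clip(x, lo, hi)).astype(np.int8)

def natural_range_cast_int8(x):
    # tosa.cast-style behavior as described above: the clip range is the
    # natural int8 range, implied by the type rather than carried as an attribute.
    return np.rint(np.clip(x, -128, 127)).astype(np.int8)

x = np.array([-300.4, -1.5, 0.49, 200.7], dtype=np.float32)
print(clip_round_convert(x, -128, 127))  # matches the natural-range cast
print(clip_round_convert(x, 0, 127))     # a non-natural clip range gives different results
print(natural_range_cast_int8(x))
```

When the clip bounds are exactly the natural range of the target type, the explicit sequence collapses into the cast, which is presumably the only case where a rewrite to tosa cast is valid.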
Quantize Linear Op
Reference implementation: here.
This is MIGraphX's existing view of the math behind quantization: a division, followed by a multiplication, then a clamp.
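For orientation, a minimal numpy sketch of the usual QuantizeLinear math (scale division, rounding, zero-point addition, int8 clamp); the names and the int8 range are assumptions here, and the exact ordering and fusion of these steps is one of the open questions noted below:

```python
import numpy as np

def quantize_linear(x, scale, zero_point):
    # Sketch only: scale, round, add the zero point, clamp to the int8 range.
    # The actual MIGraphX reference and GPU rewrite may order or fuse these
    # steps differently, which is the source of the mismatches listed below.
    q = np.rint(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

x = np.array([-1.3, 0.0, 0.7, 5.2], dtype=np.float32)
print(quantize_linear(x, scale=0.05, zero_point=10))
```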
The corresponding gpu rewrite is here.
The result of the rewrite is roughly:
The problematic parts that need additional work include:
math mismatch: whether the addition is performed first or the division (multiplication) is performed first (see the sketch below)
type mismatch: the CPU version uses int64 while the GPU version uses floating point
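As a concrete illustration of the math mismatch (the values and formulas here are made up for demonstration, not taken from either implementation):

```python
import numpy as np

scale, zero_point = 1.0, 3
x = np.float32(2.5)  # sits exactly on a rounding boundary

# Ordering A: scale first, round, then add the zero point.
q_a = np.rint(x / scale) + zero_point   # rint(2.5) = 2 (round half to even) -> 5

# Ordering B: scale, add the zero point, then round.
q_b = np.rint(x / scale + zero_point)   # rint(5.5) = 6 -> 6

print(int(q_a), int(q_b))  # the two orderings disagree by one step
```

A similar off-by-one can come from the int64 vs floating-point difference, which is why agreeing on a single formula matters.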
The follow-up item for this op is to hold a three-way meeting, pulling in the CK team, to make sure everyone uses consistent math and expectations are aligned.
Quant Convolution and Gemm
These two ops point at a graph-level trick that allows the convolution to be executed in lower precision. In particular, the simplify_qdq pass performs the transformation that converts a conventional convolution into:
quantized filter and input
quantized conv
dequantized output
The fact that the subgraph follows the pattern of a low-precision convolution followed by a dequantization makes it hard to fit into our target pattern: an int8 convolution followed by a quantization.
However, it looks like the non-quantized version of convolution can still be applied to this flow, which means:
Precondition: fp16/fp32 to int8 quantized convolution
First op: conv with int8 input and int32 output
Second op: int32 dequantized to fp16 or fp32
Implicitly, this calls for the requirement that the rocMLIR team should not discard the "regular" int8 -> int32 convolution prototype, because it can still be useful in quantized convolution scenarios.
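A minimal numpy sketch of that flow (1-D convolution, symmetric per-tensor scales with zero points of 0; all names and values are illustrative, not the MIGraphX or rocMLIR API):

```python
import numpy as np

x_scale, w_scale = 0.1, 0.02                           # assumed per-tensor scales
x_q = np.array([12, -7, 30, 5, -128], dtype=np.int8)   # quantized input
w_q = np.array([25, -3, 14], dtype=np.int8)            # quantized filter

# First op: "regular" int8 -> int32 convolution (int32 accumulation, no requantize).
acc = np.convolve(x_q.astype(np.int32), w_q.astype(np.int32),
                  mode="valid").astype(np.int32)

# Second op: dequantize the int32 accumulator back to fp32.
y = acc.astype(np.float32) * np.float32(x_scale * w_scale)
print(acc)
print(y)
```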
Our Integration approach
Alternative 1: high level op lowering
Directly use the high-level QuantizeLinear op lowering: create an equivalent MIGraphX representation of it, then lower the entire piece in rocMLIR.
Alternative 2:
Add MIGraphX Clip + Round + Convert lowering. The quantized implementation then "naturally" comes out as a lowering result.
Comparison:
Pros of alternative 1:
We have much more control over the implementation details; the client "has to" conform to our implementation.
We get better consistency between different clients: the MIOpen and MIGraphX integrations will take exactly the same approach.
We can create a precise 1:1 mapping between the high-level op and the client's representation.
Pros of alternative 2:
We fill in only the missing pieces and leverage the existing MIGraphX pipeline as much as possible.
For the convert and round ops, we don't yet know how to address the duplication.
Better extensibility of the MIGraphX integration.
Weighing the pros and cons, I have decided to start with the first approach. In addition, we need to initiate a meeting to enforce convergence of the implementations across the different clients.