I've been researching through the MIGraphX code base and am using this discussion as a proposal.
Relevant high level ops
MIGraphX::Round + Clip + Convert, or Tosa::Cast
RoundOp implementation, here
ConvertOp implementation, here
ClipOp implementation, here
The similarity between the two abstractions is that MIGraphX's ClipOp and TOSA's ClampOp are equivalent.
The difference between the two abstractions is that Tosa::Cast assumes the clipping range to be the natural range of the type; for int8, it clips to the range [-128, 127]. Therefore, the gap between TOSA and MIGraphX is a new representation of:
MIGraphX::ClipOp -> mlir::tosa::ClampOp
MIGraphX::RoundOp -> new op or linalg
MIGraphX::ConvertOp -> new op or linalg
mlir::tosa::CastOp would not be usable because of the lack of similar mapping ops. In addition, we may need to double check whether Clip + Round + Convert get rewritten to a tosa cast when the sequence fits.
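To make the difference concrete, here is a minimal numpy sketch (assuming round-to-nearest-even; this is not the TOSA or MIGraphX implementation) of how a natural-range saturating cast compares with an explicit clip + round + convert sequence:

```python
import numpy as np

def clip_round_convert(x, lo, hi):
    # Explicit MIGraphX-style sequence: clip to [lo, hi], round, convert to int8.
    return np.rint(np.clip(x, lo, hi)).astype(np.int8)

def natural_range_cast_int8(x):
    # tosa.cast-style behavior as described above: the clip range is the
    # natural int8 range, implied by the type rather than carried as an attribute.
    return np.rint(np.clip(x, -128, 127)).astype(np.int8)

x = np.array([-300.4, -1.5, 0.49, 200.7], dtype=np.float32)
print(clip_round_convert(x, -128, 127))  # matches the natural-range cast
print(clip_round_convert(x, 0, 127))     # a non-natural clip range gives different results
print(natural_range_cast_int8(x))
```

When the clip bounds are exactly the natural range of the target type, the explicit sequence collapses into the cast, which is presumably the only case where a rewrite to tosa cast is valid.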
Quantize Linear Op
Reference implementation: here.
This is MIGraphX's existing view of the math behind quantization: a division, followed by a multiplication, then a clamp.
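For orientation, a minimal numpy sketch of the usual QuantizeLinear math (scale division, rounding, zero-point addition, int8 clamp); the names and the int8 range are assumptions here, and the exact ordering and fusion of these steps is one of the open questions noted below:

```python
import numpy as np

def quantize_linear(x, scale, zero_point):
    # Sketch only: scale, round, add the zero point, clamp to the int8 range.
    # The actual MIGraphX reference and GPU rewrite may order or fuse these
    # steps differently, which is the source of the mismatches listed below.
    q = np.rint(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

x = np.array([-1.3, 0.0, 0.7, 5.2], dtype=np.float32)
print(quantize_linear(x, scale=0.05, zero_point=10))
```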
The corresponding gpu rewrite is here.
The result of the rewrite is roughly:
The problematic parts that need additional work include:
math mismatch: whether the addition is performed first or the division (multiplication) is performed first (see the sketch below)
type mismatch: the CPU version uses int64 while the GPU version uses floating point
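As a concrete illustration of the math mismatch (the values and formulas here are made up for demonstration, not taken from either implementation):

```python
import numpy as np

scale, zero_point = 1.0, 3
x = np.float32(2.5)  # sits exactly on a rounding boundary

# Ordering A: scale first, round, then add the zero point.
q_a = np.rint(x / scale) + zero_point   # rint(2.5) = 2 (round half to even) -> 5

# Ordering B: scale, add the zero point, then round.
q_b = np.rint(x / scale + zero_point)   # rint(5.5) = 6 -> 6

print(int(q_a), int(q_b))  # the two orderings disagree by one step
```

A similar off-by-one can come from the int64 vs floating-point difference, which is why agreeing on a single formula matters.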
The follow-up item for this op is to hold a three-way meeting, pulling in the CK team, to make sure everyone uses consistent math and expectations are aligned.
Quant Convolution and Gemm
These two ops point at a graph-level trick that allows the convolution to be executed in lower precision. In particular, the simplify_qdq pass performs the transformation that converts a conventional convolution into:
quantized filter and input
quantized conv
dequantized output
The fact that the subgraph follows the pattern of a low-precision convolution followed by a dequantization makes it hard to fit into our target pattern: an int8 convolution followed by a quantization.
However, it looks like the non-quantized version of convolution can still be applied to this flow, which means:
Precondition: fp16/fp32 to int8 quantized convolution
First op: conv with int8 input and int32 output
Second op: int32 dequantized to fp16 or fp32
Implicitly, this calls for the requirement that the rocMLIR team should not discard the "regular" int8 -> int32 convolution prototype, because it can still be useful in quantized convolution scenarios.
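A minimal numpy sketch of that flow (1-D convolution, symmetric per-tensor scales with zero points of 0; all names and values are illustrative, not the MIGraphX or rocMLIR API):

```python
import numpy as np

x_scale, w_scale = 0.1, 0.02                           # assumed per-tensor scales
x_q = np.array([12, -7, 30, 5, -128], dtype=np.int8)   # quantized input
w_q = np.array([25, -3, 14], dtype=np.int8)            # quantized filter

# First op: "regular" int8 -> int32 convolution (int32 accumulation, no requantize).
acc = np.convolve(x_q.astype(np.int32), w_q.astype(np.int32),
                  mode="valid").astype(np.int32)

# Second op: dequantize the int32 accumulator back to fp32.
y = acc.astype(np.float32) * np.float32(x_scale * w_scale)
print(acc)
print(y)
```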
Our Integration approach
Alternative 1: high level op lowering
Directly use the high-level QuantizeLinear op lowering: create an equivalent MIGraphX representation of it, then lower the entire piece in rocMLIR.
Alternative 2:
Add MIGraphX Clip + Round + Convert lowering. The quantized implementation then "naturally" comes out as a lowering result.
Comparison:
Pros of alternative 1:
We have much more control over the implementation details; the client "has to" conform to our implementation.
We get better consistency between different clients: the MIOpen and MIGraphX integrations will take exactly the same approach.
We can create a precise 1:1 mapping between the high-level op and the client's representation.
Pros of alternative 2:
We fill in only the missing pieces and leverage the existing MIGraphX pipeline as much as possible.
For the convert and round ops, we don't yet know how to address the duplication.
Better extensibility of the MIGraphX integration.
Weighing the pros and cons, I have decided to start with the first approach. In addition, we need to initiate a meeting to enforce convergence of the implementations across the different clients.