Add DynamicQuantizeLinear op #2489
Conversation
a824b76 to 40c5b6c (Compare)
This build is OK for merge ✅
🔴 distilgpt2_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
40c5b6c to 76ab520 (Compare)
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
namespace onnx {
(If you edit this file again, please cut and paste the reference information as comments, as in some recent operators, e.g. qlinearconcat.)
My basic design question: the operator reference says, "A Function to fuse calculation for Scale, Zero Point and FP32->8Bit conversion of FP32 Input data."
But we are not fusing any calculations here; any thoughts on it? Thanks.
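For reference, the computation the spec describes (scale, zero point, and the float-to-uint8 conversion) works out to roughly the following. This is a self-contained C++ sketch of the formula from the ONNX documentation, not MIGraphX code; the function name and the std::vector interface are purely illustrative:

// Standalone sketch of the ONNX DynamicQuantizeLinear reference formula
// (uint8 output). Illustrative only; not the MIGraphX implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void dynamic_quantize_linear(const std::vector<float>& x,
                             std::vector<uint8_t>& y,
                             float& y_scale,
                             uint8_t& y_zero_point)
{
    const float qmin = 0.0f;
    const float qmax = 255.0f;
    // The spec adjusts the range so that 0 is always exactly representable.
    const float x_min = std::min(0.0f, *std::min_element(x.begin(), x.end()));
    const float x_max = std::max(0.0f, *std::max_element(x.begin(), x.end()));
    y_scale = (x_max - x_min) / (qmax - qmin);
    // Zero point: saturate, round, then convert to uint8.
    // (The all-zero input corner case, where y_scale is 0, is not handled here.)
    const float zp = std::clamp(qmin - x_min / y_scale, qmin, qmax);
    y_zero_point   = static_cast<uint8_t>(std::nearbyint(zp));
    // Quantize the data: y = saturate(round(x / y_scale) + y_zero_point).
    y.resize(x.size());
    std::transform(x.begin(), x.end(), y.begin(), [&](float v) {
        const float q = std::nearbyint(v / y_scale) + static_cast<float>(y_zero_point);
        return static_cast<uint8_t>(std::clamp(q, qmin, qmax));
    });
}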
I've checked the compiled GPU version of dynamicquantizelinear_2d_test.onnx.
From that it seems there are a lot of fused instructions in the kernels:
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}, target_id=0
@1 = hip::hip_allocate_memory[shape=int8_type, {96}, {1},id=main:scratch] -> int8_type, {96}, {1}, target_id=0
@2 = load[offset=0,end=48](@1) -> float_type, {3, 4}, {4, 1}, target_id=0
x = @param:x -> float_type, {3, 4}, {4, 1}, target_id=0
@4 = hip::copy_to_gpu(x,@2) -> float_type, {3, 4}, {4, 1}, target_id=0
@5 = reshape_lazy[dims={12}](@4) -> float_type, {12}, {1}, target_id=0
@6 = load[offset=48,end=60](@1) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@7 = gpu::topk[k=1,axis=0,largest=0](@5,@6) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@8 = load[offset=64,end=68](@1) -> float_type, {1}, {1}, target_id=0
@9 = get_tuple_elem[index=0](@7) -> float_type, {1}, {1}, target_id=0
@10 = gpu::code_object[code_object=4536,symbol_name=min_kernel,global=1,local=1024,](@9,@8) -> float_type, {1}, {1}, target_id=0
@11 = load[offset=80,end=92](@1) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@12 = gpu::topk[k=1,axis=0,largest=1](@5,@11) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@13 = load[offset=48,end=52](@1) -> float_type, {1}, {1}, target_id=0
@14 = get_tuple_elem[index=0](@12) -> float_type, {1}, {1}, target_id=0
@15 = gpu::code_object[code_object=4872,symbol_name=max_sub_mul_kernel,global=1,local=1024,](@14,@10,@13) -> float_type, {1}, {1}, target_id=0
@16 = load[offset=80,end=81](@1) -> uint8_type, {1}, {1}, target_id=0
@17 = gpu::code_object[code_object=4976,symbol_name=neg_div_clip_nearbyint_convert_kernel,global=1,local=1024,](@10,@15,@16) -> uint8_type, {1}, {1}, target_id=0
@18 = hip::copy_from_gpu(@15) -> float_type, {1}, {1}, target_id=0
@19 = hip::copy_from_gpu(@17) -> uint8_type, {1}, {1}, target_id=0
@20 = load[offset=64,end=76](@1) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@21 = multibroadcast[out_lens={3, 4},out_dyn_dims={}](@17) -> uint8_type, {3, 4}, {0, 0}, target_id=0
@22 = multibroadcast[out_lens={3, 4},out_dyn_dims={}](@15) -> float_type, {3, 4}, {0, 0}, target_id=0
@23 = gpu::code_object[code_object=5072,symbol_name=quantizelinear_kernel,global=6,local=1024,](@4,@22,@21,@20) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@24 = hip::copy_from_gpu(@23) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@25 = hip::sync_stream(@24,@18,@19) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@26 = @return(@25,@18,@19), target_id=0
From my point of view, that satisfies the requirement of fusing.
(If you edit this file again, please cut and paste the reference information as comments, as in some recent operators, e.g. qlinearconcat.)
Added the comments after the onnx namespace
    migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::max()}});
auto q_min = info.add_literal(
    migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::min()}});
auto x_reshape =
Would this step be necessary (for static shapes) if X is 1-D? Thanks.
No, it's not needed in that case; I will add a check to skip the conversion in the 1-D case.
The optimizer will remove the redundant reshapes automatically, so it's not necessary to do this here.
No need to revert the change either if you already updated it; just a note for the future.
76ab520 to 2a43a71 (Compare)
auto q_min = info.add_literal(
    migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::min()}});
auto x_reshape = x;
if(not(x_shape.lens().size() == 1)) |
Optional: Just one comparison would suffice: if(x_shape.lens().size() != 1)
// y_scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
auto sub0 = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
auto y_scale = info.add_instruction(migraphx::make_op("div"), sub0, q_max);
Optional: (q_max - q_min) instead of just q_max on line 130.
Excuse me, this is a compile-time step, not a new (additional) compute instruction. Hence the above suggestion. Sorry about any confusion.
Fixed
e2ca869 to bf8eadd (Compare)
auto div = info.add_instruction(migraphx::make_op("sub"), q_max, q_min);
auto sub0 = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
// qmax - qmin is always 255
auto div = q_max;
// https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html is no longer just for uint8. Please remove the comment on line 129.
auto div = q_max - q_min;
// line 130.
The link you provided is for QuantizeLinear, DynamicQuantizeLinear only has support for uint8: https://onnx.ai/onnx/operators/onnx__DynamicQuantizeLinear.html
You are right. There is (still) a disconnect between these two operators, and there shouldn't be!
But please do change line 130 as suggested: the calculation then applies logically for any type, including uint8, and that way qmax - qmin is still 255 here. This is not a compute step, just a compile-time expression. Thanks.
auto div = q_max - q_min; on line 130 doesn't compile, so I've changed the implementation and added a third literal called scale with the value of q_max - q_min.
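For illustration, a minimal sketch of that change, based only on the snippets quoted in this review; the exact placement and surrounding code in the PR may differ:

// qmax - qmin is known at parse time, so it can be emitted as a literal
// instead of a runtime "sub" instruction. Sketch only; names follow the
// snippets quoted above.
auto scale = info.add_literal(migraphx::literal{
    migraphx::shape{x_type},
    {std::numeric_limits<uint8_t>::max() - std::numeric_limits<uint8_t>::min()}});
// y_scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
auto sub0    = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
auto y_scale = info.add_instruction(migraphx::make_op("div"), sub0, scale);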
bf8eadd to 6008732 (Compare)
6008732 to 16e51d1 (Compare)
16e51d1 to 7cb098b (Compare)
Codecov Report
@@            Coverage Diff             @@
##           develop    #2489     +/-   ##
==========================================
  Coverage    91.50%   91.50%
==========================================
  Files          453      454       +1
  Lines        17183    17208      +25
==========================================
+ Hits         15723    15747      +24
- Misses        1460     1461       +1

View full report in Codecov by Sentry.
Thank you.
Add support for DynamicQuantizeLinear operator.
This implementation only works with static shapes due to the use of reshape; the reshape is needed to get the max and min values across the entire input tensor. Any ideas on how to solve that are welcome (see the sketch below).
Fixes: migraphx-benchmark#91
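For context, a minimal sketch of the reshape-based step described above, assuming the x, x_shape, and info names from the snippets quoted in the review. The reduce_min/reduce_max ops are used here for illustration (the compiled output earlier shows the reductions lowered to topk kernels), and the spec's clamping of min/max against zero is omitted:

// Sketch only: flatten x to 1-D so that min/max reductions cover the whole
// tensor. The flattened dimension comes from x_shape.elements(), which is
// only known for static shapes; that is the source of the limitation above.
auto x_reshape = x;
if(x_shape.lens().size() != 1)
{
    const auto n = static_cast<int64_t>(x_shape.elements());
    x_reshape    = info.add_instruction(
        migraphx::make_op("reshape", {{"dims", {n}}}), x);
}
// Global min/max over the flattened tensor (zero-clamping per the spec omitted).
auto min_x = info.add_instruction(
    migraphx::make_op("reduce_min", {{"axes", {0}}}), x_reshape);
auto max_x = info.add_instruction(
    migraphx::make_op("reduce_max", {{"axes", {0}}}), x_reshape);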