Add DynamicQuantizeLinear op #2489

Merged: 5 commits merged into develop from dynamic_quantize_linear on Dec 12, 2023
Conversation

gyulaz-htec (Collaborator):

Add support for DynamicQuantizeLinear operator.
This implementation only works with static shapes due to the use of reshape. Reshape is needed to get the max and min values across the entire input tensor. Any ideas on how to solve that are welcome.

Fixes: migraphx-benchmark#91
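
For context, here is a minimal sketch of the flatten-then-reduce idea described above. It is purely illustrative: the names x, x_shape, and info, and the use of reduce_min/reduce_max, are assumptions (the compiled module quoted later in this thread suggests topk is what actually gets emitted), so the merged parser may differ.

// Hypothetical sketch: flatten X to 1-D so a single reduction spans every element.
// The reshape dims must be known at parse time, which is why only static shapes work.
std::vector<int64_t> flat_dims{static_cast<int64_t>(x_shape.elements())};
std::vector<int64_t> axes{0};
auto flat_x = info.add_instruction(migraphx::make_op("reshape", {{"dims", flat_dims}}), x);
auto min_x  = info.add_instruction(migraphx::make_op("reduce_min", {{"axes", axes}}), flat_x);
auto max_x  = info.add_instruction(migraphx::make_op("reduce_max", {{"axes", axes}}), flat_x);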

migraphx-bot (Collaborator) commented Nov 30, 2023:

Test    Batch    Rate new (7cb098)    Rate old (9d2003)    Diff
torchvision-resnet50 64 2,832.70 2,834.89 -0.08%
torchvision-resnet50_fp16 64 6,496.94 6,504.77 -0.12%
torchvision-densenet121 32 2,095.30 2,096.39 -0.05%
torchvision-densenet121_fp16 32 3,663.36 3,663.79 -0.01%
torchvision-inceptionv3 32 1,597.77 1,593.15 0.29%
torchvision-inceptionv3_fp16 32 2,563.84 2,561.29 0.10%
cadene-inceptionv4 16 722.21 722.57 -0.05%
cadene-resnext64x4 16 691.66 692.10 -0.06%
slim-mobilenet 64 8,333.46 8,334.20 -0.01%
slim-nasnetalarge 64 230.55 230.62 -0.03%
slim-resnet50v2 64 2,663.05 2,665.22 -0.08%
bert-mrpc-onnx 8 822.96 823.69 -0.09%
bert-mrpc-tf 1 385.53 389.31 -0.97%
pytorch-examples-wlang-gru 1 303.41 303.55 -0.05%
pytorch-examples-wlang-lstm 1 311.63 313.03 -0.45%
torchvision-resnet50_1 1 609.61 607.81 0.30%
torchvision-inceptionv3_1 1 343.60 345.51 -0.55%
cadene-dpn92_1 1 404.00 404.79 -0.20%
cadene-resnext101_1 1 328.15 328.36 -0.06%
slim-vgg16_1 1 459.24 459.17 0.02%
slim-mobilenet_1 1 2,074.25 2,110.55 -1.72%
slim-inceptionv4_1 1 212.51 214.65 -1.00%
onnx-taau-downsample 1 305.11 306.30 -0.39%
dlrm-criteoterabyte 1 21.59 21.63 -0.18%
dlrm-criteoterabyte_fp16 1 40.62 40.54 0.21%
agentmodel 1 5,905.94 5,884.50 0.36%
unet_fp16 2 54.78 54.75 0.05%
resnet50v1_fp16 1 931.21 945.09 -1.47%
bert_base_cased_fp16 64 903.21 903.34 -0.01%
bert_large_uncased_fp16 32 285.67 285.72 -0.02%
bert_large_fp16 1 166.59 166.68 -0.05%
distilgpt2_fp16 16 1,279.20 1,281.90 -0.21%

This build is OK for merge ✅

migraphx-bot (Collaborator) commented Nov 30, 2023:

     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ torchvision-inceptionv3_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

     ✅ slim-vgg16_1: PASSED: MIGraphX meets tolerance

     ✅ slim-mobilenet_1: PASSED: MIGraphX meets tolerance

     ✅ slim-inceptionv4_1: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

     ✅ bert_large_uncased_fp16: PASSED: MIGraphX meets tolerance

     ✅ bert_large: PASSED: MIGraphX meets tolerance

     🔴 distilgpt2_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
namespace onnx {

Contributor:

(If you edit this file again, please copy and paste the reference information into comments, as in some recent operators, e.g. qlinearconcat.)

My basic design question is: the operator reference says "A Function to fuse calculation for Scale, Zero Point and FP32->8Bit conversion of FP32 Input data."

But we are not fusing any calculations here... any thoughts on that? Thanks.

Collaborator (author):


I've checked the compiled GPU version of dynamicquantizelinear_2d_test.onnx.
From that it seems there are a lot of fused instructions in the kernels:

module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}, target_id=0
@1 = hip::hip_allocate_memory[shape=int8_type, {96}, {1},id=main:scratch] -> int8_type, {96}, {1}, target_id=0
@2 = load[offset=0,end=48](@1) -> float_type, {3, 4}, {4, 1}, target_id=0
x = @param:x -> float_type, {3, 4}, {4, 1}, target_id=0
@4 = hip::copy_to_gpu(x,@2) -> float_type, {3, 4}, {4, 1}, target_id=0
@5 = reshape_lazy[dims={12}](@4) -> float_type, {12}, {1}, target_id=0
@6 = load[offset=48,end=60](@1) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@7 = gpu::topk[k=1,axis=0,largest=0](@5,@6) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@8 = load[offset=64,end=68](@1) -> float_type, {1}, {1}, target_id=0
@9 = get_tuple_elem[index=0](@7) -> float_type, {1}, {1}, target_id=0
@10 = gpu::code_object[code_object=4536,symbol_name=min_kernel,global=1,local=1024,](@9,@8) -> float_type, {1}, {1}, target_id=0
@11 = load[offset=80,end=92](@1) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@12 = gpu::topk[k=1,axis=0,largest=1](@5,@11) -> [float_type, {1}, {1}, int64_type, {1}, {1}], target_id=0
@13 = load[offset=48,end=52](@1) -> float_type, {1}, {1}, target_id=0
@14 = get_tuple_elem[index=0](@12) -> float_type, {1}, {1}, target_id=0
@15 = gpu::code_object[code_object=4872,symbol_name=max_sub_mul_kernel,global=1,local=1024,](@14,@10,@13) -> float_type, {1}, {1}, target_id=0
@16 = load[offset=80,end=81](@1) -> uint8_type, {1}, {1}, target_id=0
@17 = gpu::code_object[code_object=4976,symbol_name=neg_div_clip_nearbyint_convert_kernel,global=1,local=1024,](@10,@15,@16) -> uint8_type, {1}, {1}, target_id=0
@18 = hip::copy_from_gpu(@15) -> float_type, {1}, {1}, target_id=0
@19 = hip::copy_from_gpu(@17) -> uint8_type, {1}, {1}, target_id=0
@20 = load[offset=64,end=76](@1) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@21 = multibroadcast[out_lens={3, 4},out_dyn_dims={}](@17) -> uint8_type, {3, 4}, {0, 0}, target_id=0
@22 = multibroadcast[out_lens={3, 4},out_dyn_dims={}](@15) -> float_type, {3, 4}, {0, 0}, target_id=0
@23 = gpu::code_object[code_object=5072,symbol_name=quantizelinear_kernel,global=6,local=1024,](@4,@22,@21,@20) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@24 = hip::copy_from_gpu(@23) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@25 = hip::sync_stream(@24,@18,@19) -> uint8_type, {3, 4}, {4, 1}, target_id=0
@26 = @return(@25,@18,@19), target_id=0

From my point of view, that satisfies the requirement of fusing.
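
For reference, here is a scalar sketch of the math those fused kernels implement, per the ONNX DynamicQuantizeLinear spec (uint8 only). It is illustrative C++ rather than MIGraphX code; the max_sub_mul and neg_div_clip_nearbyint_convert kernels above appear to correspond to the y_scale and y_zero_point computations, respectively.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {y_scale, y_zero_point} for a float input (assumes x is non-empty and not all zero).
std::pair<float, uint8_t> dql_params(const std::vector<float>& x)
{
    float x_min   = std::min(0.0f, *std::min_element(x.begin(), x.end()));
    float x_max   = std::max(0.0f, *std::max_element(x.begin(), x.end()));
    float y_scale = (x_max - x_min) / 255.0f; // qmax - qmin for uint8
    float zp      = std::clamp(std::nearbyint(-x_min / y_scale), 0.0f, 255.0f);
    return {y_scale, static_cast<uint8_t>(zp)};
}
// Each output element is then y[i] = clip(round(x[i] / y_scale) + y_zero_point, 0, 255).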

gyulaz-htec (Collaborator, author) commented Dec 4, 2023:


> (If you edit this file again, please copy and paste the reference information into comments, as in some recent operators, e.g. qlinearconcat.)

Added the comments after the onnx namespace.

migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::max()}});
auto q_min = info.add_literal(
migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::min()}});
auto x_reshape =
Contributor:

Would this step be necessary (for static shapes) if X is 1-D? Thanks.

Collaborator (author):

No, it's not needed in that case, I will add a check to skip the conversion in the 1-D case.

Collaborator:

The optimizer will remove the redundant reshapes automatically, so it's not necessary to do this here.

No need to revert the change if you already updated it; just a note for the future.

gyulaz-htec force-pushed the dynamic_quantize_linear branch from 76ab520 to 2a43a71 on December 4, 2023, 12:17
auto q_min = info.add_literal(
migraphx::literal{migraphx::shape{x_type}, {std::numeric_limits<uint8_t>::min()}});
auto x_reshape = x;
if(not(x_shape.lens().size() == 1))
Contributor:

Optional: Just one comparison would suffice: if(x_shape.lens().size() != 1)
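
For illustration, the suggested form applied to the hunk above might read as follows. This is a hypothetical sketch: x, x_shape, and info are assumed to be in scope, and the reshape body is a guess at what the truncated hunk continues with.

// Hypothetical completion of the hunk above, using the single-comparison form.
std::vector<int64_t> flat_dims{static_cast<int64_t>(x_shape.elements())};
auto x_reshape = x;
if(x_shape.lens().size() != 1)
    x_reshape = info.add_instruction(migraphx::make_op("reshape", {{"dims", flat_dims}}), x);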


// y_scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
auto sub0 = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
auto y_scale = info.add_instruction(migraphx::make_op("div"), sub0, q_max);
Contributor:

Optional. (q_max - q_min) instead of just q_max on line 130.

Contributor:

Excuse me, this is a compile-time step, not a new (additional) compute instruction; hence the above suggestion. Sorry about any confusion.

Collaborator (author):

Fixed

gyulaz-htec force-pushed the dynamic_quantize_linear branch 2 times, most recently from e2ca869 to bf8eadd on December 6, 2023, 08:51
auto div = info.add_instruction(migraphx::make_op("sub"), q_max, q_min);
auto sub0 = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
// qmax - qmin is always 255
auto div = q_max;
Contributor:

// https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html is no longer just for uint8. Please remove the comment on line 129.
auto div = q_max - q_min; // line 130.

Collaborator (author):

The link you provided is for QuantizeLinear; DynamicQuantizeLinear only supports uint8: https://onnx.ai/onnx/operators/onnx__DynamicQuantizeLinear.html

Contributor:

You are right. There is (still) a disconnect between these two operators, and there shouldn't be!

But please do change line 130 as suggested: the calculation then applies logically to any type, including uint8, and that way qmax - qmin is still 255 here. This is not a compute step, just a compile-time expression. Thanks.

Collaborator (author):

auto div = q_max - q_min; // line 130 <- this doesn't compile (q_max and q_min are instruction references, not values), so I've changed the implementation and added a third literal called scale with the value of q_max - q_min.
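
A minimal sketch of what that third-literal approach might look like (illustrative only; x_type, info, max_x, and min_x are assumed from the hunks quoted above, and the exact names in the merged code may differ):

// Hypothetical sketch: fold qmax - qmin (always 255 for uint8) into a single literal at parse time.
auto scale = info.add_literal(migraphx::literal{
    migraphx::shape{x_type},
    {std::numeric_limits<uint8_t>::max() - std::numeric_limits<uint8_t>::min()}});
auto sub0    = info.add_instruction(migraphx::make_op("sub"), max_x, min_x);
auto y_scale = info.add_instruction(migraphx::make_op("div"), sub0, scale);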

gyulaz-htec force-pushed the dynamic_quantize_linear branch from 6008732 to 16e51d1 on December 8, 2023, 09:27
gyulaz-htec force-pushed the dynamic_quantize_linear branch from 16e51d1 to 7cb098b on December 11, 2023, 09:35
codecov-commenter commented Dec 11, 2023:

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (9d2003a) 91.50% compared to head (3de3b2f) 91.50%.
Report is 6 commits behind head on develop.

❗ Current head 3de3b2f differs from pull request most recent head 1ec583d. Consider uploading reports for the commit 1ec583d to get more accurate results

File | Patch % | Lines missing
src/onnx/parse_dynamicquantizelinear.cpp | 96.00% | 1 ⚠️


Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #2489   +/-   ##
========================================
  Coverage    91.50%   91.50%           
========================================
  Files          453      454    +1     
  Lines        17183    17208   +25     
========================================
+ Hits         15723    15747   +24     
- Misses        1460     1461    +1     


lakhinderwalia (Contributor) left a comment:

Thank you.

causten merged commit 5fe1b07 into develop on Dec 12, 2023; 8 of 9 checks passed.
causten deleted the dynamic_quantize_linear branch on December 12, 2023, 21:58.
Successfully merging this pull request may close these issues.

DynamicQuantizeLinear operator is unsupported
7 participants