FP8 lossy downcast issue with "ref" implementation #2517

Open
umangyadav opened this issue Dec 5, 2023 · 3 comments
Labels
FP8 (issues related to FP8 implementation)

Comments

umangyadav (Member) commented Dec 5, 2023

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/pull/2506/files
This PR had to disable FP8 tests for the CPU backend.

The ref implementation performs a Float --> FP8 --> Float conversion, but the CPU backend runs the entire test in Float, so the results come out slightly different.

We need to figure out a way to enable those tests again.
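A minimal sketch of why the downcast is lossy, assuming round-to-nearest and the fp8e4m3fnuz layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 8, no infinities); the helper names here are hypothetical, not MIGraphX API:

```python
import numpy as np

# Enumerate every finite fp8e4m3fnuz value (hypothetical stand-in for the real format).
def fp8e4m3fnuz_grid():
    vals = [0.0]
    for e in range(16):
        for m in range(8):
            if e == 0:
                vals.append((m / 8.0) * 2.0 ** -7)              # subnormals
            else:
                vals.append((1.0 + m / 8.0) * 2.0 ** (e - 8))   # normals, bias 8
    vals = np.array(sorted(set(vals)))
    return np.concatenate([-vals[:0:-1], vals])                 # mirror negatives

GRID = fp8e4m3fnuz_grid()

def round_trip(x):
    """float -> fp8e4m3fnuz -> float, rounding to the nearest representable value."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float64))
    return GRID[np.abs(x[..., None] - GRID).argmin(axis=-1)]

print(round_trip([0.1, 3.14159]))  # [0.1015625, 3.25]: neither input survives the round trip
```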

For example, here are the compiled modules:

ref:
module: "main"
@0 = @literal{2} -> float_type, {1}, {0}, target_id=0
@1 = @literal{3} -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@5 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@7 = multibroadcast[out_lens={3, 2, 2, 8},out_dyn_dims={}](@1) -> float_type, {3, 2, 2, 8}, {0, 0, 0, 0}, target_id=0
@8 = convert[target_type=2](@5) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@9 = mul(@7,@8) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@10 = convert[target_type=12](@9) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@11 = quant_dot(@10,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@0) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@13 = mul(c,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = add(@11,@13) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0

## ref: quant_dot internally converts fp8e4m3fnuz_type to float and does the matrix multiplication
# Float --> fp8 --> float
cpu:
module: "main"
@0 = cpu::preallocate[shape=int8_type, {1008}, {1},id=main:scratch] -> int8_type, {1008}, {1}, target_id=0
@1 = cpu::literal -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@2 = cpu::literal -> float_type, {1}, {0}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@6 = convert[target_type=2](a) -> float_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@7 = transpose[permutation={0, 1, 3, 2}](@6) -> float_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@8 = convert[target_type=2](b) -> float_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@9 = transpose[permutation={0, 1, 3, 2}](@8) -> float_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@10 = load[offset=336,end=720](@0) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@11 = dnnl::binary[post_ops={},algo=binary_mul](@1,@7,@10) -> float_type, {3, 2, 2, 8}, {32, 16, 8, 1}, target_id=0
@12 = load[offset=0,end=336](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@13 = dnnl::dot[post_ops={}](@11,@9,@12) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@14 = multibroadcast[out_lens={3, 2, 2, 7},out_dyn_dims={}](@2) -> float_type, {3, 2, 2, 7}, {0, 0, 0, 0}, target_id=0
@15 = load[offset=672,end=1008](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@16 = dnnl::binary[post_ops={},algo=binary_mul](c,@14,@15) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@17 = load[offset=336,end=672](@0) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@18 = dnnl::binary[post_ops={},algo=binary_add](@13,@16,@17) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
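
Comparing the two modules: ref re-quantizes the scaled operand back to fp8 (the convert[target_type=12] at @10) before quant_dot, while the CPU backend stays in float after the initial upcast. A sketch of the resulting divergence, reusing the hypothetical round_trip helper above with small stand-in shapes:

```python
rng = np.random.default_rng(0)
a_f = round_trip(rng.normal(size=(2, 8)))   # fp8 operands upcast to float (exact)
b_f = round_trip(rng.normal(size=(8, 7)))

out_ref = round_trip(3.0 * a_f) @ b_f       # ref: mul result re-quantized to fp8 first
out_cpu = (3.0 * a_f) @ b_f                 # cpu: mul result never leaves float

print(np.max(np.abs(out_ref - out_cpu)))    # almost certainly nonzero
```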

## GPU:
# Float --> fp8 --> (fp8 inputs --> float32 accumulation) --> Float
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}, target_id=0
@1 = hip::hip_allocate_memory[shape=int8_type, {432}, {1},id=main:scratch] -> int8_type, {432}, {1}, target_id=0
@2 = load[offset=336,end=432](@1) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
a = @param:a -> fp8e4m3fnuz_type, {3, 2, 8, 2}, {32, 16, 2, 1}, target_id=0
@4 = transpose[permutation={0, 1, 3, 2}](a) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@5 = gpu::code_object[code_object=9120,symbol_name=convert_mul_convert_kernel,global=96,local=1024,](@4,@2) -> fp8e4m3fnuz_type, {3, 2, 2, 8}, {32, 16, 1, 2}, target_id=0
@6 = load[offset=0,end=336](@1) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
b = @param:b -> fp8e4m3fnuz_type, {3, 2, 7, 8}, {112, 56, 8, 1}, target_id=0
@8 = transpose[permutation={0, 1, 3, 2}](b) -> fp8e4m3fnuz_type, {3, 2, 8, 7}, {112, 56, 1, 8}, target_id=0
@9 = gpu::quant_gemm[alpha=1,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@5,@8,@6) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
output = @param:output -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
c = @param:c -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
@12 = gpu::code_object[code_object=9288,symbol_name=mul_add_kernel,global=42,local=1024,](c,@9,output) -> float_type, {3, 2, 2, 7}, {28, 14, 7, 1}, target_id=0
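
Note that the GPU module quantizes the same way ref does: convert_mul_convert_kernel reproduces ref's @8-@10, and quant_gemm with compute_fp32=1 accumulates float32 copies of the fp8 operands, which upcast exactly. Continuing the sketch above, the GPU path should therefore agree with ref (up to accumulation order), leaving the CPU backend as the odd one out:

```python
scaled_gpu = round_trip(3.0 * a_f)          # fused convert -> mul -> convert
out_gpu = scaled_gpu.astype(np.float32) @ b_f.astype(np.float32)  # fp32 accumulation

print(np.max(np.abs(out_gpu - out_ref)))    # ~0: GPU matches ref, unlike the CPU result
```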
umangyadav added the FP8 label (issues related to FP8 implementation) on Dec 5, 2023
umangyadav (Member, Author) commented Dec 6, 2023

The fix for this issue should work on all hardware, including MI300.

For example, #2506 attempted to fix this by adding a simplification for nested converts, but it didn't work on MI300.

umangyadav (Member, Author) commented
@lakhinderwalia FYI

lakhinderwalia (Contributor) commented
Thanks, @umangyadav. Yes, the right thing is to disable such apples-to-oranges tests. The issue here (ref vs. GPU in test_quantizelinear_convert) is very similar: assuming the test is fine while the GPU execution optimizes out the convert step is simply an incorrect way to test.
