Fp8 support for MatMul on cuda #22698

amarin16 · 2024-11-01T21:50:32Z

No description provided.

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

+// Option values:
+// - "0": Gemm fp8 mode is not enabled. [DEFAULT]
+// - "1": Gemm fp8 mode is enabled.
+static const char* const kOrtSessionOptionsGemmCudaFloat8E4M3FN = "enable_gemm_cuda_float8E4M3FN";


onnxruntime/test/providers/cpu/math/matmul_test.cc

+  // TODO add a unit test that has more than 256 elements, so that multiple blocks are used
+  // test.AddInput<MLFloat16>("A", {2, 4}, FloatsToMLFloat16s({1.0f, 2.0f, 3.0f, 4.0f, -1.0f, -2.0f, -3.0f, -4.0f}));
+  // test.AddInput<MLFloat16>("B", {4, 3}, FloatsToMLFloat16s({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f}));
+  // test.AddOutput<MLFloat16>("Y", {2, 3}, FloatsToMLFloat16s({10.0f, 10.0f, 10.0f, -10.0f, -10.0f, -10.0f}));


onnxruntime/test/providers/cpu/math/matmul_test.cc

+  // test.AddInput<MLFloat16>("B", {4, 3}, FloatsToMLFloat16s({10.f, 11.f, 12.f, 13.f, 14.f, 15.f, 16.f, 17.f, 18.f, 19.f, 20.f, 21.f}));
+  // test.AddInput<MLFloat16>("B", {4, 3}, FloatsToMLFloat16s({17.f, 19.f, 21.f, 13.f, 14.f, 15.f, 16.f, 17.f, 18.f, 19.f, 20.f, 21.f}));
+  // test.AddOutput<MLFloat16>("Y", {2, 3}, FloatsToMLFloat16s({160.0f, 170.0f, 180.0f, -160.0f, -170.0f, -180.0f}));


onnxruntime/core/providers/cuda/math/matmul.cc

tianleiwu · 2024-11-06T00:57:26Z

onnxruntime/test/providers/cpu/math/matmul_test.cc

+  // test.AddInput<MLFloat16>("B", {4, 3}, FloatsToMLFloat16s({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f}));
+  // test.AddOutput<MLFloat16>("Y", {2, 3}, FloatsToMLFloat16s({10.0f, 10.0f, 10.0f, -10.0f, -10.0f, -10.0f}));
+
+  test.AddInput<MLFloat16>("A", {2, 2}, FloatsToMLFloat16s({1.0f, 1.0f, 1.0f, 1.0f}));


For FP8 GEMM, the pointers and matrix dimensions (strides?) must be 16-byte aligned.

Could you test an input like {2, 16} instead of {2, 2}?

Tried that as well, but I see similar differences between the actual and expected results:

The difference between f_expected[i] and f_actual[i] is 11.55078125, which exceeds tolerance, where f_expected[i] evaluates to 16, f_actual[i] evaluates to 4.44921875, and tolerance evaluates to 0.018500000238418579.

I saw that you changed A to {2, 16}, but B and the output are still not 16-byte aligned.
How about testing M=16, K=32, N=16?

Detailed requirements: https://docs.nvidia.com/cuda/cublas/index.html#tensor-core-usage

((op_A == CUBLAS_OP_N ? m : k) * AtypeSize) % 16 == 0
((op_B == CUBLAS_OP_N ? k : n) * BtypeSize) % 16 == 0
(m * CtypeSize) % 16 == 0
(lda * AtypeSize) % 16 == 0
(ldb * BtypeSize) % 16 == 0
(ldc * CtypeSize) % 16 == 0
intptr_t(A) % 16 == 0
intptr_t(B) % 16 == 0
intptr_t(C) % 16 == 0

We need to add some checks before enabling fp8. If the requirements are not satisfied, we should not use fp8.
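A minimal sketch of such a guard, assuming the nine conditions quoted above are the ones to enforce. `GemmArgs` and `MeetsTensorCoreAlignment` are hypothetical names, not existing onnxruntime or cuBLAS symbols, and the fields describe the call as cuBLAS sees it (column-major), so a real caller would have to map the row-major MatMul operands onto them first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one GEMM call in cuBLAS (column-major) terms.
struct GemmArgs {
  bool trans_a;  // true when op_A != CUBLAS_OP_N
  bool trans_b;  // true when op_B != CUBLAS_OP_N
  std::int64_t m, n, k;
  std::int64_t lda, ldb, ldc;
  const void* a;
  const void* b;
  const void* c;
  std::size_t a_type_size, b_type_size, c_type_size;  // bytes per element
};

inline bool IsMultipleOf16(std::uint64_t value) { return value % 16 == 0; }

inline bool IsPointerAligned16(const void* p) {
  return IsMultipleOf16(reinterpret_cast<std::uintptr_t>(p));
}

// Returns true only when every alignment condition from the quoted list holds.
inline bool MeetsTensorCoreAlignment(const GemmArgs& g) {
  assert(g.a_type_size > 0 && g.b_type_size > 0 && g.c_type_size > 0);
  const auto bytes = [](std::int64_t count, std::size_t elem_size) {
    return static_cast<std::uint64_t>(count) * elem_size;
  };
  const std::int64_t a_rows = g.trans_a ? g.k : g.m;
  const std::int64_t b_rows = g.trans_b ? g.n : g.k;
  return IsMultipleOf16(bytes(a_rows, g.a_type_size)) &&
         IsMultipleOf16(bytes(b_rows, g.b_type_size)) &&
         IsMultipleOf16(bytes(g.m, g.c_type_size)) &&
         IsMultipleOf16(bytes(g.lda, g.a_type_size)) &&
         IsMultipleOf16(bytes(g.ldb, g.b_type_size)) &&
         IsMultipleOf16(bytes(g.ldc, g.c_type_size)) &&
         IsPointerAligned16(g.a) && IsPointerAligned16(g.b) &&
         IsPointerAligned16(g.c);
}
```

With 1-byte FP8 inputs and a 2-byte output, a non-transposed M=16, K=32, N=16 problem passes this check, while the {2, 2} case does not. The check only decides whether to fall back to the non-fp8 path; it does not by itself explain the result mismatches reported in this thread.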

I am seeing a similar behavior for M=16, K=32, N=16 as well.

The difference between f_expected[i] and f_actual[i] is 7.1015625, which exceeds tolerance, where f_expected[i] evaluates to 16, f_actual[i] evaluates to 8.8984375, and tolerance evaluates to 0.018500000238418579.

We need to add some checks before enabling fp8. If the requirements are not satisfied, we should not use fp8.

sure, we can add this

tianleiwu · 2024-11-06T01:19:09Z

onnxruntime/core/providers/cuda/math/matmul.cc

+  float* quant_float = (float*)malloc(256 * sizeof(float));
+  for (int i = 0; i < 256; i ++) {
+    quant_float[i] = i;
+  }
+  float std_quant = ComputeStandardDeviation(quant_float, 256);
+  free(quant_float);


quant_float is a constant vector, which means std_quant can be a constant. Why do we need to compute it at runtime?
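To illustrate the point, here is a sketch of the closed form for the standard deviation of the values 0, 1, ..., n-1. All function names are hypothetical. The diff does not show whether `ComputeStandardDeviation` divides by n or by n-1, so both variants are given, with a brute-force loop as a reference.

```cpp
#include <cmath>
#include <cstdint>

// Sum of squared deviations of 0..n-1 from their mean is n * (n^2 - 1) / 12.

// Population variance (divide by n): (n^2 - 1) / 12.
constexpr double PopulationVarianceOfIota(std::int64_t n) {
  return (static_cast<double>(n) * static_cast<double>(n) - 1.0) / 12.0;
}

// Sample variance (divide by n - 1): n * (n + 1) / 12.
constexpr double SampleVarianceOfIota(std::int64_t n) {
  return static_cast<double>(n) * static_cast<double>(n + 1) / 12.0;
}

// Reference implementation that mirrors the loop in the diff.
inline double BruteForceStdOfIota(int n, bool sample) {
  double mean = 0.0;
  for (int i = 0; i < n; ++i) mean += i;
  mean /= n;
  double sum_sq = 0.0;
  for (int i = 0; i < n; ++i) sum_sq += (i - mean) * (i - mean);
  return std::sqrt(sum_sq / (sample ? n - 1 : n));
}

// Standard deviations for n = 256, computed from the closed forms.
inline double PopulationStd256() { return std::sqrt(PopulationVarianceOfIota(256)); }
inline double SampleStd256() { return std::sqrt(SampleVarianceOfIota(256)); }
```

For n = 256 the population form gives roughly 73.90 and the sample form roughly 74.05, so whichever convention the helper uses, the value could be stored as a compile-time constant instead of allocating and filling a 256-element buffer on every call.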

hariharans29 · 2024-11-06T18:51:26Z

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

+// Option values:
+// - "0": Gemm fp8 mode is not enabled. [DEFAULT]
+// - "1": Gemm fp8 mode is enabled.
+static const char* const kOrtSessionOptionsGemmCudaFloat8E4M3FN = "enable_gemm_cuda_float8E4M3FN";


Since this is CUDA EP specific, should this be a generic session option or a CUDA EP provider option?

onnxruntime/test/providers/cpu/math/matmul_test.cc

+  // test.AddInput<MLFloat16>("A", {2, 2}, FloatsToMLFloat16s({1.0f, 1.0f, 1.0f, 1.0f}));
+  // test.AddInput<MLFloat16>("B", {2, 2}, FloatsToMLFloat16s({1.0f, 1.0f, 1.0f, 1.0f}));
+  // test.AddOutput<MLFloat16>("Y", {2, 2}, FloatsToMLFloat16s({2.0f, 2.0f, 2.0f, 2.0f}));


amarin16 added 30 commits June 19, 2024 17:15

add test skeleton and config option for cuda fp8

61c4cb0

update MatMul_float8E4M3FN

f67ec74

add initial cublasLtMatmul logic

36d403c

Merge branch 'main' into HEAD

a0a43b1

update logic

dc84631

update spacing

967f174

Use min_cuda_architecture 900

7d2f528

create and use 1.0 scales instead of getting them from the input

0328f94

compute scale using model weights as float

e2fd3c3

remove unnecessary span

4936cd2

small update

e63e597

merge main

ffbefc0

introduce ComputeScaleKernel

97bf3f1

use kernel to compute scale

35fc798

use instantiation to get rid of runtime error

d3b6685

only keep needed instantiations

ba18ef8

Merge branch 'main' into dev/amarin16/fp8

46a4bd0

small fixes

c3a2434

refator fp8 logic into separate function

784cffe

Specialize ComputeDefault for MLFloat16

b3d7731

remove template from kernel wrapper

ba17d9c

handle case when deviation is 0

1e5326c

cublasLtMatmulAlgoGetHeuristic no longer returns error

e723488

create fp8 tensors for left_X, right_X

046c5dc

Add transpose kernel

34cdd15

use cuda allocator, existing transpose kernel

0b46a22

compute scale using fp16, copy to device and use it

a2f8390

use DefaultCudaStream in PrePack

52fb51c

update print, use CUDA_R8_F_E4MR for ADesc and BDesc

f2176b7

merge main

1b64388

amarin16 added 4 commits October 22, 2024 19:39

cleanup

1256074

fix cublasLtMatmulAlgoGetHeuristic result

926268b

Merge branch 'main' into dev/amarin16/fp8

6c2b078

update interface after merge

07c91ae

github-advanced-security bot found potential problems Nov 1, 2024

tianleiwu reviewed Nov 6, 2024

onnxruntime/core/providers/cuda/math/matmul.cc

tianleiwu reviewed Nov 6, 2024

onnxruntime/core/providers/cuda/math/matmul.cc (outdated)

tianleiwu reviewed Nov 6, 2024

amarin16 added 4 commits November 6, 2024 16:27

Add cublaSetStream call

2d28c7d

use CublasLtHandle() instead of cublasLtCreate()

fe8f7b4

use {2, 16} dimensions in test

402ade6

Merge branch 'main' into dev/amarin16/fp8

de23a1a

hariharans29 reviewed Nov 6, 2024

use M=16, K=32, N=16 in test

ca2cc8f

github-advanced-security bot found potential problems Nov 6, 2024

tianleiwu requested a review from xadupre November 14, 2024 19:40
