[tune gemm v3.4] Add xcd-based pid remapping and change back to rocprofv1 #630

zhanglx13 · 2024-08-17T04:21:53Z

This PR reverts #613 because of a severe problem with rocprofv2, described in ticket#228.
The problem is that rocprofv2 "misses" many kernels in the tuning space, so a sub-optimal config can be picked.

We will switch back to rocprofv2 when the issue is resolved.

This PR also enables xcd-based pid remapping. I need to run more experiments to understand the effects of xcd-based remapping and group_size_m (as described in ticket#229).
To disable xcd-based remapping, change this line from

if NUM_XCDS != 1:

to

if NUM_XCDS == 1:
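
For context, here is a minimal sketch of what an xcd-based pid remapping can look like, written as plain Python rather than Triton. It assumes that workgroups are dispatched round-robin across XCDs (pid `i` lands on XCD `i % NUM_XCDS`) and renumbers the pids so that each XCD works on a contiguous range. The function name `remap_pid_by_xcd` and the handling of grids that are not divisible by the XCD count are illustrative assumptions, not necessarily the code in this PR.

```python
def remap_pid_by_xcd(pid: int, grid_mn: int, num_xcds: int) -> int:
    """Map a hardware pid to a logical pid so that each XCD gets a contiguous block.

    Assumption: hardware pid i runs on XCD (i % num_xcds).
    """
    if num_xcds == 1:
        return pid  # remapping disabled
    # Largest number of pids any single XCD receives.
    pids_per_xcd = (grid_mn + num_xcds - 1) // num_xcds
    # XCDs that receive pids_per_xcd pids; the rest receive one fewer.
    tall_xcds = grid_mn % num_xcds or num_xcds
    xcd = pid % num_xcds          # XCD this pid is assumed to run on
    local_pid = pid // num_xcds   # position of this pid within its XCD
    if xcd < tall_xcds:
        return xcd * pids_per_xcd + local_pid
    return (tall_xcds * pids_per_xcd
            + (xcd - tall_xcds) * (pids_per_xcd - 1)
            + local_pid)


# With 16 tiles and 8 XCDs, hardware pids 0 and 8 both run on XCD 0,
# so they become logical pids 0 and 1.
remapped = [remap_pid_by_xcd(p, 16, 8) for p in range(16)]
```

The remapping is a permutation of the pids, so every tile is still computed exactly once; only the assignment of tiles to XCDs changes.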

- set --iters=200 as default. This is enough since the time is stable after the first few runs.
- Filter out kernel time that is too large. We use the first kernel time as the threshold. There must be something wrong with the kernel if its elapsedTime is larger than the first run. We need to investigate the reason. For now, just filter them out.
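
The filtering rule above can be sketched as follows. This is only an illustration under stated assumptions: `filter_kernel_times` and `mean_kernel_time` are hypothetical helper names, and the input is assumed to be the list of elapsed times for one kernel in launch order, as extracted from the profiler output.

```python
def filter_kernel_times(elapsed_times):
    """Drop samples that are slower than the first run of the kernel.

    The first run is used as the threshold; anything above it is treated
    as an outlier to be investigated separately.
    """
    if not elapsed_times:
        return []
    threshold = elapsed_times[0]
    return [t for t in elapsed_times if t <= threshold]


def mean_kernel_time(elapsed_times):
    """Average the samples that survive the filter."""
    kept = filter_kernel_times(elapsed_times)
    return sum(kept) / len(kept) if kept else float("nan")


# The third sample (25.0) exceeds the first run (10.0) and is discarded.
kept = filter_kernel_times([10.0, 8.0, 25.0, 8.5])
```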

xiaohuguo2023 · 2024-08-17T22:19:51Z

python/perf-kernels/tune_gemm/tune_gemm.py

@@ -54,7 +54,7 @@ def get_full_tuning_space():
    block_k_range = [16, 32, 64, 128, 256]
    split_k_range = [1, 2, 4, 5, 6, 8, 10, 12, 16, 18, 24]
    num_warps_range = [1, 2, 4, 8]
-    group_m_range = [1, 4, 8, 16, 32]
+    group_m_range = [1, 2, 4, 8, 16]


Any reason why we get rid of 2 and 32? If we can't enable XCD mapping, we may need 32.

Let me add XCD mapping and re-think about the range for group_m then.

xiaohuguo2023 · 2024-08-18T20:47:42Z

I think we also need to enable tuning for irregular shapes by removing the two lines below:

https://github.com/ROCm/triton/blob/e21d43cc414180bd9ccec121e164af2ae3faf290/python/perf-kernels/tune_gemm/tune_gemm.py#L160C13-L161C25

scxiao · 2024-08-19T13:57:09Z

python/perf-kernels/tune_gemm/matmul_kernel.py

@@ -19,8 +30,9 @@ def matmul_kernel(a_ptr, b_ptr, c_ptr, bias_ptr, M, N, K, stride_am, stride_ak,
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
-        pid_m = first_pid_m + (pid % group_size_m)
+        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)


Does this change make an impact?

Not really. This only affects the swizzling in the very last group when M % GROUP_SIZE_M != 0, which is not usually the case in our tuning space.
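
To see where the two `pid_m` formulas differ, the index arithmetic from the hunk can be replayed in plain Python. This is a sketch: `tile_coords` is a hypothetical helper, and the `pid_n` and `num_pid_in_group` expressions are assumed from the standard Triton matmul tutorial because they are not shown in the hunk.

```python
def tile_coords(pid, num_pid_m, num_pid_n, GROUP_SIZE_M, fixed):
    """Return (pid_m, pid_n) for a flat pid using grouped ordering."""
    # Assumed definition; not part of the quoted hunk.
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    if fixed:
        # New formula from this PR.
        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    else:
        # Old formula.
        pid_m = first_pid_m + (pid % group_size_m)
    # Assumed definition; not part of the quoted hunk.
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


# 7 x 2 tiles with GROUP_SIZE_M = 4: the last group has only 3 rows.
old = [tile_coords(p, 7, 2, 4, fixed=False) for p in range(14)]
new = [tile_coords(p, 7, 2, 4, fixed=True) for p in range(14)]
```

In this example both variants visit all 14 tiles. The first (full) group is identical, and only the order inside the last, partial group changes, which matches the reply above.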

xiaohuguo2023 · 2024-08-19T11:31:34Z

python/perf-kernels/tune_gemm/tune_gemm.py

@@ -157,7 +157,7 @@ def prune_configs(M, N, K, configs, elemBytes_a, elemBytes_b):
            if num_warps < 4:
                continue
            # check if tiling is integer multiple of GEMM size because we have no boundary check
-            if M % BLOCK_SIZE_M != 0 or N % BLOCK_SIZE_N != 0 or K % BLOCK_SIZE_K != 0:
+            if M % BLOCK_SIZE_M != 0 or N % BLOCK_SIZE_N != 0:


M and N could be irregular as well?
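
A small sketch of the effect of this hunk, assuming the K remainder is left to the kernel's EVEN_K=false path while M and N still have no boundary check. `is_pruned` is a hypothetical stand-in for the single condition shown in the diff, not the full `prune_configs` function.

```python
def is_pruned(M, N, K, block_m, block_n, block_k, check_k):
    """Return True if a tiling is rejected by the divisibility check.

    check_k=True mimics the old condition, check_k=False the new one.
    """
    if M % block_m != 0 or N % block_n != 0:
        return True
    return check_k and K % block_k != 0


def even_k(K, block_k):
    """Same expression as the EVEN_K flag computed in matmul()."""
    return K % block_k == 0


# K = 4100 is not a multiple of 64: rejected before, kept now with EVEN_K=False.
before = is_pruned(4096, 4096, 4100, 128, 128, 64, check_k=True)
after = is_pruned(4096, 4096, 4100, 128, 128, 64, check_k=False)
# An irregular M is still rejected under both conditions.
irregular_m = is_pruned(4000, 4096, 4096, 128, 128, 64, check_k=False)
```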

xiaohuguo2023 · 2024-08-19T11:33:38Z

python/perf-kernels/tune_gemm/tune_gemm.py

@@ -366,11 +358,12 @@ def matmul(a, b, c, bias, block_m, block_n, block_k, group_m, split_k, num_warps
    grid = triton.cdiv(M, block_m) * triton.cdiv(N, block_n), split_k
    stride_bias = bias.stride(0) if use_bias else 0
    EVEN_K = K % block_k == 0
+    num_xcds = 1 if split_k > 1 else 8


Is XCD = 8 only applicable to MI300X?

…ofv1 (#630)

* Change to rocprofv1
* improve post processing of rocprof results
  - set --iters=200 as default. This is enough since the time is stable after the first few runs.
  - Filter out kernel time that is too large. We use the first kernel time as the threshold. There must be something wrong with the kernel if its elapsedTime is larger than the first run. We need to investigate the reason. For now, just filter them out.
* Add xcd-based pid remapping
* Enable EVEN_K=false for large gemms
* Update readme

Change to rocprofv1

05aead8

zhanglx13 requested a review from xiaohuguo2023 August 17, 2024 04:59

xiaohuguo2023 reviewed Aug 17, 2024

zhanglx13 added 2 commits August 18, 2024 16:22

Add xcd-based pid remapping

e355a42

Enable EVEN_K=false for large gemms

cba3d19

zhanglx13 changed the title ~~[tune gemm] Change back to rocprofv1~~ [tune gemm] Add xcd-based pid remapping and change back to rocprofv1 Aug 18, 2024

zhanglx13 changed the title ~~[tune gemm] Add xcd-based pid remapping and change back to rocprofv1~~ [tune gemm v3.4] Add xcd-based pid remapping and change back to rocprofv1 Aug 18, 2024

Update readme

907605a

zhanglx13 force-pushed the back_to_rocprofv1 branch from c550c5b to 907605a Compare August 19, 2024 03:08

zhanglx13 requested review from xiaohuguo2023, scxiao and vgokhale August 19, 2024 04:52

scxiao reviewed Aug 19, 2024

xiaohuguo2023 approved these changes Aug 19, 2024

zhanglx13 merged commit 15cb3a8 into main_perf Aug 19, 2024
4 checks passed
