Add test and benchmark for explicit dot GEMM #637

brunomazzottiamd · 2024-09-11T20:45:48Z

Summary

This PR adds the missing test and benchmark support for the explicit dot GEMM Triton kernel developed in the scope of https://github.com/ROCm/triton-internal/issues/169. It also adds a GEMM variant implemented with tl.dot, so that three implementations (PyTorch GEMM, Triton GEMM with tl.dot, and Triton GEMM with explicit dot) can be compared for both correctness and performance.
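For readers unfamiliar with the term, the following is a minimal plain-Python sketch of the idea, not the Triton kernel from this PR. It assumes that "explicit dot" means an element-wise multiply followed by a sum reduction (as the `multreduce` file name suggests), and it checks that result against a straightforward reference GEMM. The function names `gemm_reference`, `gemm_mult_reduce`, and `all_close` are hypothetical. Shapes follow the script's help text: A is M x N and B is N x K.

```python
import math


def gemm_reference(a, b):
    """Reference GEMM: C[i][j] = sum over n of A[i][n] * B[n][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [math.fsum(a[i][n] * b[n][j] for n in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def gemm_mult_reduce(a, b):
    """Explicit-dot GEMM: multiply element-wise, then reduce with a sum."""
    b_columns = list(zip(*b))  # transpose B so each entry is one column
    c = []
    for a_row in a:
        c_row = []
        for b_col in b_columns:
            products = [x * y for x, y in zip(a_row, b_col)]  # multiply step
            c_row.append(sum(products))  # reduce step
        c.append(c_row)
    return c


def all_close(x, y, rel_tol=1e-9, abs_tol=1e-9):
    """Element-wise tolerance check, similar in spirit to a GEMM correctness test."""
    return all(
        math.isclose(p, q, rel_tol=rel_tol, abs_tol=abs_tol)
        for row_x, row_y in zip(x, y)
        for p, q in zip(row_x, row_y)
    )


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0]]                      # M = 1, N = 3
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # N = 3, K = 2
    print(gemm_mult_reduce(a, b))
    print(all_close(gemm_mult_reduce(a, b), gemm_reference(a, b)))
```

A real fp16 comparison would need looser tolerances than the ones used here, since the two implementations may accumulate in a different order.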

Help / How to run

python python/perf-kernels/multreduce_matmul_kernel.py -h

usage: multreduce_matmul_kernel.py [-h] [-M M] [-N N] [-K K] [--use-bias] [--use-dot] {run,bench}

C = A * B + BIAS matrix multiplication kernel for small matrices (M ≤ 8)

positional arguments:
  {run,bench}  mode of operation:
                 run: run Triton kernel for a given (M, N, K) shape
                 bench: benchmark performance for target shapes

options:
  -h, --help   show this help message and exit

kernel shape arguments:
  -M M         rows of matrix A (must be less or equal to 8)
  -N N         columns of matrix A / rows of matrix B
  -K K         columns of matrix B
  --use-bias   use BIAS vector
  --use-dot    use tl.dot for dot product
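The help text above could be produced by an `argparse` setup along the following lines. This is a reconstruction from the printed usage only, not the code in the PR: the function names `build_parser` and `parse_args` are hypothetical, the defaults are unknown (left as `None` here), and rejecting M > 8 at parse time is an assumption about where that check lives.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="multreduce_matmul_kernel.py",
        description="C = A * B + BIAS matrix multiplication kernel for small matrices (M ≤ 8)",
        formatter_class=argparse.RawTextHelpFormatter,  # keep line breaks in help text
    )
    parser.add_argument(
        "mode",
        choices=["run", "bench"],
        help="mode of operation:\n"
             "  run: run Triton kernel for a given (M, N, K) shape\n"
             "  bench: benchmark performance for target shapes",
    )
    shape = parser.add_argument_group("kernel shape arguments")
    shape.add_argument("-M", type=int, help="rows of matrix A (must be less or equal to 8)")
    shape.add_argument("-N", type=int, help="columns of matrix A / rows of matrix B")
    shape.add_argument("-K", type=int, help="columns of matrix B")
    shape.add_argument("--use-bias", action="store_true", help="use BIAS vector")
    shape.add_argument("--use-dot", action="store_true", help="use tl.dot for dot product")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the M <= 8 limit from the help text is enforced here.
    if args.M is not None and args.M > 8:
        parser.error("M must be less than or equal to 8")
    return args


if __name__ == "__main__":
    print(parse_args())
```

With this sketch, `parse_args(["run", "-M", "1", "-N", "8192", "-K", "28672"])` yields a namespace with `mode="run"` and the three integer dimensions, matching the invocation shown in the next section.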

Running the Triton kernel for a single shape

python python/perf-kernels/multreduce_matmul_kernel.py run -M 1 -N 8192 -K 28672

Checking correctness of a single target shape

pytest -vvv python/perf-kernels/multreduce_matmul_kernel.py::test_matmul[1-4096-4096-True]
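The bracketed suffix is a pytest parametrization ID: by default pytest joins simple parameter values (numbers, strings, booleans) with `-`. The sketch below rebuilds such an ID and shell-quotes it, since shells such as zsh treat unquoted square brackets as a glob pattern. The helper name `pytest_node_id` is hypothetical, and the PR does not state what the final `True` stands for, so the tuple is treated as opaque.

```python
import shlex


def pytest_node_id(test_file, test_name, params):
    """Build a node ID the way pytest joins simple parametrize values with '-'."""
    param_id = "-".join(str(p) for p in params)
    return f"{test_file}::{test_name}[{param_id}]"


if __name__ == "__main__":
    node = pytest_node_id(
        "python/perf-kernels/multreduce_matmul_kernel.py",
        "test_matmul",
        (1, 4096, 4096, True),  # meaning of the last field is not stated in the PR
    )
    # shlex.quote wraps the ID in single quotes because it contains brackets.
    print("pytest -vvv " + shlex.quote(node))
```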

Checking correctness of all target shapes

pytest -vvv python/perf-kernels/multreduce_matmul_kernel.py

Benchmarking all target shapes

python python/perf-kernels/multreduce_matmul_kernel.py bench

Sample benchmark result on MI300X:

fp16_multreduce_matmul_kernel:
     M        N        K  Torch (GiB/s)  Triton Dot (GiB/s)  Triton Multreduce (GiB/s)
0  1.0   8192.0  28672.0        3861.00             2508.09                    2305.36
1  1.0   6144.0   6144.0        2034.46             2640.22                    1714.58
2  1.0   4096.0   4096.0        2048.43             1191.51                    1107.05
3  2.0  16384.0  16384.0        2418.02             2794.15                    2715.15
4  1.0   4096.0   3078.0        1066.43              839.09                     369.35
5  1.0     23.0     31.0           0.21                0.05                       0.05
6  1.0     23.0    128.0           0.84                0.20                       0.20
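The PR does not say how the GiB/s figures are derived. One common convention for memory-bound GEMM, sketched below purely as an assumption, divides the bytes of A, B, and C touched once by the measured kernel time. The function name `gemm_gib_per_s` is hypothetical, the bias vector is ignored, and 2 bytes per element corresponds to fp16.

```python
def gemm_gib_per_s(m, n, k, seconds, bytes_per_element=2):
    """Effective bandwidth in GiB/s, assuming A (m x n), B (n x k), and
    C (m x k) are each read or written exactly once."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    elements = m * n + n * k + m * k
    total_bytes = elements * bytes_per_element
    return total_bytes / seconds / 2**30


if __name__ == "__main__":
    # Hypothetical timing, not a measurement from this PR.
    print(gemm_gib_per_s(1, 4096, 4096, seconds=20e-6))
```

Under this convention the tiny shapes (rows 5 and 6) would report very low bandwidth simply because launch overhead dominates the few kilobytes moved, which is consistent with the table.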

python/perf-kernels/multreduce_matmul_kernel.py

vgokhale

Thank you!

brunomazzottiamd self-assigned this Sep 11, 2024

brunomazzottiamd marked this pull request as ready for review September 11, 2024 21:26

rahulbatra85 reviewed Sep 13, 2024

python/perf-kernels/multreduce_matmul_kernel.py (outdated, resolved)

brunomazzottiamd mentioned this pull request Sep 18, 2024

Add Layernorm kernel #641

Merged

brunomazzottiamd force-pushed the 271_add_test_and_bench_to_multreduce_matmul branch 2 times, most recently from c0c697a to f6209e7 Compare September 24, 2024 14:26

brunomazzottiamd force-pushed the 271_add_test_and_bench_to_multreduce_matmul branch from f6209e7 to 14a7e75 Compare October 1, 2024 14:03

brunomazzottiamd requested a review from rahulbatra85 October 1, 2024 14:06

xiaohuguo2023 reviewed Oct 1, 2024

python/perf-kernels/multreduce_matmul_kernel.py (resolved)

brunomazzottiamd force-pushed the 271_add_test_and_bench_to_multreduce_matmul branch from 02ddbfd to b48012f Compare October 2, 2024 12:40

brunomazzottiamd requested a review from xiaohuguo2023 October 2, 2024 12:41

vgokhale reviewed Oct 10, 2024

python/perf-kernels/multreduce_matmul_kernel.py (outdated, resolved)

vgokhale approved these changes Oct 10, 2024

Add test and benchmark for explicit dot GEMM

0c16e7a

brunomazzottiamd force-pushed the 271_add_test_and_bench_to_multreduce_matmul branch from b48012f to 0c16e7a Compare October 11, 2024 19:31

brunomazzottiamd requested a review from vgokhale October 11, 2024 19:33

vgokhale approved these changes Oct 11, 2024

brunomazzottiamd merged commit b6633f3 into ROCm:main_perf Oct 14, 2024
4 checks passed

brunomazzottiamd deleted the 271_add_test_and_bench_to_multreduce_matmul branch October 14, 2024 12:46

micmelesse pushed a commit that referenced this pull request Oct 28, 2024

Add test and benchmark for explicit dot GEMM (#637)

6a8427f
