This repository provides a comprehensive benchmarking framework to evaluate the performance of GEMM (General Matrix Multiply) and Attention operations across multiple ROCm backends, including IREE, rocBLAS, hipBLASLt, Triton, and PyTorch. The goal is to gain a deep understanding of how IREE compares to other AMD backends and to continuously track and visualize performance metrics.
We currently lack a comprehensive understanding of how IREE's performance stacks up against other backends like rocBLAS and hipBLASLt, particularly within AMD hardware. This repository aims to fill that gap by providing a robust and extensible benchmarking suite.
To address this problem, we provide an extensible benchmarking framework with the following features:
- Efficient GEMM Benchmarking: Instead of full model inference, which can be slow, this suite benchmarks thousands of simple GEMM kernels to gather performance data rapidly.
- Extensible Harness: The framework supports multiple backends, allowing for easy comparison and integration of new backends as needed.
- GPU Strain Simulation: Techniques are implemented to simulate the GPU workload of full model execution using isolated GEMMs.
- Nightly Performance Benchmarking: A nightly runner automatically benchmarks the latest performance results.
- Frontend Interface: An online interface that is constantly updated to show current performance metrics.
The C++ benchmarking library forms the core of this suite, providing:
- Configurable Benchmark Runner: Customizable settings for various benchmarking scenarios.
- Data Initialization & Memory Management: Efficient handling of data and memory for accurate benchmarking.
- De-optimization Techniques: Methods to simulate realistic GPU workload scenarios.
- Backends:
- IREE (using the IREE C API)
- rocBLAS (C++ API)
- hipBLASLt (C++ API)
A Python-based data collection system integrates seamlessly with the C++ library:
- ctypes + ZMQ Integration: Interfaces with the C++ library to dispatch and manage benchmarking tasks.
- GEMM Problem Generation: Automatically generates various GEMM problems, including different shapes and data types.
- Parallel Dispatching: Distributes benchmarking tasks across multiple GPUs, pooling results efficiently.
- Performance Visualization: Generates comparison graphs and roofline plots to visualize performance data.
The autotuner is designed to optimize performance:
- MLIR Generation: Produces candidate MLIRs with varying workgroup and tiling sizes.
- Performance Tuning: Selects the best-performing configurations for GEMM operations.
In addition to GEMM, the framework also supports benchmarking Flash Attention operations:
- Shared Pipeline: Utilizes the same pipeline as GEMM benchmarking for consistency and efficiency.
- Python Codegen + Compilation: Specialized scripts for code generation and compilation of Flash Attention operations.
- Tuning with Spec MLIR Files: Implements tuning with MLIR files for FP8 and FP16 data types.
- Backends:
- Native torch (compiled with ROCm)
- Triton Flash Attention
Here are the updated setup and execution instructions for your README based on the GitHub workflow:
- Hardware: Currently, this repository is tested only on AMD MI300 GPUs. There are future plans to support heterogeneous AMD hardware benchmarks.
- Software:
- ROCm v6.1+ and associated libraries (rocBLAS, hipBLASLt)
- CMake, Ninja
- Python 3.10+
- Meson build system
-
Clone the Repository:
git clone https://github.com/suryajasper/rocm-gemm-benchmark.git cd rocm-gemm-benchmark
-
Install Required Dependencies:
sudo apt install -y libzmq3-dev libboost-numpy-dev python3-zmq python3-h5py python3-tqdm ninja-build meson
-
Update IREE Submodule:
cd third_party/iree git submodule update --init cd ../..
-
Build LLVM Project:
cd third_party/llvm-project cmake -G Ninja -B build/ -S llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="mlir;clang" cmake --build build/ cd ../..
-
Build IREE Kernels:
cd src/ireekernels cmake -G Ninja -B ../ireekernelsbuild cmake --build ../ireekernelsbuild cd ../..
-
Build the Main Project:
CXX=hipcc meson setup build cd build sudo ninja cd .. mkdir -p results
-
Set Up Python Environment:
python3 -m venv venv source venv/bin/activate pip install -r gemmbench/requirements.txt deactivate
Benchmarks can be run in parallel across multiple GPUs. If the --backends
argument is unspecified, all backends will be initialized and can be benchmarked simultaneously.
-
RocBLAS Benchmarks:
sudo pkill -f gemm-bench source venv/bin/activate for device in $(seq 0 2); do (sudo build/gemm-bench --device=$device &); done ./gb run --backends=rocblas --repeat=1 --output=results/rocblas.csv sudo pkill -f gemm-bench deactivate
-
HipBLASLt Benchmarks:
sudo pkill -f gemm-bench source venv/bin/activate for device in $(seq 0 2); do (sudo build/gemm-bench --device=$device &); done ./gb run --backends=hipblaslt --repeat=1 --output=results/hipblaslt.csv sudo pkill -f gemm-bench deactivate
-
IREE Benchmarks:
sudo pkill -f gemm-bench source venv/bin/activate for device in $(seq 0 2); do (sudo build/gemm-bench --device=$device &); done ./gb run --backends=iree --repeat=1 --output=results/iree.csv sudo pkill -f gemm-bench deactivate
-
SHARK Attention Benchmarks:
sudo pkill -f gemm-bench source venv/bin/activate for device in $(seq 0 2); do (sudo build/gemm-bench --device=$device &); done ./gb run --backends=sharkfa --suite=flash_attention --repeat=1 --output=results/sharkfa_llama_sdxl_attention.csv sudo pkill -f gemm-bench deactivate
Contributions are welcome! Please open an issue or submit a pull request for any improvements or additions.