The simplest but fast implementation of matrix multiplication in CUDA.

andylolu2/simpleGEMM

simpleGEMM

This is a minimal but fast implementation of matrix multiplication in CUDA. The source code is a single, 200-line file, gemm.cuh, which implements half-precision tensor core matrix multiplication, optimised for the Turing (SM75) architecture.

The implementation builds on CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easy to read (minimal CUDA/C++ background required) and easy to hack on.

Benchmark against standard implementations (see main.cu and reference.cu):

$ ./main
Usage: ./main M N K iters

$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138

$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532

$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095

The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS, so ./main reaches roughly 87% of peak.
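The reported TFLOPS figures are consistent with counting 2·M·N·K floating-point operations per multiplication (one multiply and one add for each of the K terms of every output element). The sketch below reproduces that arithmetic on the host. The function names are hypothetical and are not taken from main.cu, which may compute the figure differently.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one M x N x K matrix multiplication:
// each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in TFLOPS for `iters` multiplications that took
// `elapsed_ms` milliseconds in total.
double achieved_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                       int iters, double elapsed_ms) {
    assert(iters > 0 && elapsed_ms > 0.0);
    const double seconds = elapsed_ms / 1e3;
    return gemm_flops(m, n, k) * iters / seconds / 1e12;
}
```

Calling `achieved_tflops(4096, 4096, 4096, 1000, 6043.59)` gives about 22.74, in line with the first benchmark run above.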

Quick start

Requires the CUDA toolkit. See https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for installation instructions. If you don't have a compatible GPU, you can run this in Colab.

Compile the main.cu file (create the build/ directory first if it does not exist):

nvcc \
    --include-path ./ \
    --include-path cutlass/include \
    --generate-code=arch=compute_75,code=[compute_75,sm_75] \
    --expt-relaxed-constexpr \
    -forward-unknown-to-host-compiler \
    -std=c++17 \
    -O3 \
    -o build/main \
    main.cu

And run!

$ ./build/main
Usage: ./main M N K iters

$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

You can also build with CMake (a better option for development):

$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main 
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters

What's missing

The code trades off generality for simplicity:

  • Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
  • Optimised for SM75 with tensor cores. This is probably sub-optimal for SM80+ (e.g. A100), but probably not terrible either.
  • Assumes (and asserts) that the input dimensions are divisible by the block size.
  • Assumes the inputs are in row-major layout. (Though you probably only want to use a row-major layout anyway, as other combinations are 10-30% slower.)
  • Doesn't do software pipelining (interleaving the global memory loads for the next tile with computation on the current tile).
  • Is only optimal for "normal" problem sizes. For more exotic problem sizes such as small M/N with large K, specialised implementations such as a split-K kernel are likely to perform better.
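To illustrate the split-K idea mentioned in the last point, here is a minimal host-side sketch in plain C++. It is not part of gemm.cuh, and the function name and signature are hypothetical. The K dimension is cut into slices, each slice produces a partial M×N result, and the partial results are summed at the end. On a GPU each slice would typically be given to a different thread block, which creates extra parallelism when M and N are small. Here the slices simply run one after another.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C (m x n) = A (m x k) * B (k x n), all row-major, by splitting
// the K dimension into `splits` contiguous slices and reducing the partials.
std::vector<float> split_k_matmul(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t m, std::size_t n, std::size_t k,
                                  std::size_t splits) {
    assert(splits > 0);
    assert(a.size() == m * k && b.size() == k * n);

    const std::size_t slice = (k + splits - 1) / splits;  // ceil(k / splits)
    // One partial m x n result per slice of K.
    std::vector<std::vector<float>> partial(
        splits, std::vector<float>(m * n, 0.0f));

    for (std::size_t s = 0; s < splits; ++s) {
        // Half-open range [k_begin, k_end) of K handled by this slice;
        // trailing slices may be empty when splits does not divide k.
        const std::size_t k_begin = std::min(s * slice, k);
        const std::size_t k_end = std::min(k_begin + slice, k);
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = k_begin; p < k_end; ++p) {
                    acc += a[i * k + p] * b[p * n + j];
                }
                partial[s][i * n + j] = acc;
            }
        }
    }

    // Reduction step: sum the partial results element-wise.
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t s = 0; s < splits; ++s) {
        for (std::size_t idx = 0; idx < m * n; ++idx) {
            c[idx] += partial[s][idx];
        }
    }
    return c;
}
```

The extra reduction pass is the cost of split-K: it only pays off when the output is too small to keep the GPU busy on its own, which is why it is a specialised kernel and not the default.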
