Significant perf drop when using dynamic ranges in GPU kernel #470
Comments
Yeah, this is due to KA allowing for arbitrary dimensions instead of just limiting the user to … You end up in https://github.com/JuliaGPU/CUDA.jl/blob/7f725c0a117c2ba947015f48833630605501fb3a/src/CUDAKernels.jl#L178 and then in KernelAbstractions.jl/src/nditeration.jl, line 73 (at c5fe83c).
So if we don't know the ndrange, the code here won't be optimized away and we execute quite a few more integer operations, which is particularly costly for small kernels. One avenue I have been meaning to try, but never got around to, is to ensure that most of the index calculations occur using …
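To make the distinction concrete, here is a hedged sketch of the two launch styles (assuming the standard KernelAbstractions API for passing a static workgroup size and ndrange at kernel construction; `fill_index!` and `A` are illustrative names, not from the benchmark):

```julia
using KernelAbstractions

# Toy kernel: every work-item writes its global linear index.
@kernel function fill_index!(A)
    i = @index(Global, Linear)
    @inbounds A[i] = i
end

backend = CPU()                      # or CUDABackend() with CUDA.jl loaded
A = zeros(Int, 1024)

# Dynamic ndrange: passed at call time, so the index arithmetic in
# nditeration.jl works on runtime values and cannot be folded away.
dyn = fill_index!(backend)
dyn(A; ndrange = length(A))

# Static workgroup size and ndrange: baked into the kernel's type at
# construction, so the compiler can constant-fold the index calculations.
stat = fill_index!(backend, 256, (length(A),))
stat(A)

KernelAbstractions.synchronize(backend)
```

In the static case the sizes become type parameters of the kernel object, which is what lets the extra integer operations disappear during specialization.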
Can you use …?
Here are the outputs from the …:
There is a performance pitfall that I didn't expect, in KernelAbstractions.jl/src/nditeration.jl, line 83 (at c5fe83c).
We have a call to …
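For readers unfamiliar with where those integer operations come from: reconstructing a cartesian index from a linear work-item id costs a div/rem pair per dimension. A sketch of the shape of that arithmetic (illustrative only, not the actual nditeration.jl code):

```julia
# Expand a 1-based linear work-item index into a 3D cartesian index.
# Each dimension costs an integer divide and/or remainder.
function expand_linear(i::Int, dims::NTuple{3,Int})
    d1, d2, _ = dims
    i0 = i - 1
    x = i0 % d1 + 1
    y = (i0 ÷ d1) % d2 + 1
    z = i0 ÷ (d1 * d2) + 1
    return (x, y, z)
end

# When `dims` is a compile-time constant, these divisions strength-reduce
# to shifts/multiplies; when `dims` only arrives at runtime, every
# work-item pays for real hardware idiv/irem instructions.
```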
x-ref: JuliaGPU/GPUArrays.jl#520
In contrast, with a constant ndrange:
The division is turned into a …
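This is standard LLVM strength reduction: with a constant divisor the integer division is replaced by cheaper operations (a shift for powers of two, a multiply-by-magic-constant otherwise), while a runtime divisor forces a real divide instruction. A quick way to see the difference from the Julia REPL:

```julia
# Constant divisor: LLVM can strength-reduce the division.
divconst(x::Int) = x ÷ 256

# Runtime divisor: LLVM must emit an actual integer divide.
divdyn(x::Int, n::Int) = x ÷ n

# julia> @code_llvm divconst(1000)   # expect shifts, no `sdiv`
# julia> @code_llvm divdyn(1000, 7)  # expect an `sdiv` instruction
```

The same effect applies inside the GPU kernel: a statically known ndrange makes the divisors constants, a dynamic one does not.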
Should one do more globally what was done for Metal there?
I am not sure right now.
Just a pointer to the relevant Metal implementation of using hardware indices when available: https://github.com/JuliaGPU/Metal.jl/blob/28576b3f4601ed0b32ccc74485cddf9a6f56249c/src/broadcast.jl#L82-L147
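The idea in that Metal.jl code is to skip the cartesian index reconstruction entirely when the destination is contiguous, using the hardware-provided linear thread position instead. A rough CUDA.jl analogue of the same trick (an illustrative sketch, not the Metal.jl implementation):

```julia
using CUDA

# For a contiguous destination, index with the hardware linear thread id
# directly; no div/rem-based cartesian index reconstruction is needed.
function linear_copy_kernel!(dst, src)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

# Usage sketch:
# dst = CUDA.zeros(Float32, 2^20); src = CUDA.rand(Float32, 2^20)
# @cuda threads=256 blocks=cld(length(dst), 256) linear_copy_kernel!(dst, src)
```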
Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KA with a dynamic ndrange definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.

A dynamic ndrange, as implemented in the benchmark https://github.com/PTsolvers/HPCBenchmarks.jl/blob/a5985aaaf931efb0caf194d669e3bfcb90c5c08e/CUDA/diffusion_3d.jl#L39, returns a nearly 50% perf drop compared to plain CUDA.jl and reference CUDA C.