[Proof of Concept] GPU kernel using block-local shared memory #73
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #73 +/- ##
===========================================
- Coverage 88.15% 70.24% -17.92%
===========================================
Files 15 15
Lines 498 625 +127
===========================================
Hits 439 439
- Misses 59 186 +127
Looking at the graph again, this seems only true for the 3090? The H100 also got faster?
No, the H100 is also ~2x slower with this kernel. The colors are not great in this plot.
With JuliaGPU/Metal.jl#480, JuliaGPU/Metal.jl#487 and JuliaGPU/Metal.jl#488, all kernels now work with Apple Silicon GPUs.
This is a proof-of-concept implementation of a more advanced kernel that manually makes use of block-local shared memory.
Each GPU block is associated with one NHS (neighborhood search) cell. All threads in a block then cooperatively load the data of all points in one neighboring cell into shared memory before working on that data, and repeat this for each neighboring cell. This allows for coalesced memory accesses.
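To make the access pattern concrete, here is a minimal serial sketch in plain Python. It is not the kernel from this PR and not GPU code; it only emulates the block-per-cell structure on the CPU. The function names, the 2D setting, the choice of cell size equal to the search radius, and the use of neighbor counting as a stand-in for the real per-pair work are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def build_cell_list(coords, cell_size):
    """Map each integer grid cell to the indices of the points it contains."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append(i)
    return dict(cells)


def count_neighbors_tiled(coords, radius):
    """Serial emulation: one 'block' per cell, one 'thread' per point in that cell."""
    cells = build_cell_list(coords, radius)  # assumption: cell size == search radius
    counts = [0] * len(coords)
    for (cx, cy), threads in cells.items():  # one block per cell
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbor_cell = cells.get((cx + dx, cy + dy), [])
                # Cooperative load: on a GPU, thread k of the block would copy the
                # k-th point of the neighboring cell into shared memory.
                shared = [(j, coords[j][0], coords[j][1]) for j in neighbor_cell]
                # A block-wide barrier would go here, so that no thread reads
                # the buffer before it is fully populated.
                for i in threads:  # every thread reads the same shared buffer
                    xi, yi = coords[i]
                    for j, xj, yj in shared:
                        if j != i and (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                            counts[i] += 1
                # A second barrier would go here before the buffer is overwritten.
    return counts


def count_neighbors_brute(coords, radius):
    """Reference: check every pair of points directly."""
    return [
        sum(
            1
            for j, (xj, yj) in enumerate(coords)
            if j != i and (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2
        )
        for i, (xi, yi) in enumerate(coords)
    ]
```

In this sketch, the `shared` list plays the role of the block-local buffer: it is filled once per neighboring cell and then read by every point of the block's own cell, instead of each point fetching neighbor data from global memory on its own.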
Unfortunately, this kernel is almost 2x slower than the original simple implementation on an H100.
Thanks to @vchuravy for developing this kernel with me.