Micro optimization for softmax_forward_kernel5
#762
base: master
Conversation
- Micro-optimize `softmax_forward_kernel5`: use `__shfl_xor_sync` in `warpReduceMax` so that all threads in the warp return the max
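For illustration, a minimal sketch of a warp-level max reduction built on `__shfl_xor_sync` (a sketch of the technique named above, not the exact diff in this branch): after the butterfly loop every lane holds the maximum, so no extra broadcast from lane 0 is needed.

```cuda
// Sketch: warp-wide max reduction where every lane receives the result.
__device__ __forceinline__ float warpReduceMax(float val) {
    // butterfly (XOR) reduction: 5 steps for a 32-lane warp
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
    }
    return val; // identical max value in all 32 lanes
}
```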
@gordicaleksa, @ngc92, @ademeure, it would be great if you could take a look at this PR when you get a chance.
Could you give a bit more detail about these changes? From a quick look, it seems like you changed a block-wise reduction into just a warp-level reduction. Is that correct?
Hi @ngc92, when I profiled the kernel, the last part stood out, so I looked more closely at it and determined that organizing the memory write as 4 floats improves memory throughput due to better coalesced access.
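To illustrate the write pattern described here (a hypothetical helper, assuming a 16-byte-aligned destination and a row length that is a multiple of 4, not the exact code in this branch): packing four adjacent results into one `float4` lets each thread issue a single 16-byte store, which coalesces better across the warp than four scalar stores.

```cuda
// Sketch: write four floats with one vectorized store instead of four scalar stores.
// Assumes out + base is 16-byte aligned.
__device__ __forceinline__ void write4(float* out, int base,
                                       float v0, float v1, float v2, float v3) {
    reinterpret_cast<float4*>(out + base)[0] = make_float4(v0, v1, v2, v3);
}
```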
@insop
Hi @ngc92
Hi @ngc92, thank you.
This branch includes a micro-optimization for `softmax_forward_kernel5`.

Summary
- change `warpReduceMax` in `attention_forward.cu` to use `__shfl_xor_sync` instead of `__shfl_down_sync`, to be consistent with the other kernels (reduce to all threads in a warp); see the sketch after this list
- micro optimization for `softmax_forward_kernel5`
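For contrast, a minimal sketch of the `__shfl_down_sync` style this bullet moves away from (illustrative only, not code from the repository): the loop leaves the true maximum in lane 0 only, so the value would still have to be broadcast before every thread could use it, which is why the all-lane `__shfl_xor_sync` form shown earlier is more convenient.

```cuda
// Sketch: shuffle-down max reduction; only lane 0 ends up with the full result.
__device__ __forceinline__ float warpReduceMaxDown(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
    }
    return val; // correct only on lane 0; other lanes hold partial maxima
}
```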
Result from `ncu ./profile_gpt2cu`: compared to the original code, this optimization shows improvements (left: original code, right: modified code).

Tests done:
- `./profile_gpt2cu`
- `./attention_forward 4`
- `./attention_forward 5`
Output from the modified code

Output from the original code

Output from `./attention_forward`