Micro optimization for softmax_forward_kernel5 #762

Open
wants to merge 6 commits into base: master

Conversation

@insop commented Sep 20, 2024

This branch includes a micro-optimization for softmax_forward_kernel5.

Summary

  • use warpReduceMax in attention_forward.cu with __shfl_down_sync, to be consistent with the other kernels (reduce across all threads in a warp); a minimal sketch of this reduction pattern appears after this list

  • micro optimization for softmax_forward_kernel5

    • Result from ncu ./profile_gpt2cu: compared to the original code, this optimization shows the following improvements (left: original code, right: modified code):

      • Duration: 1.47 ms -> 1.38 ms
      • Compute (SM) [%]: 77.11% -> 78.68%
      • DRAM Throughput [%]: 45.03% -> 47.91%
  • tests done:

    • ./profile_gpt2cu
    • ./attention_forward 4
    • ./attention_forward 5
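
For reference, here is a minimal sketch of the warp-level max reduction pattern referenced in the first bullet, assuming a full 32-thread warp. This is illustrative only, not the exact diff in this PR:

__device__ float warpReduceMax(float val) {
    // tree reduction within the warp; after the loop, lane 0 holds the maximum
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
    }
    return val;
}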

Output from modified code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:45:01, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         929.76
    Elapsed Cycles                                                                   cycle                        1283575
    Memory [%]                                                                           %                          54.15
    DRAM Throughput                                                                      %                          47.91
    Duration                                                                       msecond                           1.38
    L1/TEX Cache Throughput                                                              %                          54.50
    L2 Cache Throughput                                                                  %                          51.48
    SM Active Cycles                                                                 cycle                     1275362.68
    Compute (SM) [%]                                                                     %                          78.68
    ---------------------------------------------------------------------- --------------- ------------------------------
                                

Output from the original code

  • NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu

  softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:49:03, Context 1, Stream 16
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1.21
    SM Frequency                                                             cycle/usecond                         928.26
    Elapsed Cycles                                                                   cycle                        1366538
    Memory [%]                                                                           %                          45.03
    DRAM Throughput                                                                      %                          45.03
    Duration                                                                       msecond                           1.47
    L1/TEX Cache Throughput                                                              %                          33.10
    L2 Cache Throughput                                                                  %                          48.18
    SM Active Cycles                                                                 cycle                     1358789.59
    Compute (SM) [%]                                                                     %                          77.11
    ---------------------------------------------------------------------- --------------- ------------------------------
                                    

Output from ./attention_forward

nvcc -O3 --use_fast_math -lcublas -lcublasLt attention_forward.cu -o attention_forward
  • testing softmax_forward_kernel4
# ./attention_forward 4
enable_tf32: 1
Using kernel 4
Checking block size 32.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 64.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 128.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 256.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 512.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
All results match. Starting benchmarks.

block_size   32 | time 2.794404 ms
block_size   64 | time 2.136679 ms
block_size  128 | time 2.125906 ms
block_size  256 | time 2.128598 ms
block_size  512 | time 2.151445 ms

  • testing softmax_forward_kernel5

# ./attention_forward 5
enable_tf32: 1
Using kernel 5
Checking block size 32.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 64.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 128.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 256.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 512.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
All results match. Starting benchmarks.

block_size   32 | time 2.016379 ms
block_size   64 | time 1.455155 ms
block_size  128 | time 1.452482 ms
block_size  256 | time 1.450271 ms
block_size  512 | time 1.454224 ms

insop song and others added 4 commits September 16, 2024 16:00
- Micro optimize softmax_forward5
- use __shfl_xor_sync for warpReduceMax so all threads return the max
@insop (Author) commented Sep 22, 2024

@gordicaleksa, @ngc92, @ademeure,

It would be great if you could take a look at this PR when you get a chance.
Thank you.

@ngc92 (Contributor) commented Sep 23, 2024

Could you give a bit more detail about these changes? From a quick look, it seems like you changed a block-wise reduction into just a warp-level reduction. Is that correct?

@insop (Author) commented Sep 24, 2024

could you give a bit more detail about these changes? From a quick look, it seems like you changed a block-wise reduction into just warp-level reduction. Is that correct?

Hi @ngc92

When I profiled softmax_forward_kernel5(), I found that the last part of the function, which scales the row and writes it back to the out buffer, consumes a large portion of the kernel's runtime.

So I looked more closely at that part and found that organizing the memory writes as groups of 4 floats improves memory throughput thanks to better coalesced access.
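
Roughly, the idea looks like the following. This is a hedged sketch only, assuming a float output buffer, 16-byte alignment, and a row length T that is a multiple of 4; write_row_vec4, norm, and lane_id are illustrative names, not the actual code in this PR:

__device__ void write_row_vec4(float* out, const float* inp, float norm, int T, int lane_id) {
    // each thread stores 4 consecutive floats per iteration -> wider, coalesced transactions
    for (int i = lane_id * 4; i < T; i += 32 * 4) {
        float4 v = *reinterpret_cast<const float4*>(inp + i);
        v.x *= norm; v.y *= norm; v.z *= norm; v.w *= norm;
        *reinterpret_cast<float4*>(out + i) = v;  // one 16-byte store instead of four 4-byte stores
    }
}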

@ngc92 (Contributor) commented Sep 24, 2024

@insop
The optimization of kernel 5 looks clear to me; it's the changes to kernel 4 that I'm worried about.

@insop (Author) commented Sep 24, 2024

@insop the optimization of kernel 5 looks clear to me, it's the changes of kernel 4 that I'm worried about.

Hi @ngc92

  • Looking at this change to softmax_forward_4 again, I will remove that part of the change from this PR so that we can review only the softmax_forward_5 optimization.

  • I agree that my change transitions from a block-level reduction to a warp-level reduction, and I may need to consider the implications of this change regardless of the tests passing.

  • The fact that softmax_forward_5 in the main file attention.cuh uses a warp-level reduction for all threads (utilizing __shfl_xor_sync instead of __shfl_down_sync, see the linked code) is what led me to update softmax_forward_4 initially; a short sketch of the difference follows below.

The softmax_forward_4 change is now reverted, PTAL.
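
To illustrate the difference between the two shuffle variants (a hedged sketch; warpReduceMaxAllLanes is just an illustrative name): the xor (butterfly) variant leaves the maximum in every lane, whereas the __shfl_down_sync version leaves it only in lane 0.

__device__ float warpReduceMaxAllLanes(float val) {
    // butterfly exchange: every lane ends up with the warp-wide maximum
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
    }
    return val;  // valid in all 32 lanes, no extra broadcast needed
}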

@insop (Author) commented Sep 28, 2024

Hi @ngc92
PTAL,

Thank you
