
Added new cuda kernel for encoder forwards using three dimensional kernels #459

Open · wants to merge 2 commits into base: master
Conversation

ChrisDryden (Contributor) commented May 25, 2024

Spent the afternoon trying to understand how multi-dimensional CUDA kernel launches work and came up with an example for the encoder forward pass, but I'm running into the issue that for large block sizes it's slower. I'd love for someone with a deeper understanding of how this works to take a look.

This is an attempt to apply the advice that @ngc92 gave on this topic: #406

Kernel 3:
block_size 32 | time 0.0309 ms | bandwidth 3257.16 GB/s
block_size 64 | time 0.0293 ms | bandwidth 3436.13 GB/s
block_size 128 | time 0.0291 ms | bandwidth 3463.98 GB/s
block_size 256 | time 0.0294 ms | bandwidth 3428.10 GB/s
block_size 512 | time 0.0293 ms | bandwidth 3436.86 GB/s
block_size 1024 | time 0.0300 ms | bandwidth 3351.67 GB/s

Kernel 4:
block_size 32 | time 0.0306 ms | bandwidth 3286.23 GB/s
block_size 64 | time 0.0295 ms | bandwidth 3416.79 GB/s
block_size 128 | time 0.0292 ms | bandwidth 3444.44 GB/s
block_size 256 | time 0.0295 ms | bandwidth 3413.71 GB/s
block_size 512 | time 0.0315 ms | bandwidth 3194.80 GB/s
block_size 1024 | time 0.0518 ms | bandwidth 1942.42 GB/s

I'm having a hard time intuitively understanding why it could be slower, since it removes all of the modulo and division operations.

ChrisDryden (Author) commented May 27, 2024

Figured out that I needed to choose the block size dynamically based on the value of C and the requested block size, and it is now around 0.0020 ms faster!

block_size 32 | time 0.0346 ms | bandwidth 2911.14 GB/s
block_size 64 | time 0.0319 ms | bandwidth 3153.09 GB/s
block_size 128 | time 0.0309 ms | bandwidth 3256.19 GB/s
block_size 256 | time 0.0311 ms | bandwidth 3235.72 GB/s
block_size 512 | time 0.0311 ms | bandwidth 3234.76 GB/s
block_size 1024 | time 0.0310 ms | bandwidth 3251.12 GB/s

block_size 32 | time 0.0350 ms | bandwidth 2876.90 GB/s
block_size 64 | time 0.0314 ms | bandwidth 3207.94 GB/s
block_size 128 | time 0.0313 ms | bandwidth 3211.82 GB/s
block_size 256 | time 0.0312 ms | bandwidth 3227.00 GB/s
block_size 512 | time 0.0318 ms | bandwidth 3163.65 GB/s
block_size 1024 | time 0.0329 ms | bandwidth 3058.42 GB/s

I was able to find this by profiling: the bottleneck at the larger block sizes was threads hitting the c < C guard and returning without doing any work. With this change, the profiler shows that the condition check no longer fails for larger block sizes!

[Two profiler screenshots, captured May 27, 2024]
