Added new CUDA kernel for encoder forward using three-dimensional kernels #459
+39
−0
Spent the afternoon trying to understand how multi-dimensional CUDA kernel instantiation works and came up with an example for the encoder forward pass, but I'm running into an issue: for large block sizes it's slower. Would love for someone with a better understanding of how this works to take a look.
This is an attempt to apply the advice that @ngc92 gave on this topic in #406.
Kernel 3:
block_size 32 | time 0.0309 ms | bandwidth 3257.16 GB/s
block_size 64 | time 0.0293 ms | bandwidth 3436.13 GB/s
block_size 128 | time 0.0291 ms | bandwidth 3463.98 GB/s
block_size 256 | time 0.0294 ms | bandwidth 3428.10 GB/s
block_size 512 | time 0.0293 ms | bandwidth 3436.86 GB/s
block_size 1024 | time 0.0300 ms | bandwidth 3351.67 GB/s
Kernel 4:
block_size 32 | time 0.0306 ms | bandwidth 3286.23 GB/s
block_size 64 | time 0.0295 ms | bandwidth 3416.79 GB/s
block_size 128 | time 0.0292 ms | bandwidth 3444.44 GB/s
block_size 256 | time 0.0295 ms | bandwidth 3413.71 GB/s
block_size 512 | time 0.0315 ms | bandwidth 3194.80 GB/s
block_size 1024 | time 0.0518 ms | bandwidth 1942.42 GB/s
I'm having a hard time intuitively understanding why it could be slower, since it removes all of the modulo and division operations.