Align shared memory in fold & scan (only shuffle) #96
Description
This PR ensures that allocations of shared memory are properly aligned.
Motivation and context
Previously, shared memory was allocated without any padding. As a result, reads and stores could be misaligned, for instance when scanning an array containing (Bool, Int).
In particular, this may occur in the implementation of segmented scans. Segmented scans are typically implemented by pairing a value with a flag, as (Bool, a). However, if one implements it as (a, Bool), and the size of the allocated array is not a multiple of the alignment of a, then this bug is triggered: reads into the array of a values are misaligned.
This PR only fixes this issue for folds and scans using shuffle instructions. Fixing this for folds and scans on older hardware is possible, but probably not worth it given the age of that hardware and the complexity of the fix. I would thus propose to drop support for compute capabilities before 3.0.
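To illustrate the misalignment described above, here is a minimal host-side sketch of the offset arithmetic. It is not the code from this PR: the names align_up and layout_pair, the two-array layout, and the element sizes (1-byte flags, 8-byte values) are assumptions chosen only to show how an unpadded second array can start at a misaligned offset and how rounding up fixes it.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// Assumption: `alignment` is a non-zero power of two.
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout of two consecutive arrays inside one buffer.
struct Layout {
    std::size_t first_offset;   // byte offset of the first array
    std::size_t second_offset;  // byte offset of the second array
    std::size_t total_bytes;    // bytes needed for the whole buffer
};

// Place `count` elements of `first_size` bytes, followed by `count`
// elements of `second_size` bytes. With `pad == false` the second array
// starts directly after the first one (the unpadded behaviour); with
// `pad == true` its start is rounded up to `second_align`.
inline Layout layout_pair(std::size_t count,
                          std::size_t first_size,
                          std::size_t second_size,
                          std::size_t second_align,
                          bool pad) {
    const std::size_t first_end = count * first_size;
    const std::size_t second_offset =
        pad ? align_up(first_end, second_align) : first_end;
    return Layout{0, second_offset, second_offset + count * second_size};
}
```

For example, with 33 one-byte flags followed by 33 eight-byte values, the unpadded layout starts the values at byte 33, which is not a multiple of 8, while the padded layout starts them at byte 40.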
How has this been tested?
Using various applications of scans, including segmented scans defined with (a, Bool), on our RTX 4090.
Types of changes
Checklist: