GPU Blocks per SM #1165
Conversation
#if defined(RAJA_TEST_EXHAUSTIVE)
// avoid compilation error:
// tpl/camp/include/camp/camp.hpp(104): error #456: excessive recursion at instantiation of class
RAJA::cuda_work<256>,
RAJA::cuda_work<1024>
#endif
@MrBurmark @trws Is this ok? Removing this test case seems to make everything better. I could remove the 1024 test case instead.
hmm, if we have to relegate one, I would keep 1024. But hopefully we can avoid this change.
Yes, I kept 1024 in there a couple lines down. GitHub just has weird formatting, so it's difficult to see.
That error in the comment makes me curious. What would cause excessive recursion from this?
As to the counts, if this is testing blocks per SM then it probably needs a check to ensure the hardware can actually run that many and give an XFAIL or similar for the case where that's not possible.
It's a set of WorkGroup tests which have 3 * 3 * 3 * 2 permutations of policies. Cutting it down to 2 * 3 * 3 * 2 seems to be fine for gcc/8.3.1 + cuda/10.1.243.
That particular count refers to the number of threads, although you're right it's a good idea to add a check for blocks per SM.
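A minimal sketch of the kind of hardware check suggested above, assuming the CUDA runtime API (cudaDeviceGetAttribute); how the test harness would actually report an XFAIL/skip is not shown:

#include <cuda_runtime.h>
#include <cstdio>

// Returns false (so the caller can skip/XFAIL the case) when the requested
// block size exceeds what the current device can launch.
bool block_size_supported(int threads_per_block)
{
  int device = 0;
  cudaGetDevice(&device);

  int max_threads_per_block = 0;
  cudaDeviceGetAttribute(&max_threads_per_block,
                         cudaDevAttrMaxThreadsPerBlock, device);

  if (threads_per_block > max_threads_per_block) {
    std::printf("Skipping: block size %d exceeds device limit %d\n",
                threads_per_block, max_threads_per_block);
    return false;
  }
  return true;
}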
Ok, if it's threads per block that's safe enough. That's been supported since cc2.0.
Even if it's that many, camp shouldn't fail that way since the vast majority of the algorithms are non-recursive. Mind filing some info on this, maybe a camp issue, so I can take a look at it?
@trws Made a CAMP issue here LLNL/camp#91. Thanks!
@rhornung67 @MrBurmark This should be ready to go; when you get a chance, would you mind taking a look at this? Thanks!
* Populate and return a Vtable object where the
* call operator is a device function
*/
template < typename T, typename Vtable_T, size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async >
According to AMD docs this should be WARPS_PER_EU instead of BLOCKS_PER_SM.
https://rocmdocs.amd.com/en/latest/Programming_Guides/Kernel_language.html?highlight=launch_bounds#launch-bounds
That's correct. I've applied the conversion suggested by ROCm, WARPS_PER_EU = (THREADS_PER_BLOCK * BLOCKS_PER_SM) / 32, at the sites of the HIP __launch_bounds__. This way, we can keep BLOCKS_PER_SM as the standard.
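A minimal sketch of that conversion (the kernel and parameter names are illustrative, not the actual RAJA internals): on HIP the second __launch_bounds__ argument is min warps per EU, so the CUDA-style BLOCKS_PER_SM is converted with the ROCm formula at the launch-bounds site.

template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, typename Body>
__global__ void __launch_bounds__(BLOCK_SIZE, (BLOCK_SIZE * BLOCKS_PER_SM) / 32)
hip_kernel_sketch(Body body)
{
  // CUDA-style "blocks per SM" is expressed to HIP as warps per execution unit.
  body(blockIdx.x * blockDim.x + threadIdx.x);
}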
I understand that that will keep things the same across CUDA and HIP, but I'm not sure that's our goal. I thought we were trying to expose the underlying programming model, and this is a place where HIP and CUDA differ. @rhornung67 @trws what do you think?
I agree that we should be faithful to the underlying PM since users should be aware of basic concepts like CUDA warp size is 32 threads, HIP "wavefront" size is 64 threads, etc.
So that leaves us to either specify WARPS_PER_EU directly or do something in between like BLOCKS_PER_CU. I see where the documentation mentions that formula, but it also says that the number of EUs per CU is not known at compile time, so it's not possible to convert BLOCKS_PER_CU to WARPS_PER_EU at compile time without assuming the EU_PER_CU value of a certain architecture.
What do we mean by 'EU' here? It should be CU, right? AMD CU (compute unit) and NVIDIA SM (streaming multiprocessor) are analogous; i.e., the smallest functional unit on a GPU. What am I missing?
I'm fine with MIN_WARPS_PER_EU. EU is execution unit (SIMD); see the link in the first post.
I'm not sure that formula makes much sense for the GPUs we're using anyway, as we have 4 EUs per CU. If we have 256 THREADS_PER_BLOCK and 1 BLOCKS_PER_SM, that formula yields 8 WARPS_PER_EU when it should be just 1.
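Spelling that example out (assuming a 64-wide wavefront and the 4 EUs per CU mentioned above):

constexpr size_t threads_per_block = 256;
constexpr size_t blocks_per_sm     = 1;

// ROCm's suggested conversion:
constexpr size_t warps_per_eu_formula = (threads_per_block * blocks_per_sm) / 32;  // = 8

// Counting wavefronts actually resident per EU on this hardware:
constexpr size_t wavefronts_per_cu = (threads_per_block / 64) * blocks_per_sm;     // = 4
constexpr size_t wavefronts_per_eu = wavefronts_per_cu / 4;                        // = 1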
Annoying addition of EU to mean execution unit aside, I'd say we should pass the lockstep width through. If we pretend it's 32, it's entirely possible for a user to write code with it that deadlocks but would be correct if we used the correct size.
Just to be sure we're on the same page: as far as I understand, it isn't SIMD, it's more like hyperthreads or similar. It's the functional unit within a CU which has the resources we're trying to reason about, and there may be 1 or 4 of them per CU. This is not a terribly easy thing to reason about, since it's another level in the hierarchy that we aren't used to. SIMD is below the EU. I think the NVIDIA side technically has this too, but they don't expose the resources of the EUs separately from the resources of a CU. In general they try relatively hard to hide from the user the fact that each SM actually schedules 4 quarter-warps simultaneously on sub-elements. Either way, apparently we're stuck with this.
That is true. I'm not sure what passing the lockstep size through would look like. I'm pretty sure we have CUDA warp and HIP wavefront sizes defined somewhere.
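One hedged sketch of what "passing the lockstep width through" might look like (the constant name is hypothetical, not an existing RAJA symbol; the wavefront width of 64 assumes the GPUs discussed here):

#if defined(__HIP_PLATFORM_AMD__)
constexpr int policy_lockstep_width = 64;  // HIP wavefront
#else
constexpr int policy_lockstep_width = 32;  // CUDA warp
#endif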
include/RAJA/policy/cuda/teams.hpp (Outdated)
template <bool async>
struct LaunchExecute<RAJA::expt::cuda_launch_t<async, 0>> {
Do we want to keep a specialization that does not use launch bounds? This would affect whether the launch policies' default values for num_threads and BLOCKS_PER_SM should be 0 or 1.
I wasn't sure whether to keep this, but didn't want to alter @artv3's interface, and left him the choice to decide/change this interface later. Yes, having num_threads = 0 here makes things somewhat inconsistent.
We should probably keep it then and let @artv3 remove/change it later. It's probably easier to let BLOCKS_PER_SM default to 1 then, so it will always be valid.
Sounds good, I'll change BLOCKS_PER_SM to default to 1 everywhere else.
Turns out I did remove this already, as indicated by the red removed background... However, the original cuda_launch_t policy is intact for use in our testing framework.
I mean that you should probably put this one back in, unremove it.
Added this back as you suggested. Also changed the default num_threads to 1 because that is more consistent with other policies, and the number of threads is not actually used in this policy anyway. Documented this in the code as well. Hopefully this is ok with @artv3?
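For illustration, the shape being discussed (parameter names and ordering are assumptions, not the actual RAJA declaration): both knobs default to 1, so a bare policy stays valid even though num_threads is unused by the no-launch-bounds specialization.

template <bool async, int num_threads = 1, size_t BLOCKS_PER_SM = 1>
struct cuda_launch_policy_sketch {};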
Enhancements always welcomed! 👍
@rhornung67 I've added some documentation and modified one of the examples to help with understanding the change. Let me know what you think.
examples/tut_add-vectors.cpp (Outdated)
std::cout << "\n Running RAJA CUDA explicit (2 blocks per SM) vector addition...\n";

// _rajacuda_explicit_vector_add_start
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, false>>(RAJA::RangeSegment(0, N),
Please use a named bool variable for the last template param for clarity; i.e.,
bool descriptive_name = false;
RAJA::forall<RAJA::cuda_exec_explicit<.....descriptive_name>>(...
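A hedged sketch of that suggestion applied to the call above (the constant name is illustrative, and the lambda body assumes the tutorial's vector-addition arrays):

constexpr bool Asynchronous = false;  // named last template parameter, per the review comment

RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, Asynchronous>>(
    RAJA::RangeSegment(0, N), [=] RAJA_DEVICE (int i) {
      c[i] = a[i] + b[i];
    });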
Looks OK to me.
Summary
This PR completes a feature to address #1041 (Add min blocks per SM policies for all relevant GPU policy cases).
It does the following:
Set default in existing policy to 1 block per SM
Teams - modify cuda_launch_t
WorkGroup?
Forall
cuda_occ requires blocks per SM to be initialized to 0; it will calculate blocks per SM automatically.
These will be done after obtaining clarification on the launch_bounds parameter from AMD:
Add check for max blocks_per_sm of 32?
Work out design of HIP blocks_per_sm or min_warps_per_eu.
Add default for HIP.
Refactor CudaKernelFixedSM from test/old-tests/unit/test-kernel.cpp. Can be done as part of test re-org.
Repeat for SYCL? Don't see a way to do this in SYCL.
Design review
Adds *_exec_explicit for cuda. The original *_exec policies call upon the explicit policies.
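A minimal sketch of that design (names and structure are assumptions based on the summary, not the actual RAJA declarations): the explicit policy carries BLOCKS_PER_SM, and the original-style policy forwards to it with the default of 1 block per SM.

template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async = false>
struct cuda_exec_explicit_sketch {};

template <size_t BLOCK_SIZE, bool Async = false>
using cuda_exec_sketch = cuda_exec_explicit_sketch<BLOCK_SIZE, 1, Async>;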