GPU Blocks per SM #1165

Merged: 29 commits merged into develop on Jan 7, 2022
Conversation

rchen20 (Member Author) commented Nov 19, 2021

Summary

These items will be addressed after obtaining clarification on the launch_bounds parameter from AMD:

  • Add check for max blocks_per_sm of 32?

  • Work out design of HIP blocks_per_sm or min_warps_per_eu.

  • Add default for HIP.

  • Refactor CudaKernelFixedSM from test/old-tests/unit/test-kernel.cpp. Can be done as part of test re-org.

  • Repeat for SYCL? Don't see a way to do this in SYCL.

Design review

  • Adds *_exec_explicit policies for CUDA. The original *_exec policies delegate to the explicit policies (see the sketch below).
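As a rough illustration of that layering, here is a minimal sketch; the template signatures are assumptions based on this thread, not the merged code:

// Hedged sketch of the policy layering described above; the exact
// template signatures are assumptions.  The explicit policy carries
// BLOCKS_PER_SM:
template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async = false>
struct cuda_exec_explicit;

// The original policy delegates to it, with BLOCKS_PER_SM defaulting
// to 1 (the always-valid value settled on later in this thread):
template <size_t BLOCK_SIZE, bool Async = false>
using cuda_exec = cuda_exec_explicit<BLOCK_SIZE, 1, Async>;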

Comment on lines +363 to +367
#if defined(RAJA_TEST_EXHAUSTIVE)
// avoid compilation error:
// tpl/camp/include/camp/camp.hpp(104): error #456: excessive recursion at instantiation of class
RAJA::cuda_work<256>,
RAJA::cuda_work<1024>
#endif
rchen20 (Member Author):
@MrBurmark @trws Is this ok? Removing this test case seems to make everything better. I could remove the 1024 test case instead.

Member:
hmm, if we have to relegate one, I would keep 1024. But hopefully we can avoid this change.

rchen20 (Member Author):
Yes, I kept 1024 in there a couple of lines down. GitHub's diff formatting just makes it difficult to see.

Member:
That error in the comment makes me curious. What would cause excessive recursion from this?

As to the counts, if this is testing blocks per SM then it probably needs a check to ensure the hardware can actually run that many and give an XFAIL or similar for the case where that's not possible.

rchen20 (Member Author), Nov 30, 2021:
It's a set of WorkGroup tests which have 3 * 3 * 3 * 2 permutations of policies. Cutting it down to 2 * 3 * 3 * 2 seems to be fine for gcc/8.3.1 + cuda/10.1.243.

That particular count refers to the number of threads, although you're right it's a good idea to add a check for blocks per SM.
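For reference, a check like that might look something like the following hedged sketch. It is not part of this PR, and it assumes CUDA 11+, where the cudaDevAttrMaxBlocksPerMultiprocessor attribute is available:

#include <cuda_runtime.h>

// Hedged sketch, not part of this PR: decide whether to run or
// XFAIL a test case whose policy requests a given blocks-per-SM.
// Error codes from the runtime calls are ignored for brevity.
bool device_supports_blocks_per_sm(int requested)
{
  int device = 0;
  cudaGetDevice(&device);
  int max_blocks_per_sm = 0;
  cudaDeviceGetAttribute(&max_blocks_per_sm,
                         cudaDevAttrMaxBlocksPerMultiprocessor,
                         device);
  return requested <= max_blocks_per_sm;
}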

Member:
OK, if it's threads per block, that's safe enough; 1024 threads per block has been supported since cc 2.0.

Even with that many instantiations, camp shouldn't fail that way, since the vast majority of its algorithms are non-recursive. Mind filing some info on this, maybe a camp issue, so I can take a look at it?

rchen20 (Member Author):
@trws Made a CAMP issue here LLNL/camp#91. Thanks!

rchen20 (Member Author) commented Nov 30, 2021:

@rhornung67 @MrBurmark This should be ready to go, when you get a chance would you mind taking a look at this? Thanks!

* Populate and return a Vtable object where the
* call operator is a device function
*/
template < typename T, typename Vtable_T, size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async >
Member:
According to AMD docs this should be WARPS_PER_EU instead of BLOCKS_PER_SM.
https://rocmdocs.amd.com/en/latest/Programming_Guides/Kernel_language.html?highlight=launch_bounds#launch-bounds

rchen20 (Member Author):
That's correct. I've applied the conversion suggested by ROCm, WARPS_PER_EU = (THREADS_PER_BLOCK * BLOCKS_PER_SM) / 32, at the sites of the HIP __launch_bounds__. This way, we can keep BLOCKS_PER_SM as the standard.
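For illustration, the conversion would land at a kernel definition roughly like this hedged sketch; the kernel and its name are made up for the example, and the divisor of 32 follows the ROCm-documented formula:

// Hedged sketch of where the conversion lands; the kernel is
// illustrative, not the PR's code.  HIP's second __launch_bounds__
// argument is the minimum warps per execution unit:
template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, typename Body>
__global__ __launch_bounds__(BLOCK_SIZE, (BLOCK_SIZE * BLOCKS_PER_SM) / 32)
void forall_hip_kernel(Body body, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { body(i); }
}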

Member:
I understand that will keep things the same across CUDA and HIP, but I'm not sure that's our goal. I thought we were trying to expose the underlying programming model, and this is a place where HIP and CUDA differ. @rhornung67 @trws what do you think?

Member:
I agree that we should be faithful to the underlying programming model, since users should be aware of basic concepts like the CUDA warp size being 32 threads, the HIP "wavefront" size being 64 threads, etc.

Member:
So that leaves us to either specify WARPS_PER_EU directly or do something in between, like BLOCKS_PER_CU. I see where the documentation mentions that formula, but it also says that the number of EUs per CU is not known at compile time, so it isn't possible to convert BLOCKS_PER_CU to WARPS_PER_EU at compile time without assuming the EU_PER_CU value of a particular architecture.

Member:
What do we mean by 'EU' here? It should be CU, right? AMD CU (compute unit) and NVIDIA SM (streaming multiprocessor) are analogous; i.e., the smallest functional unit on a GPU. What am I missing?

Member:
I'm fine with MIN_WARPS_PER_EU. EU is execution unit (SIMD), see the link in the first post.
I'm not sure that formula makes much sense for the GPUs we're using anyway, since we have 4 EUs per CU.
If we have 256 THREADS_PER_BLOCK and 1 BLOCKS_PER_SM, that formula yields 8 WARPS_PER_EU when it should be just 1 (see the arithmetic below).
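Spelling that arithmetic out as a hedged sketch, with the wavefront size of 64 and the 4 EUs per CU from the comment above as the assumed hardware values:

// Hedged worked example of the mismatch described above.
constexpr int threads_per_block = 256;
constexpr int blocks_per_sm     = 1;

// The ROCm-documented formula, with its hard-coded divisor of 32:
constexpr int formula_warps_per_eu =
    (threads_per_block * blocks_per_sm) / 32;   // 8

// What the hardware actually schedules, assuming a wavefront size
// of 64 and 4 EUs per CU:
constexpr int wavefront_size = 64;
constexpr int eus_per_cu     = 4;
constexpr int actual_waves_per_eu =
    (threads_per_block / wavefront_size) * blocks_per_sm / eus_per_cu;  // 1

static_assert(formula_warps_per_eu == 8 && actual_waves_per_eu == 1, "");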

Member:
Annoying addition of EU to mean execution unit aside, I'd say we should pass the lockstep width through. If we pretend it's 32, it's entirely possible for a user to write code with it that will deadlock that would be correct if we used the correct size.

Member:
Just to be sure we're on the same page: as far as I understand, it isn't SIMD, it's more like hyperthreads or similar. It's the functional unit within a CU that has the resources we're trying to reason about, and there may be 1 or 4 of them per CU. This is not a terribly easy thing to reason about, since it's another level in the hierarchy that we aren't used to; SIMD is below the EU. I think the NVIDIA side technically has this too, but they don't expose the resources of the EUs separately from the resources of a CU. In general they try relatively hard to hide from the user the fact that each SM actually schedules 4 quarter-warps simultaneously on sub-elements. Either way, apparently we're stuck with this.

Member:
That is true. I'm not sure what passing the lockstep size through would look like. I'm pretty sure we have cuda warp and hip wavefront sizes defined somewhere.

Comment on lines 43 to 44
template <bool async>
struct LaunchExecute<RAJA::expt::cuda_launch_t<async, 0>> {
Member:
Do we want to keep a specialization that does not use launch bounds? This would affect whether the launch policy's default values for num_threads and BLOCKS_PER_SM should be 0 or 1.

rchen20 (Member Author):
I wasn't sure whether to keep this, but I didn't want to alter @artv3's interface, and left him the choice to decide/change this interface later. Yes, having num_threads = 0 here makes things somewhat inconsistent.

Member:
We should probably keep it then and let @artv3 remove/change it later. It's probably easier to let BLOCKS_PER_SM default to 1 then, so it will always be valid; a sketch of that arrangement follows.
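A minimal sketch of that arrangement, with namespaces elided; the names follow the quoted code, and the remaining details are assumptions rather than the merged code:

// Hedged sketch.  BLOCKS_PER_SM defaults to 1 so that every
// instantiation stays valid:
template <bool async, int num_threads, size_t BLOCKS_PER_SM = 1>
struct cuda_launch_t;

template <typename LaunchPolicy>
struct LaunchExecute;

// The kept specialization: num_threads == 0 selects a launch path
// compiled without a __launch_bounds__ qualifier.
template <bool async>
struct LaunchExecute<cuda_launch_t<async, 0>>
{
  // ... launch the kernel without __launch_bounds__ ...
};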

rchen20 (Member Author):
Sounds good, I'll change BLOCKS_PER_SM to default to 1 everywhere else.

rchen20 (Member Author):
Turns out I did remove this already, as indicated by the red removed background... However, the original cuda_launch_t policy is intact for use in our testing framework.

Member:
I mean that you should probably put this one back in, unremove it.

rchen20 (Member Author):
Added this back as you suggested. Also changed the default num_threads to 1 because that is more consistent with other policies, and the number of threads is not actually used in this policy anyway. Documented this in the code as well. Hopefully this is ok with @artv3?

Member:
Enhancements always welcomed! 👍

rhornung67 added this to the 2022.01 release milestone on Dec 10, 2021
artv3 previously approved these changes on Dec 21, 2021
rchen20 (Member Author) commented Jan 6, 2022:

@rhornung67 I've added some documentation and modified one of the examples to help with understanding the change. Let me know what you think.

std::cout << "\n Running RAJA CUDA explicit (2 blocks per SM) vector addition...\n";

// _rajacuda_explicit_vector_add_start
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, false>>(RAJA::RangeSegment(0, N),
Member:
Please use a named bool variable for the last template parameter for clarity; i.e.,

bool descriptive_name = false;
RAJA::forall<RAJA::cuda_exec_explicit<..., descriptive_name>>(...
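Applied to the snippet above, the suggestion might read like this hedged sketch; the variable name and the lambda body (arrays a, b, and c) are illustrative assumptions, not the example's actual code:

// Hedged sketch of the suggested change:
constexpr bool async_exec = false;  // run synchronously
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, async_exec>>(
    RAJA::RangeSegment(0, N), [=] RAJA_DEVICE (int i) {
      c[i] = a[i] + b[i];  // assumed vector-addition body
    });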

rhornung67 previously approved these changes on Jan 6, 2022

rhornung67 (Member) left a comment:
Looks OK to me.

@rchen20 rchen20 merged commit 947cd5a into develop Jan 7, 2022
@rchen20 rchen20 deleted the task/chen59/minblocks branch January 7, 2022 19:16