GPU Blocks per SM #1165

Merged: 29 commits merged into develop on Jan 7, 2022
Conversation

rchen20 (Member Author) commented Nov 19, 2021

Summary

These items will be addressed after obtaining clarification on the launch_bounds parameter from AMD:

  • Add check for max blocks_per_sm of 32?

  • Work out design of HIP blocks_per_sm or min_warps_per_eu.

  • Add default for HIP.

  • Refactor CudaKernelFixedSM from test/old-tests/unit/test-kernel.cpp. Can be done as part of test re-org.

  • Repeat for SYCL? Don't see a way to do this in SYCL.

Design review

  • Adds *_exec_explicit policies for CUDA. The original *_exec policies delegate to the explicit policies (see the sketch below).
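As a rough illustration of that layering, here is a minimal sketch; the template signatures are assumptions based on this thread, not the merged code:

// Hedged sketch of the policy layering described above; the exact
// template signatures are assumptions.  The explicit policy carries
// BLOCKS_PER_SM:
template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async = false>
struct cuda_exec_explicit;

// The original policy delegates to it, with BLOCKS_PER_SM defaulting
// to 1 (the always-valid value settled on later in this thread):
template <size_t BLOCK_SIZE, bool Async = false>
using cuda_exec = cuda_exec_explicit<BLOCK_SIZE, 1, Async>;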

Comment on lines +363 to +367
#if defined(RAJA_TEST_EXHAUSTIVE)
// avoid compilation error:
// tpl/camp/include/camp/camp.hpp(104): error #456: excessive recursion at instantiation of class
RAJA::cuda_work<256>,
RAJA::cuda_work<1024>
#endif
rchen20 (Member Author):
@MrBurmark @trws Is this ok? Removing this test case seems to make everything better. I could remove the 1024 test case instead.

Member:
hmm, if we have to relegate one, I would keep 1024. But hopefully we can avoid this change.

rchen20 (Member Author):
Yes, I kept 1024 in there a couple of lines down. GitHub's diff formatting just makes it difficult to see.

Member:
That error in the comment makes me curious. What would cause excessive recursion from this?

As to the counts, if this is testing blocks per SM then it probably needs a check to ensure the hardware can actually run that many and give an XFAIL or similar for the case where that's not possible.

rchen20 (Member Author), Nov 30, 2021:
It's a set of WorkGroup tests which have 3 * 3 * 3 * 2 permutations of policies. Cutting it down to 2 * 3 * 3 * 2 seems to be fine for gcc/8.3.1 + cuda/10.1.243.

That particular count refers to the number of threads, although you're right it's a good idea to add a check for blocks per SM.
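For reference, a check like that might look something like the following hedged sketch. It is not part of this PR, and it assumes CUDA 11+, where the cudaDevAttrMaxBlocksPerMultiprocessor attribute is available:

#include <cuda_runtime.h>

// Hedged sketch, not part of this PR: decide whether to run or
// XFAIL a test case whose policy requests a given blocks-per-SM.
// Error codes from the runtime calls are ignored for brevity.
bool device_supports_blocks_per_sm(int requested)
{
  int device = 0;
  cudaGetDevice(&device);
  int max_blocks_per_sm = 0;
  cudaDeviceGetAttribute(&max_blocks_per_sm,
                         cudaDevAttrMaxBlocksPerMultiprocessor,
                         device);
  return requested <= max_blocks_per_sm;
}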

Member:
OK, if it's threads per block, that's safe enough; 1024 threads per block has been supported since cc 2.0.

Even with that many instantiations, camp shouldn't fail that way, since the vast majority of its algorithms are non-recursive. Mind filing some info on this, maybe a camp issue, so I can take a look at it?

rchen20 (Member Author):
@trws Made a CAMP issue here LLNL/camp#91. Thanks!

rchen20 (Member Author) commented Nov 30, 2021:

@rhornung67 @MrBurmark This should be ready to go, when you get a chance would you mind taking a look at this? Thanks!

* Populate and return a Vtable object where the
* call operator is a device function
*/
template < typename T, typename Vtable_T, size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, bool Async >
Member:
According to AMD docs this should be WARPS_PER_EU instead of BLOCKS_PER_SM.
https://rocmdocs.amd.com/en/latest/Programming_Guides/Kernel_language.html?highlight=launch_bounds#launch-bounds

rchen20 (Member Author):
That's correct. I've applied the conversion suggested by ROCm, WARPS_PER_EU = (THREADS_PER_BLOCK * BLOCKS_PER_SM) / 32, at the sites of the HIP __launch_bounds__. This way, we can keep BLOCKS_PER_SM as the standard.
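For illustration, the conversion would land at a kernel definition roughly like this hedged sketch; the kernel and its name are made up for the example, and the divisor of 32 follows the ROCm-documented formula:

// Hedged sketch of where the conversion lands; the kernel is
// illustrative, not the PR's code.  HIP's second __launch_bounds__
// argument is the minimum warps per execution unit:
template <size_t BLOCK_SIZE, size_t BLOCKS_PER_SM, typename Body>
__global__ __launch_bounds__(BLOCK_SIZE, (BLOCK_SIZE * BLOCKS_PER_SM) / 32)
void forall_hip_kernel(Body body, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { body(i); }
}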

Member:
I understand that will keep things the same across CUDA and HIP, but I'm not sure that's our goal. I thought we were trying to expose the underlying programming model, and this is a place where HIP and CUDA differ. @rhornung67 @trws what do you think?

Member:
I agree that we should be faithful to the underlying programming model, since users should be aware of basic concepts like the CUDA warp size being 32 threads, the HIP "wavefront" size being 64 threads, etc.

Member:
So that leaves us to either specify WARPS_PER_EU directly or do something in between, like BLOCKS_PER_CU. I see where the documentation mentions that formula, but it also says that the number of EUs per CU is not known at compile time, so it isn't possible to convert BLOCKS_PER_CU to WARPS_PER_EU at compile time without assuming the EU_PER_CU value of a particular architecture.

Member:
What do we mean by 'EU' here? It should be CU, right? AMD CU (compute unit) and NVIDIA SM (streaming multiprocessor) are analogous; i.e., the smallest functional unit on a GPU. What am I missing?

Member:
I'm fine with MIN_WARPS_PER_EU. EU is execution unit (SIMD), see the link in the first post.
I'm not sure that formula makes much sense for the GPUs we're using anyway, since we have 4 EUs per CU.
If we have 256 THREADS_PER_BLOCK and 1 BLOCKS_PER_SM, that formula yields 8 WARPS_PER_EU when it should be just 1 (see the arithmetic below).
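Spelling that arithmetic out as a hedged sketch, with the wavefront size of 64 and the 4 EUs per CU from the comment above as the assumed hardware values:

// Hedged worked example of the mismatch described above.
constexpr int threads_per_block = 256;
constexpr int blocks_per_sm     = 1;

// The ROCm-documented formula, with its hard-coded divisor of 32:
constexpr int formula_warps_per_eu =
    (threads_per_block * blocks_per_sm) / 32;   // 8

// What the hardware actually schedules, assuming a wavefront size
// of 64 and 4 EUs per CU:
constexpr int wavefront_size = 64;
constexpr int eus_per_cu     = 4;
constexpr int actual_waves_per_eu =
    (threads_per_block / wavefront_size) * blocks_per_sm / eus_per_cu;  // 1

static_assert(formula_warps_per_eu == 8 && actual_waves_per_eu == 1, "");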

Member:
Annoying addition of EU to mean execution unit aside, I'd say we should pass the lockstep width through. If we pretend it's 32, it's entirely possible for a user to write code with it that will deadlock that would be correct if we used the correct size.

Member:
Just to be sure we're on the same page: as far as I understand, it isn't SIMD, it's more like hyperthreads or similar. It's the functional unit within a CU that has the resources we're trying to reason about, and there may be 1 or 4 of them per CU. This is not a terribly easy thing to reason about, since it's another level in the hierarchy that we aren't used to; SIMD is below the EU. I think the NVIDIA side technically has this too, but they don't expose the resources of the EUs separately from the resources of a CU. In general they try relatively hard to hide from the user the fact that each SM actually schedules 4 quarter-warps simultaneously on sub-elements. Either way, apparently we're stuck with this.

Member:
That is true. I'm not sure what passing the lockstep size through would look like. I'm pretty sure we have cuda warp and hip wavefront sizes defined somewhere.

Comment on lines 43 to 44
template <bool async>
struct LaunchExecute<RAJA::expt::cuda_launch_t<async, 0>> {
Member:
Do we want to keep a specialization that does not use launch bounds? This would affect whether the launch policy's default values for num_threads and BLOCKS_PER_SM should be 0 or 1.

rchen20 (Member Author):
I wasn't sure whether to keep this, but I didn't want to alter @artv3's interface, and left him the choice to decide/change this interface later. Yes, having num_threads = 0 here makes things somewhat inconsistent.

Member:
We should probably keep it then and let @artv3 remove/change it later. It's probably easier to let BLOCKS_PER_SM default to 1 then, so it will always be valid; a sketch of that arrangement follows.
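A minimal sketch of that arrangement, with namespaces elided; the names follow the quoted code, and the remaining details are assumptions rather than the merged code:

// Hedged sketch.  BLOCKS_PER_SM defaults to 1 so that every
// instantiation stays valid:
template <bool async, int num_threads, size_t BLOCKS_PER_SM = 1>
struct cuda_launch_t;

template <typename LaunchPolicy>
struct LaunchExecute;

// The kept specialization: num_threads == 0 selects a launch path
// compiled without a __launch_bounds__ qualifier.
template <bool async>
struct LaunchExecute<cuda_launch_t<async, 0>>
{
  // ... launch the kernel without __launch_bounds__ ...
};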

rchen20 (Member Author):
Sounds good, I'll change BLOCKS_PER_SM to default to 1 everywhere else.

rchen20 (Member Author):
Turns out I did remove this already, as indicated by the red removed background... However, the original cuda_launch_t policy is intact for use in our testing framework.

Member:
I mean that you should probably put this one back in, unremove it.

rchen20 (Member Author):
Added this back as you suggested. Also changed the default num_threads to 1 because that is more consistent with other policies, and the number of threads is not actually used in this policy anyway. Documented this in the code as well. Hopefully this is ok with @artv3?

Member:
Enhancements always welcomed! 👍

rhornung67 added this to the 2022.01 release milestone on Dec 10, 2021
artv3 previously approved these changes on Dec 21, 2021
rchen20 (Member Author) commented Jan 6, 2022:

@rhornung67 I've added some documentation and modified one of the examples to help with understanding the change. Let me know what you think.

std::cout << "\n Running RAJA CUDA explicit (2 blocks per SM) vector addition...\n";

// _rajacuda_explicit_vector_add_start
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, false>>(RAJA::RangeSegment(0, N),
Member:
Please use a named bool variable for the last template parameter for clarity; i.e.,

bool descriptive_name = false;
RAJA::forall<RAJA::cuda_exec_explicit<..., descriptive_name>>(...
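Applied to the snippet above, the suggestion might read like this hedged sketch; the variable name and the lambda body (arrays a, b, and c) are illustrative assumptions, not the example's actual code:

// Hedged sketch of the suggested change:
constexpr bool async_exec = false;  // run synchronously
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE/2, 2, async_exec>>(
    RAJA::RangeSegment(0, N), [=] RAJA_DEVICE (int i) {
      c[i] = a[i] + b[i];  // assumed vector-addition body
    });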

rhornung67 previously approved these changes on Jan 6, 2022

rhornung67 (Member) left a comment:
Looks OK to me.

@rchen20 rchen20 merged commit 947cd5a into develop Jan 7, 2022
@rchen20 rchen20 deleted the task/chen59/minblocks branch January 7, 2022 19:16