Cooperative groups #2307

Draft
MichaelVarvarin wants to merge 24 commits into develop from cooperative-groups

Conversation

MichaelVarvarin
Contributor

Add support for cooperative groups and related functionality

@fwyzard
Contributor

fwyzard commented Jul 6, 2024

Looks good so far :-)

One functionality that we will need is the possibility of querying the maximum number of blocks that can be used with a given kernel, so the user can store it and use it for launching the kernel.
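For reference, a hedged sketch of such a query using the plain CUDA runtime API (not a proposed alpaka interface); someCooperativeKernel and blockSize are placeholders. The occupancy API reports how many blocks of the given kernel can be resident on one SM, and multiplying by the SM count gives the largest grid that can be launched cooperatively.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel to be launched cooperatively.
__global__ void someCooperativeKernel() {}

// Largest number of blocks of someCooperativeKernel that can be co-resident
// on the device, i.e. the upper bound for a cooperative launch.
inline int maxCooperativeBlocks(int blockSize, int device = 0)
{
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);

    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, someCooperativeKernel, blockSize, 0 /* dynamic shared memory */);

    return blocksPerSm * smCount;
}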

@MichaelVarvarin
Contributor Author

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount


//! Hello world kernel, utilizing grid synchronization.
//! Prints hello world from a thread, performs grid sync
//! and prints the sum of indices of this thread and the opposite thread (the sums have to be the same).
Contributor

[nit] Could you explain what the opposite thread is here?

Contributor Author

The thread that has the same distance from the end of the grid dimension as this one has from the start. So, if the IDs range from 0 to 9, the pairs are 0 and 9, 1 and 8, 2 and 7, and so on. Their sum is constant, so we can check whether the grid sync was performed successfully.
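As an illustration, a self-contained CUDA sketch of this check (not the kernel from this PR; oppositeSumKernel and the buffer name are made up). Each thread writes its own index, the whole grid synchronizes, and then every thread adds its own index to that of its opposite; all sums must equal gridSize - 1.

#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Must be launched via cudaLaunchCooperativeKernel so that grid.sync() is valid.
__global__ void oppositeSumKernel(int* indices)
{
    cg::grid_group grid = cg::this_grid();
    auto const i = static_cast<unsigned>(grid.thread_rank());
    auto const gridSize = static_cast<unsigned>(grid.size());

    indices[i] = static_cast<int>(i); // each thread publishes its own index

    grid.sync(); // grid-wide barrier: all writes above are visible afterwards

    auto const opposite = gridSize - 1u - i;
    // If the sync worked, every printed sum equals gridSize - 1.
    printf("thread %u: %d + %d = %d\n", i, indices[i], indices[opposite], indices[i] + indices[opposite]);
}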

@mehmetyusufoglu
Contributor

Add support for cooperative groups and related functionality

Could you add more details to the PR description?

@MichaelVarvarin
Contributor Author

Add support for cooperative groups and related functionality

Could you add more details to the PR description?

That depends on the desired scope of the PR; I've deliberately kept it vague so we can decide when to merge it.

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount

This looks like an upstream issue, at least locally, when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5

@fwyzard
Contributor

fwyzard commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount

This looks like an upstream issue, at least locally, when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5

Is it supposed to work?
What is the maximum number of concurrent blocks that can be used cooperatively with this kernel?

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount

This looks like an upstream issue, at least locally, when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5

Is it supposed to work? What is the maximum number of concurrent blocks that can be used cooperatively with this kernel?

Yes, it is. The maximum number reported is 16 * multiProcessorCount, and the kernel refuses to launch on both nvcc and clang if this number is exceeded.
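For completeness, a hedged host-side sketch (plain CUDA runtime, hypothetical kernel name) of what "refuses to launch" looks like: a cooperative launch whose grid exceeds the co-residency limit is expected to fail with cudaErrorCooperativeLaunchTooLarge rather than deadlock in the grid sync.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyCooperativeKernel(int* /* out */) {}

int main()
{
    int* out = nullptr;
    cudaMalloc(&out, sizeof(int));

    void* args[] = {&out};
    // Deliberately oversized grid: far more blocks than can be co-resident,
    // so the cooperative launch should be rejected up front.
    cudaError_t const err = cudaLaunchCooperativeKernel(
        reinterpret_cast<void*>(dummyCooperativeKernel),
        dim3(1u << 20),
        dim3(256),
        args,
        0 /* dynamic shared memory */,
        nullptr /* default stream */);

    if(err == cudaErrorCooperativeLaunchTooLarge)
        std::printf("cooperative launch rejected: grid too large\n");

    cudaFree(out);
    return 0;
}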

MichaelVarvarin force-pushed the cooperative-groups branch 2 times, most recently from bb5ccaa to 9614d9c on July 26, 2024, 12:35
@fwyzard
Contributor

fwyzard commented Jul 26, 2024

Add cooperative kernel launch and grid sync support for HIP

Nice 👍🏻

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 26, 2024

Add cooperative kernel launch and grid sync support for HIP

Nice 👍🏻

It doesn't actually work, unfortunately; I will investigate. It would be funny if I found a second compiler bug.
Update: it does work, it was a hardware issue.

MichaelVarvarin force-pushed the cooperative-groups branch 2 times, most recently from 36c2af3 to 334cea9 on August 11, 2024, 13:17
include/alpaka/alpaka.hpp: outdated review comment (resolved)
Comment on lines +257 to +267
static inline Error_t launchCooperativeKernel(
    void const* func,
    dim3 gridDim,
    dim3 blockDim,
    void** args,
    size_t sharedMem,
    Stream_t stream)
{
    return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
}

Contributor

Could you change this to be templated on the func argument?

Suggested change
-    static inline Error_t launchCooperativeKernel(
-        void const* func,
-        dim3 gridDim,
-        dim3 blockDim,
-        void** args,
-        size_t sharedMem,
-        Stream_t stream)
-    {
-        return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
-    }
+    template <typename TFunc>
+    static inline Error_t launchCooperativeKernel(
+        TFunc func,
+        dim3 gridDim,
+        dim3 blockDim,
+        void** args,
+        size_t sharedMem,
+        Stream_t stream)
+    {
+        return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
+    }

Same for the HIP implementation.
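For illustration, a hedged sketch of what such a templated HIP wrapper could look like (Error_t and Stream_t are assumed to be the corresponding alpaka HIP API aliases). Note that ::hipLaunchCooperativeKernel, like the CUDA entry point, takes an untyped function pointer, so a cast may still be needed inside the wrapper; see the discussion below.

template<typename TFunc>
static inline Error_t launchCooperativeKernel(
    TFunc func,
    dim3 gridDim,
    dim3 blockDim,
    void** args,
    size_t sharedMem,
    Stream_t stream)
{
    // The cast moves from the call site into the wrapper, since the HIP
    // runtime entry point expects void const*.
    return ::hipLaunchCooperativeKernel(
        reinterpret_cast<void const*>(func),
        gridDim,
        blockDim,
        args,
        static_cast<unsigned int>(sharedMem),
        stream);
}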

void const* kernelArgs[] = {&threadElemExtent, &task.m_kernelFnObj, &args...};

TApi::launchCooperativeKernel(
    reinterpret_cast<void*>(kernelName),
Contributor

Can you check if the cast can be removed after implementing https://github.com/alpaka-group/alpaka/pull/2307/files#r1714986927 ?

Contributor Author

No, you now have to do the same cast inside launchCooperativeKernel.

SimeonEhrig marked this pull request as ready for review on September 10, 2024, 06:22
psychocoderHPC marked this pull request as draft on September 10, 2024, 08:14
@SimeonEhrig
Member

@MichaelVarvarin Sorry for removing the draft state. I thought I was starting the GitHub Actions jobs. Not sure what went wrong.

psychocoderHPC added this to the 2.0.0 milestone on September 10, 2024
5 participants