Cooperative groups #2307
base: develop
Conversation
Looks good so far :-) One piece of functionality that we will need is the ability to query the maximum number of blocks that can be used with a given kernel, so the user can store it and use it when launching the kernel.
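A minimal sketch of how such a query could look with the CUDA runtime occupancy API (the kernel name and block size are illustrative placeholders, not from this PR):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; stands in for the user's cooperative kernel.
__global__ void dummyKernel() {}

int main()
{
    int device = 0;
    cudaDeviceProp prop{};
    if(cudaGetDeviceProperties(&prop, device) != cudaSuccess)
    {
        // Graceful fallback so the sketch also runs on hosts without a GPU.
        std::printf("no CUDA device available\n");
        return 0;
    }

    // Blocks per SM for this kernel at the chosen block size / dynamic shared memory.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, dummyKernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);

    // Upper bound on the grid size usable for a cooperative launch.
    int const maxCoopBlocks = blocksPerSm * prop.multiProcessorCount;
    std::printf("max cooperative blocks: %d\n", maxCoopBlocks);
    return 0;
}
```

The user would cache `maxCoopBlocks` and clamp the grid size of any cooperative launch to it.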
For some reason gridSync locks up when compiled with CUDA Clang if numberOfBlocks > 2 * multiProcessorCount.
//! Hello world kernel, utilizing grid synchronization.
//! Prints hello world from a thread, performs grid sync,
//! and prints the sum of indices of this thread and the opposite thread (the sums have to be the same).
[nit] Could you explain what the opposite thread is here?
The thread that has the same distance from the end of the grid dimension as this one has from the start. So, if the IDs range from 0 to 9, the pairs are 0 and 9, 1 and 8, 2 and 7, and so on. Their sum is constant, so we can check whether the grid sync was performed successfully.
Could you add more details to the PR description?
That depends on the desired scope of the PR; I've deliberately kept it vague, so we can decide when to merge it.
This looks like an upstream issue, at least locally, when compiling with Clang 17.0.6 and CUDA 12.1.1 or 12.5.
Is it supposed to work?
Yes, it is. The maximum number reported is 16 * multiProcessorCount, and the kernel refuses to launch on both nvcc and Clang if this number is exceeded.
Force-pushed from bb5ccaa to 9614d9c
Nice 👍🏻
It doesn't actually work, unfortunately; I will investigate. It would be funny if I found a second compiler bug.
Force-pushed from 26c4803 to 565c944
Force-pushed from 36c2af3 to 334cea9
…nching the specified cooperative kernel
Force-pushed from 334cea9 to 4ad8bae
static inline Error_t launchCooperativeKernel(
    void const* func,
    dim3 gridDim,
    dim3 blockDim,
    void** args,
    size_t sharedMem,
    Stream_t stream)
{
    return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
}
Could you change this to be templated on the `func` argument?
Suggested change:

template<typename TFunc>
static inline Error_t launchCooperativeKernel(
    TFunc func,
    dim3 gridDim,
    dim3 blockDim,
    void** args,
    size_t sharedMem,
    Stream_t stream)
{
    return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
}
Same for the HIP implementation.
void const* kernelArgs[] = {&threadElemExtent, &task.m_kernelFnObj, &args...};

TApi::launchCooperativeKernel(
    reinterpret_cast<void*>(kernelName),
Can you check if the cast can be removed, after implementing https://github.com/alpaka-group/alpaka/pull/2307/files#r1714986927 ?
No, you now have to do the same cast inside of launchCooperativeKernel
…td::threads accelerator
@MichaelVarvarin Sorry for removing the draft state. I thought I'd start the GitHub Actions jobs. Not sure what went wrong.
Add cooperative groups and gridSync to SYCL
Add support for cooperative groups and related functionality