Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for device and constant global variables in the SYCL backend #2242

Merged
merged 13 commits into from
Apr 16, 2024

Conversation

AuroraPerego
Copy link
Contributor

This tries to fix #2070.
This implementation changes the current API.
The declaration of device memory is done with the macro ALPAKA_STATIC_ACC_MEM_GLOBAL(type, name) that wraps the CUDA/HIP/serial variables in a struct alpaka::DevGlobal and declares a sycl::ext::oneapi::experimental::device_global<type> for SYCL. The inline attribute is used to ensure that only one instance of that variable exists across different translation units.
name is used for the memcpy, while name.get() must be used in the kernel to align with the behavior of the SYCL backend.
The memcpy has been specialized for the device global variables.

The test with the SYCL backend failed with the original KernelExecutionFixture because it creates a new queue instead of using the one used for the memcpy. I added a constructor that takes in input also the queue.

Another issue with the test is that being compiled with the flags to enable the SYCL backend, the macro for the device global variable expands to the SYCL one (sycl::ext::oneapi::experimental::device_global<type>) and therefore it fails when running on the AccCpuSerial. I have disabled this accelerator just for this test.
Note that this happens also with the CUDA backend (with the macro expanded to __device__ alpaka::DevGlobal<type> name), but for some strange reason it works on the serial backend.

Thanks to @fwyzard for the help :)

Copy link
Contributor

@fwyzard fwyzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for this work !

As it introduces a breaking changes to the API, I would ask for enough time to consider all impacts before merging the changes.

include/alpaka/core/Common.hpp Outdated Show resolved Hide resolved
//!
//! struct DeviceMemoryKernel
//! {
//! ALPAKA_NO_HOST_ACC_WARNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ALPAKA_NO_HOST_ACC_WARNING ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include/alpaka/core/Common.hpp Outdated Show resolved Hide resolved
//!
//! struct DeviceMemoryKernel
//! {
//! ALPAKA_NO_HOST_ACC_WARNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ALPAKA_NO_HOST_ACC_WARNING ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

include/alpaka/core/Common.hpp Outdated Show resolved Hide resolved
include/alpaka/test/KernelExecutionFixture.hpp Outdated Show resolved Hide resolved
include/alpaka/test/KernelExecutionFixture.hpp Outdated Show resolved Hide resolved
include/alpaka/test/KernelExecutionFixture.hpp Outdated Show resolved Hide resolved
test/unit/mem/view/src/ViewStaticAccMem.cpp Outdated Show resolved Hide resolved
include/alpaka/core/Common.hpp Outdated Show resolved Hide resolved
Comment on lines 1 to 2
/* Copyright 2023 Axel Hübl, Benjamin Worpitz, Matthias Werner, Andrea Bocci, Jan Stephan, Bernhard Manfred Gruber,
* Aurora Perego
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* Copyright 2023 Axel Hübl, Benjamin Worpitz, Matthias Werner, Andrea Bocci, Jan Stephan, Bernhard Manfred Gruber,
* Aurora Perego
/* Copyright 2024 Axel Hübl, Benjamin Worpitz, Matthias Werner, Andrea Bocci, Jan Stephan, Bernhard Manfred Gruber,
* Aurora Perego

@AuroraPerego AuroraPerego force-pushed the device_global branch 3 times, most recently from d596e0b to 81aa05e Compare February 23, 2024 16:27

namespace alpaka
{
using sycl::ext::oneapi::experimental::device_global;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes device_global visible to all in the alpaka namespace, which may not be a good idea.

Could you move it at least inside the detail namespace (and use it as detail::device_global below) ?

Or, just use it fully expanded everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used it fully expanded

template<typename T>
struct DevGlobalTrait<TagGpuHipRt, T>
{
// CUDA implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// CUDA implementation
// HIP/ROCm implementation

Comment on lines 49 to 54

template<typename TAcc, typename TViewSrc, typename TViewDstFwd, typename TQueue>
ALPAKA_FN_HOST auto memcpy(
TQueue& queue,
alpaka::detail::DevGlobalImplGeneric<TAcc, TViewDstFwd>& viewDst,
TViewSrc const& viewSrc) -> void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems too generic, as in principle it can match also for a CUDA DevGlobalImplGeneric.

Also, TAcc should be TTag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a constrain that TAcc (actually TTag) is one of TagCpuOmp2Blocks, TagCpuOmp2Threads, TagCpuSerial, TagCpuTbbBlocks, TagCpuThreads ?

There may be a smarter way, but at least this should avoid the wrong matches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, IIUC TViewDstFwd here is not really a View, but the underlying type of the global variable.
If that's the case, could you rename it to TType ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a constrain that TAcc (actually TTag) is one of TagCpuOmp2Blocks, TagCpuOmp2Threads, TagCpuSerial, TagCpuTbbBlocks, TagCpuThreads ?

ok, I'll try (and I'll probably also have to prevent the test from running on AccCpuSerial when it is compiled with the flags to enable CUDA/HIP/SYCL)

By the way, IIUC TViewDstFwd here is not really a View, but the underlying type of the global variable. If that's the case, could you rename it to TType ?

You are right, I'll change it

Comment on lines 34 to 38
template<typename TAcc, typename TApi, bool TBlocking, typename TViewDst, typename TViewSrc>
ALPAKA_FN_HOST auto memcpy(
uniform_cuda_hip::detail::QueueUniformCudaHipRt<TApi, TBlocking>& queue,
TViewDst& viewDst,
alpaka::detail::DevGlobalImplGeneric<TAcc, TViewSrc>& viewSrc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TAcc should be TTag, it should somehoe match the TApi (we don't want to use CUDA on a HIP/ROCm global variable).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there is a map between TTag and TApi, but I'll look into it

@AuroraPerego
Copy link
Contributor Author

Device global variables in oneAPI versions 2023.* require explicitly the device image scope in the constructor, while from 2024.0 this is not needed anymore. For this reason, the required compiler version has been changed to 2024.0.

Now the only failures in the tests are some warnings (that become errors when compiling the tests) in the SYCL headers / footers.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 6, 2024

Now the only failures in the tests are some warnings (that become errors when compiling the tests) in the SYCL headers / footers.

@psychocoderHPC can we remove -Wreserved-identifier -Wold-style-cast from the flags passed to icpx ?
The offending code is autogenerated, and at least for -Wreserved-identifier is included before any user or alpaka code, so we cannot disable the warnings with a #pragma in the code.

@psychocoderHPC
Copy link
Member

Now the only failures in the tests are some warnings (that become errors when compiling the tests) in the SYCL headers / footers.

@psychocoderHPC can we remove -Wreserved-identifier -Wold-style-cast from the flags passed to icpx ? The offending code is autogenerated, and at least for -Wreserved-identifier is included before any user or alpaka code, so we cannot disable the warnings with a #pragma in the code.

offline discussed: yes we can remove the options if nessesary

alpaka::memcpy(
queueAcc,
bufHost2,
g_globalMemory2DUninitialized<typename alpaka::trait::AccToTag<Acc>::type>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer passing the Acc type directly instead using the type trait alpaka::trait::AccToTag. I think the common case is, that we have the Acc type available and not the tag type, therefore we would have a lot of copy/past code.

Depending on if it makes sense to provide the possibility to use tags in the user API, you can ether move typename alpaka::trait::AccToTag<Acc>::type to the implementation (no support for tags in the user API) or do some kind of overload and check if the given type is a alpaka accelerator type or tag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved it to the implementation

template<typename TAcc, typename T>
    using DevGlobal = typename detail::DevGlobalTrait<typename alpaka::trait::AccToTag<TAcc>::type, T>::Type;

@psychocoderHPC
Copy link
Member

@alpaka-group/alpaka-maintainers IMO we can merge this PR. Are there any voices against merging it?

@fwyzard
Copy link
Contributor

fwyzard commented Apr 12, 2024

OK for me.

Any further developments can happen in follow up PRs.

@psychocoderHPC psychocoderHPC merged commit 88c776b into alpaka-group:develop Apr 16, 2024
22 checks passed
@@ -16,92 +16,101 @@ using Elem = std::uint32_t;
using Dim = alpaka::DimInt<2u>;
using Idx = std::uint32_t;

#if !defined(ALPAKA_ACC_SYCL_ENABLED)
ALPAKA_STATIC_ACC_MEM_GLOBAL alpaka::DevGlobal<TAcc, Elem[3][2]> g_globalMemory2DUninitialized;
Copy link
Member

@psychocoderHPC psychocoderHPC Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AuroraPerego We need to review this change again because of the bug shown #2259

In the original code we had constant global memory only and in the new code we have global memory and constant memory.
With your changes is there a difference between ALPAKA_STATIC_ACC_MEM_GLOBAL and ALPAKA_STATIC_ACC_MEM_CONSTANT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO in this section we should only use ALPAKA_STATIC_ACC_MEM_CONSTANT and later we have a test for ALPAKA_STATIC_ACC_MEM_GLOBAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code we had constant global memory only and in the new code we have global memory and constant memory.

They were different also in the original code, with ALPAKA_STATIC_ACC_MEM_GLOBAL expanding to __device__ and ALPAKA_STATIC_ACC_MEM_CONSTANT expanding to __constant__.

With your changes is there a difference between ALPAKA_STATIC_ACC_MEM_GLOBAL and ALPAKA_STATIC_ACC_MEM_CONSTANT?

Yes, they are using the __device__ and __constant__ attributes respectively (plus the inline attribute)

IMO in this section we should only use ALPAKA_STATIC_ACC_MEM_CONSTANT and later we have a test for ALPAKA_STATIC_ACC_MEM_GLOBAL

ok, but why? we were testing both here also before this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but why? we were testing both here also before this PR

This is a good question.

What if we define ALPAKA_STATIC_ACC_MEM_GLOBAL static instead of inline this should create a single instance too.

Copy link
Contributor Author

@AuroraPerego AuroraPerego Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it compiles and also solves the issue in #2259, I can make a PR with the change if you think it's correct (I don't know enough about static, inline and extern to be sure of that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, what macros should be used to support all use cases ?

file1.cu

static __device__ int i = 0;

and

file2.h

extern __device__ int i;

file2.cu

#include "file2.h"

__device__ int i = 0;

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, maybe static is not what we want. The ideal would be to make the extern keyword work also with the new implementation to allow the file2.h/file2.cu case, which IIUC is not possible with static.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

device global variables in SYCL
5 participants