-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SM90 Support #126
base: sycl-develop
Are you sure you want to change the base?
SM90 Support #126
Conversation
cmake/FindDPCPP.cmake
Outdated
@@ -51,6 +51,8 @@ endif() | |||
|
|||
if(NOT "${DPCPP_SYCL_ARCH}" STREQUAL "") | |||
if("${DPCPP_SYCL_TARGET}" STREQUAL "nvptx64-nvidia-cuda") | |||
list(APPEND DPCPP_FLAGS "-fno-sycl-decompose-functor;") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have comments behind any non-obvious flag to understand in the feature why we added certain flags.
@@ -38,7 +38,7 @@ find_library(DPCPP_LIB_DIR NAMES sycl sycl6 PATHS "${DPCPP_BIN_DIR}/../lib") | |||
|
|||
add_library(DPCPP::DPCPP INTERFACE IMPORTED) | |||
|
|||
set(DPCPP_FLAGS "-fsycl;") | |||
set(DPCPP_FLAGS "-fsycl;-mllvm;-enable-global-offset=false;") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag was moved to line 58
list(APPEND DPCPP_COMPILE_ONLY_FLAGS; "-mllvm;-enable-global-offset=false;")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a TODO comment which I thought I had added as a part of 0c9d5e1 over here,
which basically was basically about investigating why this line is needed,
I was aware of this change, but for some reason I was still seeing a kernel *_with_offset
, hence I added that as a temporary fix,
This is also partly the reason why this PR is draft
#if defined(SYCL_NVIDIA_TARGET) | ||
using namespace cutlass; | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because types like cudaError_t and cudaSuccess are defined in the cutlass namespace in the non cuda path
|
||
|
||
cutlass_example_add_executable( | ||
48_hopper_warp_specialized_gemm | ||
48_hopper_warp_specialized_gemm.cu | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert?
examples/CMakeLists.txt
Outdated
|
||
if (DPCPP_SYCL_ARCH STREQUAL "sm_90a") | ||
SET(ADD_CUDA ON) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: this is needed to call the function that initialise the TMA descriptor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to put the comments there for future reference
include/cute/arch/cluster_sm90.hpp
Outdated
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))) | ||
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))) || \ | ||
(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the __PTX_VERSION__
instead?
// Copy from global to shared::cluster. | ||
// Copy from global to shared::cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert?
&tma_desc, | ||
reinterpret_cast<CUtensorMap*>(&tma_desc), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the CuTensorMapEncodeTiled accepts a pointer to CUtensorMap,
it otherwise leads to a compilation error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must clarify that this change is only temporary,
till we have the tensor map initialization via SYCL support
include/cutlass/cutlass.h
Outdated
#else | ||
return 0; | ||
#endif | ||
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0); | |
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0); |
#if defined(CUTLASS_ENABLE_SYCL) | ||
#include <syclcompat/syclcompat.hpp> | ||
namespace sc = syclcompat; | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't syclcompat already included?
@@ -84,15 +86,15 @@ warpgroup_fence_operand(uint32_t& reg) { | |||
// MSVC emits a build error for 'asm volatile' | |||
// even if it only occurs in a __device__ function. | |||
// This prevents the error. | |||
#if defined(__CUDA_ARCH__) | |||
#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This SYCL_CUDA_ARCH seems to create a lot of noise in the code can we we wrap it up with cuda_arch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we cannot do that yet.
So there is a cuda compatibility flag planned (-fsycl-cuda-compatibility), which will define the CUDA_ARCH and more, but those will be a bit more involved changes, as there is still a lot of nvcc specific code which currently shielded by the CUDA_ARCH, namely nvcc intrinsics which would not pertain to the functionality added in this PR, but would come as a part of a later PR
@@ -762,7 +762,7 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and | |||
#include <cute/atom/copy_traits_sm90.hpp> | |||
|
|||
// Config | |||
#if (__CUDACC_VER_MAJOR__ >= 12) | |||
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use PTX version for SYCL instead of SYCL_NVIDIA_TARGET. Since SYCL_NVIDIA_TARGET is more generic than versioning
#define CUTLASS_ARCH_MMA_SM90_ENABLED | ||
#endif | ||
#endif | ||
#endif | ||
|
||
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) | ||
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \ | ||
defined(SYCL_NVIDIA_TARGET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, the SYCL_NVIDIA_TARGET covers a wide range of targets including SM80. we need to use PTX version here or at least make sure that the Nvidia target >= 900
@@ -33,7 +33,7 @@ | |||
#include "cutlass/conv/collective/builders/sm90_common.inl" | |||
|
|||
// SM90 Collective Builders should be used only starting CUDA 12.0 | |||
#if (__CUDACC_VER_MAJOR__ >= 12) | |||
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL__NVIDIA_TARGET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@@ -33,7 +33,7 @@ | |||
#include "cutlass/gemm/collective/builders/sm90_common.inl" | |||
|
|||
// SM90 Collective Builders should be used only starting CUDA 12.0 | |||
#if (__CUDACC_VER_MAJOR__ >= 12) | |||
#if (__CUDACC_VER_MAJOR__ >= 12) || (SYCL_NVIDIA_TARGET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@@ -961,7 +961,8 @@ static constexpr bool OnlyOneIsTuple = cute::is_tuple<ElementA>::value ^ cute::i | |||
static constexpr bool IsDifferentWidth = sizeof_bits<ExtractedElementA>::value != sizeof_bits<ExtractedElementB>::value; | |||
static constexpr bool IsMixedWidthInput = IsDifferentWidth || (IsDifferentWidth && OnlyOneIsTuple); | |||
|
|||
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1))) | |||
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1))) || \ | |||
defined(SYCL_NVIDIA_TARGET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here try to use specific PTX versioning SYCL provide
constexpr bool is_static_1x1x1 = cute::is_static_v<typename GemmKernel::DispatchPolicy::ClusterShape> and | ||
cute::size(typename GemmKernel::DispatchPolicy::ClusterShape{}) == 1; | ||
dim3 cluster(cute::size<0>(typename GemmKernel::DispatchPolicy::ClusterShape{}), | ||
cute::size<1>(typename GemmKernel::DispatchPolicy::ClusterShape{}), | ||
cute::size<2>(typename GemmKernel::DispatchPolicy::ClusterShape{})); | ||
void* kernel_params[] = {¶ms}; | ||
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noise
…noise from PR" This reverts commit d3d97ab.
Co-authored-by: Mehdi Goli <[email protected]>
I have selectively applied the |
Adds SM 90 support for GEMM, and enables example 48
This PR wait on the following compiler features (hence draft)