
SM90 Support #126

Draft · wants to merge 25 commits into base: sycl-develop

Conversation

@AD2605 (Collaborator) commented Aug 26, 2024

Adds SM 90 support for GEMM, and enables example 48

This PR waits on the following compiler features (hence the draft status):

  • Workgroup static extension
  • TensorMap data structure and initialization via SYCL
  • Launch property for .maxntid (hence using the explicit queue.submit for now)

@@ -51,6 +51,8 @@ endif()

if(NOT "${DPCPP_SYCL_ARCH}" STREQUAL "")
if("${DPCPP_SYCL_TARGET}" STREQUAL "nvptx64-nvidia-cuda")
list(APPEND DPCPP_FLAGS "-fno-sycl-decompose-functor;")
Collaborator:

Would be good to have comments behind any non-obvious flag so we can understand in the future why we added certain flags.

include/cutlass/arch/memory.h (resolved)
@@ -38,7 +38,7 @@ find_library(DPCPP_LIB_DIR NAMES sycl sycl6 PATHS "${DPCPP_BIN_DIR}/../lib")

add_library(DPCPP::DPCPP INTERFACE IMPORTED)

set(DPCPP_FLAGS "-fsycl;")
set(DPCPP_FLAGS "-fsycl;-mllvm;-enable-global-offset=false;")
Collaborator:

This flag was moved to line 58

 list(APPEND DPCPP_COMPILE_ONLY_FLAGS; "-mllvm;-enable-global-offset=false;")

@AD2605 (Author), Aug 26, 2024:

There was a TODO comment, which I thought I had added as part of 0c9d5e1, about investigating why this line is needed.

I was aware of this change, but for some reason I was still seeing a *_with_offset kernel, hence I added this as a temporary fix. This is also partly why this PR is a draft.

Comment on lines +89 to +92
#if defined(SYCL_NVIDIA_TARGET)
using namespace cutlass;
#endif

Collaborator:

Why is this needed?

Collaborator (Author):

Because types like cudaError_t and cudaSuccess are defined in the cutlass namespace in the non-CUDA path.

Comment on lines 30 to 33


cutlass_example_add_executable(
48_hopper_warp_specialized_gemm
48_hopper_warp_specialized_gemm.cu
)
)
Collaborator:

Revert?

Comment on lines 54 to 57

if (DPCPP_SYCL_ARCH STREQUAL "sm_90a")
SET(ADD_CUDA ON)
endif()
Collaborator:

For context: this is needed to call the function that initialises the TMA descriptor.

Collaborator:

Better to put the comments there for future reference

Comment on lines 37 to 38
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8))))
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))) || \
(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900))
Collaborator:

Can we use the __PTX_VERSION__ instead?

intel/llvm#14621 (comment)

// Copy from global to shared::cluster.
// Copy from global to shared::cluster
Collaborator:

Revert?

Comment on lines -987 to +988
&tma_desc,
reinterpret_cast<CUtensorMap*>(&tma_desc),
Collaborator:

Is this needed?

Collaborator (Author):

Yes, cuTensorMapEncodeTiled accepts a pointer to CUtensorMap; it otherwise leads to a compilation error.

Collaborator (Author):

I must clarify that this change is only temporary, until we have tensor map initialization via SYCL support.

#else
return 0;
#endif
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);
Collaborator:

Suggested change
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);

Comment on lines +48 to +52
#if defined(CUTLASS_ENABLE_SYCL)
#include <syclcompat/syclcompat.hpp>
namespace sc = syclcompat;
#endif

Collaborator:

Isn't syclcompat already included?

@@ -84,15 +86,15 @@ warpgroup_fence_operand(uint32_t& reg) {
// MSVC emits a build error for 'asm volatile'
// even if it only occurs in a __device__ function.
// This prevents the error.
#if defined(__CUDA_ARCH__)
#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
Collaborator:

This SYCL_CUDA_ARCH seems to create a lot of noise in the code; can we wrap it up with CUDA_ARCH?

Collaborator (Author):

No, we cannot do that yet.
There is a CUDA compatibility flag planned (-fsycl-cuda-compatibility) which will define CUDA_ARCH and more, but those will be somewhat more involved changes: there is still a lot of nvcc-specific code currently shielded by CUDA_ARCH, namely nvcc intrinsics, which does not pertain to the functionality added in this PR and would come as part of a later PR.

@@ -762,7 +762,7 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and
#include <cute/atom/copy_traits_sm90.hpp>

// Config
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET)
Collaborator:

Can we use the PTX version for SYCL instead of SYCL_NVIDIA_TARGET, since SYCL_NVIDIA_TARGET is more generic than versioning?

#define CUTLASS_ARCH_MMA_SM90_ENABLED
#endif
#endif
#endif

#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3)))
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \
defined(SYCL_NVIDIA_TARGET)
Collaborator:

Here as well, SYCL_NVIDIA_TARGET covers a wide range of targets, including SM80. We need to use the PTX version here, or at least make sure that the NVIDIA target is >= 900.

@@ -33,7 +33,7 @@
#include "cutlass/conv/collective/builders/sm90_common.inl"

// SM90 Collective Builders should be used only starting CUDA 12.0
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL__NVIDIA_TARGET)
Collaborator:

same here

@@ -33,7 +33,7 @@
#include "cutlass/gemm/collective/builders/sm90_common.inl"

// SM90 Collective Builders should be used only starting CUDA 12.0
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || (SYCL_NVIDIA_TARGET)
Collaborator:

same here

@@ -961,7 +961,8 @@ static constexpr bool OnlyOneIsTuple = cute::is_tuple<ElementA>::value ^ cute::i
static constexpr bool IsDifferentWidth = sizeof_bits<ExtractedElementA>::value != sizeof_bits<ExtractedElementB>::value;
static constexpr bool IsMixedWidthInput = IsDifferentWidth || (IsDifferentWidth && OnlyOneIsTuple);

#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1)))
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1))) || \
defined(SYCL_NVIDIA_TARGET)
Collaborator:

Same here; try to use the specific PTX versioning SYCL provides.

constexpr bool is_static_1x1x1 = cute::is_static_v<typename GemmKernel::DispatchPolicy::ClusterShape> and
cute::size(typename GemmKernel::DispatchPolicy::ClusterShape{}) == 1;
dim3 cluster(cute::size<0>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
cute::size<1>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
cute::size<2>(typename GemmKernel::DispatchPolicy::ClusterShape{}));
void* kernel_params[] = {&params};

Collaborator:

noise

@AD2605 (Author) commented Aug 27, 2024:

I have selectively applied the __PTX_VERSION__ macro, because it is not defined on the host.
