
SM90 Support #126

Draft · wants to merge 25 commits into base: sycl-develop

Conversation

@AD2605 (Collaborator) commented Aug 26, 2024

Adds SM 90 support for GEMM, and enables example 48

This PR waits on the following compiler features (hence the draft status):

  • Workgroup static extension
  • TensorMap data structure and initialization via SYCL
  • Launch property for .maxntid (hence using the explicit queue.submit for now)

@@ -51,6 +51,8 @@ endif()

if(NOT "${DPCPP_SYCL_ARCH}" STREQUAL "")
if("${DPCPP_SYCL_TARGET}" STREQUAL "nvptx64-nvidia-cuda")
list(APPEND DPCPP_FLAGS "-fno-sycl-decompose-functor;")
Collaborator:

Would be good to have comments behind any non-obvious flag so we can understand in the future why we added certain flags.

include/cutlass/arch/memory.h (resolved)
@@ -38,7 +38,7 @@ find_library(DPCPP_LIB_DIR NAMES sycl sycl6 PATHS "${DPCPP_BIN_DIR}/../lib")

add_library(DPCPP::DPCPP INTERFACE IMPORTED)

set(DPCPP_FLAGS "-fsycl;")
set(DPCPP_FLAGS "-fsycl;-mllvm;-enable-global-offset=false;")
Collaborator:

This flag was moved to line 58

 list(APPEND DPCPP_COMPILE_ONLY_FLAGS; "-mllvm;-enable-global-offset=false;")

@AD2605 (Author), Aug 26, 2024:

There was a TODO comment, which I thought I had added as part of 0c9d5e1, about investigating why this line is needed.

I was aware of this change, but for some reason I was still seeing a *_with_offset kernel, hence I added this as a temporary fix. This is also partly why this PR is a draft.

Comment on lines +89 to +92
#if defined(SYCL_NVIDIA_TARGET)
using namespace cutlass;
#endif

Collaborator:

Why is this needed?

Collaborator (Author):

Because types like cudaError_t and cudaSuccess are defined in the cutlass namespace in the non-CUDA path.

Comment on lines 30 to 33


cutlass_example_add_executable(
48_hopper_warp_specialized_gemm
48_hopper_warp_specialized_gemm.cu
)
)
Collaborator:

Revert?

Comment on lines 54 to 57

if (DPCPP_SYCL_ARCH STREQUAL "sm_90a")
SET(ADD_CUDA ON)
endif()
Collaborator:

For context: this is needed to call the function that initialises the TMA descriptor.

Collaborator:

Better to put the comments there for future reference

Comment on lines 37 to 38
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8))))
((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))) || \
(defined(__SYCL_CUDA_ARCH__) && (__SYCL_CUDA_ARCH__ >= 900))
Collaborator:

Can we use the __PTX_VERSION__ instead?

intel/llvm#14621 (comment)

// Copy from global to shared::cluster.
// Copy from global to shared::cluster
Collaborator:

Revert?

Comment on lines -987 to +988
&tma_desc,
reinterpret_cast<CUtensorMap*>(&tma_desc),
Collaborator:

Is this needed?

Collaborator (Author):

Yes, cuTensorMapEncodeTiled accepts a pointer to CUtensorMap; it otherwise leads to a compilation error.

Collaborator (Author):

I must clarify that this change is only temporary, until we have tensor map initialization via SYCL support.

#else
return 0;
#endif
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);
Collaborator:

Suggested change
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);
return shfl_sync(0xffffffff, ThreadIdxX() / NumThreadsPerWarp, 0);

Comment on lines +48 to +52
#if defined(CUTLASS_ENABLE_SYCL)
#include <syclcompat/syclcompat.hpp>
namespace sc = syclcompat;
#endif

Collaborator:

Isn't syclcompat already included?

@@ -84,15 +86,15 @@ warpgroup_fence_operand(uint32_t& reg) {
// MSVC emits a build error for 'asm volatile'
// even if it only occurs in a __device__ function.
// This prevents the error.
#if defined(__CUDA_ARCH__)
#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__)
Collaborator:

This SYCL_CUDA_ARCH seems to create a lot of noise in the code; can we wrap it up with CUDA_ARCH?

Collaborator (Author):

No, we cannot do that yet.
There is a CUDA compatibility flag planned (-fsycl-cuda-compatibility) which will define CUDA_ARCH and more, but those will be somewhat more involved changes: there is still a lot of nvcc-specific code currently shielded by CUDA_ARCH, namely nvcc intrinsics, which does not pertain to the functionality added in this PR and would come as part of a later PR.

@@ -762,7 +762,7 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and
#include <cute/atom/copy_traits_sm90.hpp>

// Config
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL_NVIDIA_TARGET)
Collaborator:

Can we use the PTX version for SYCL instead of SYCL_NVIDIA_TARGET, since SYCL_NVIDIA_TARGET is more generic than versioning?

#define CUTLASS_ARCH_MMA_SM90_ENABLED
#endif
#endif
#endif

#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3)))
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) || \
defined(SYCL_NVIDIA_TARGET)
Collaborator:

Here as well, SYCL_NVIDIA_TARGET covers a wide range of targets, including SM80. We need to use the PTX version here, or at least make sure that the NVIDIA target is >= 900.

@@ -33,7 +33,7 @@
#include "cutlass/conv/collective/builders/sm90_common.inl"

// SM90 Collective Builders should be used only starting CUDA 12.0
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || defined(SYCL__NVIDIA_TARGET)
Collaborator:

same here

@@ -33,7 +33,7 @@
#include "cutlass/gemm/collective/builders/sm90_common.inl"

// SM90 Collective Builders should be used only starting CUDA 12.0
#if (__CUDACC_VER_MAJOR__ >= 12)
#if (__CUDACC_VER_MAJOR__ >= 12) || (SYCL_NVIDIA_TARGET)
Collaborator:

same here

@@ -961,7 +961,8 @@ static constexpr bool OnlyOneIsTuple = cute::is_tuple<ElementA>::value ^ cute::i
static constexpr bool IsDifferentWidth = sizeof_bits<ExtractedElementA>::value != sizeof_bits<ExtractedElementB>::value;
static constexpr bool IsMixedWidthInput = IsDifferentWidth || (IsDifferentWidth && OnlyOneIsTuple);

#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1)))
#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 1))) || \
defined(SYCL_NVIDIA_TARGET)
Collaborator:

Same here; try to use the specific PTX versioning SYCL provides.

constexpr bool is_static_1x1x1 = cute::is_static_v<typename GemmKernel::DispatchPolicy::ClusterShape> and
cute::size(typename GemmKernel::DispatchPolicy::ClusterShape{}) == 1;
dim3 cluster(cute::size<0>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
cute::size<1>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
cute::size<2>(typename GemmKernel::DispatchPolicy::ClusterShape{}));
void* kernel_params[] = {&params};

Collaborator:

noise

@AD2605 (Author) commented Aug 27, 2024:

I have selectively applied the __PTX_VERSION__ macro, because it is not defined on the host.
