Implement SplitK and StreamK algorithm for Intel PVC #132

muhammad-tanvir-1211 · 2024-09-02T16:25:13Z

This PR adds the required code for implementing the SplitK and StreamK work distribution algorithms for Intel PVC.

AD2605 · 2024-09-06T09:39:41Z

Could we also add the collective builder struct specialization for StreamK/SplitK as part of this PR?

include/cutlass/gemm/kernel/intel_pvc_gemm_streamk.hpp

include/cutlass/gemm/kernel/intel_pvc_persistent_tile_scheduler_params_streamk.hpp

examples/sycl/pvc/pvc_gemm_streamk.cpp

include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp

AD2605 · 2024-09-11T10:22:01Z

In light of #138, could you rename the newly added intel_pvc_* files to xe_*? It would be one less change later on.

rolandschulz · 2024-09-11T20:10:54Z

include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp

+        BlockStripedReduceT::store(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);
+      }
+      else {
+        // Wait until the preceding split added its accumulators


Why does this need to wait on the preceding split even in the non-deterministic case?

rolandschulz · 2024-09-11T21:43:23Z

include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp

+        BarrierManager::wait_eq(barrier_idx, lock_workspace, barrier_group_thread_idx, lock_idx, work_tile_info.K_idx);
+
+        // Perform reduction in workspace
+        BlockStripedReduceT::reduce(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);


For the deterministic case, why does that need to be reduce (atomic_add), and why isn't load_add sufficient? My understanding is that in the deterministic case only a single K_idx adds at any one time, and if so, atomics shouldn't be needed.
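To make the question above concrete, here is a minimal host-side sketch of the ordering argument. It is not the code in this PR and does not use CUTLASS's BarrierManager or BlockStripedReduceT: `TileFlag` and `ordered_reduce` are hypothetical names, and `std::thread` stands in for the GPU work-groups. It assumes that every split waits for a per-tile counter to equal its own K index before touching the workspace, so only one split writes at a time and a plain load-add is enough.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-tile flag (an assumption for illustration, not a CUTLASS type).
struct TileFlag {
  std::atomic<int> count{0};

  // Spin until the counter equals `expected`.
  void wait_eq(int expected) const {
    while (count.load(std::memory_order_acquire) != expected) {
      std::this_thread::yield();
    }
  }

  // Publish this split's contribution and admit the next split.
  void arrive_inc() { count.fetch_add(1, std::memory_order_release); }
};

// Accumulates one partial per split into a single workspace value, strictly in
// K order. The workspace is a plain float on purpose: the flag serializes the
// writers, so no atomic add is required.
inline float ordered_reduce(const std::vector<float>& partials) {
  float workspace = 0.0f;
  TileFlag flag;
  std::vector<std::thread> workers;
  const int splits = static_cast<int>(partials.size());

  for (int k = 0; k < splits; ++k) {
    workers.emplace_back([&, k] {
      flag.wait_eq(k);           // wait until split k-1 has added its partial
      workspace += partials[k];  // plain load-add, safe because access is serialized
      flag.arrive_inc();         // signal arrival to split k+1
    });
  }
  for (auto& w : workers) {
    w.join();
  }

  // Every split must have arrived exactly once.
  assert(flag.count.load() == splits);
  return workspace;
}
```

Because the summation order is fixed, the floating-point result is the same on every run. If the splits were instead allowed to add in arbitrary order, each update would have to be atomic and the rounding could differ between runs.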

include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp

examples/sycl/pvc/pvc_gemm_streamk.cpp

muhammad-tanvir-1211 · 2024-09-16T15:22:37Z

> In light of #138, could you rename the newly added intel_pvc_* files to xe_*? It would be one less change later on.

Done.

cmake/FindDPCPP.cmake

include/cutlass/arch/barrier.h

include/cutlass/gpu_generics.h

include/cutlass/workspace.h

muhammad-tanvir-1211 added 13 commits August 21, 2024 15:13

WIP: Introduce StreamK for PVC

49a986a

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cutlass-fork into intel_pvc_streamk

758038a

fixed starting index calculation

ce8b3a2

Fixed barrier count update

9599d62

Fixed compilation for normal GEMM

76e54f9

Perform fixup using threadid instead of subgroup_id

b072873

Fixed the k_idx offset for MMA atom and corrected the reduction offset calculation

6be72be

Use log2 for available_xecores

e00f9da

SplitK working

224b316

Minor cleanup

59b884f

* Need to fix splitK for batch > 1

Fixed splitK for batch > 1

4e9f9c3

Re-enabled GEMM Universal Adapter specialization

345dcae

Update split barrier arguments

05b487a

muhammad-tanvir-1211 marked this pull request as draft September 2, 2024 16:25

muhammad-tanvir-1211 and others added 3 commits September 2, 2024 17:28

Minor cleanup

bff4801

Changed initialization to workspace only

3300bf7

Merge branch 'sycl-develop' into intel_pvc_streamk

0db7398

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch from b6d7abf to 58d082e on September 3, 2024 13:29

Fix CI failure

f559e31

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch from 58d082e to f559e31 on September 3, 2024 13:58

muhammad-tanvir-1211 added 2 commits September 4, 2024 12:04

Added support for scheduling non-uniform tiles

bcf812e

Only include split barrier flags for PVC

1544d51

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch 2 times, most recently from cef3fec to e4fefc2 on September 4, 2024 11:59

Test

2ab1cd8

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch from e4fefc2 to 2ab1cd8 on September 4, 2024 12:19

muhammad-tanvir-1211 added 2 commits September 4, 2024 14:32

Code cleanup

750ee3a

Add separate example for StreamK

8924208

muhammad-tanvir-1211 requested review from mehdi-goli and AD2605 September 4, 2024 13:52

AD2605 reviewed Sep 6, 2024

muhammad-tanvir-1211 added 4 commits September 6, 2024 16:39

Address feedback for split barrier

e8b2d24

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cutlass-fork into intel_pvc_streamk

16f31a2

Fix address space for atomicAdd

c3875e7

* Instantiate new accumulator registers per iteration

Renamed the pipeline file

7cfbf62

rolandschulz reviewed Sep 11, 2024

AD2605 reviewed Sep 12, 2024

include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp (outdated, resolved)

rolandschulz reviewed Sep 13, 2024

examples/sycl/pvc/pvc_gemm_streamk.cpp (resolved)

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cutlass-fork into intel_pvc_streamk

e08e740

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch from 82d7092 to db705da on September 16, 2024 15:20

Renamed files to xe_*

93d87fc

* Removed l2 workspace alignment

muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch from db705da to 93d87fc on September 16, 2024 15:51

muhammad-tanvir-1211 added 3 commits September 18, 2024 15:04

Update the example to reduce caching effects

e2a0d9b

Refactor pipeline code

6d19000

Add the option to invoke data parallel decomposition

c06a28e