-
I'm trying to understand the problem raised in issue #5. The critical constraints on a code emitter need to be spelled out for me; otherwise I lose the important criteria for deciding what a reasonable structure for organizing the kernel implementations looks like. Looking at the implementation we have on hand (below), what trouble does it cause for creating a clean code emitter? I would also like to make the macro kernel implementation fit the code emitter.

(TiledCUDA/src/kernels/cute_gemm.cu, lines 64 to 70 at 8205e7c)

I will summarize some points about implementing "copying" between levels of the memory hierarchy (that is, data movement) as macro kernels; correct me if I am inaccurate or miss points you have in mind. Since the copy process we are discussing is not atomic, how many concepts does a programmer need in order to describe the correct logic of a copy operation that uses multiple threads to move data of different shapes between different levels of memory, while the transpiler determines the high-performance implementation? Judging from the existing implementations, at least three things must be determined:

1. the shape and layout of the source and destination data tiles;
2. the layout of the threads that cooperate on the copy.
Given the first two, a transpiler can completely infer the data that a single thread is responsible for moving, which is itself a tile with a shape and a layout. For example, a 64x64 tile copied by 16x8 threads leaves each thread responsible for a 4x8 sub-tile; a sketch of this inference follows.
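In CuTe terms, this inference is roughly what `local_partition` computes. The 64x64 tile, the 16x8 thread layout, and the `partition_example` helper below are illustrative assumptions, not code from this repository:

```cpp
#include <cute/tensor.hpp>

// Illustrative only: infer the per-thread sub-tile from (1) the data tile's
// layout and (2) the thread layout, using CuTe's local_partition.
template <typename Element>
__device__ void partition_example(Element* smem) {
    using namespace cute;
    // Thing 1: a 64x64 data tile in shared memory (column-major by default).
    Tensor tile = make_tensor(make_smem_ptr(smem),
                              make_layout(make_shape(Int<64>{}, Int<64>{})));
    // Thing 2: 16x8 cooperating threads.
    auto thr_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));
    // Inferred: this thread's own 4x8 tile, with its shape and layout.
    Tensor thr_tile = local_partition(tile, thr_layout, threadIdx.x);
    (void)thr_tile;  // a real kernel would now issue the per-thread copy
}
```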
After that, the last thing must be given:

3. the temporal dimension, that is, how many copy steps a single thread performs over time.

The temporal dimension is then translated into "loop nests" (if there are multiple temporal dimensions) inside a kernel:

```cpp
// This loop comes from the temporal dimension of the data tile copied by a
// single thread.
for (int i = 0; i < N; ++i) {
    copy(src_tile_ptr, src_layout, dst_tile_ptr, dst_layout, direction,
         thread_layout);
    // Advance src_tile_ptr, dst_tile_ptr, or an iterator?
}
```

Ideally, all data transfers between different levels of the memory hierarchy should use consistent concepts, so that the code emitter does not need too many ad-hoc rules. It may look like this:
```cpp
// @tparam Element: element type of the data being copied
// @param src: pointer to the source data
// @param s_layout: layout of the source data
// @param dst: pointer to the destination data
// @param d_layout: layout of the destination data
// @param t_layout: layout of the threads performing the copy
// @param tid: thread id
// (the direction of the copy could be carried by a further template parameter)
template <typename Element, typename SrcLayout, typename DstLayout,
typename ThreadLayout>
DEVICE void copy_2d_tile(const Element* src, const SrcLayout& s_layout,
Element* dst, const DstLayout& d_layout,
const ThreadLayout& t_layout, int tid) {
// ....
}
```

In principle, I would like to separate the implementation of a macro kernel into two parts: the hardware-dependent configuration and the compute logic.
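For concreteness, here is a minimal sketch of one possible body, assuming the layouts follow CuTe's interface (`size(layout)` for the element count, `layout(i)` for the index-to-offset map) and that `DEVICE` is the repository's device-function qualifier macro; this is an illustration, not the TiledCUDA implementation:

```cpp
#include <cute/layout.hpp>

// A minimal sketch, assuming SrcLayout/DstLayout/ThreadLayout follow CuTe's
// layout interface. Vectorized or asynchronous copies, which a real emitter
// would select from the hardware-dependent configuration, are ignored here.
template <typename Element, typename SrcLayout, typename DstLayout,
          typename ThreadLayout>
DEVICE void copy_2d_tile(const Element* src, const SrcLayout& s_layout,
                         Element* dst, const DstLayout& d_layout,
                         const ThreadLayout& t_layout, int tid) {
    // All size(t_layout) threads stride over the tile's elements together.
    for (int i = tid; i < cute::size(s_layout); i += cute::size(t_layout)) {
        dst[d_layout(i)] = src[s_layout(i)];
    }
}
```

A real emitter would then specialize this loop per direction, for example mapping a global-to-shared copy onto cp.async on Ampere, while keeping the interface above unchanged.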
-
The idea is to view each GPU kernel as scheduling a load, store, and compute pipeline.
The goal is to make it easier to write complex fused kernels by providing a set of higher-level programming concepts or interfaces on top of CuTe's low-level abstractions.
I would like to start a summary and discussion about "copy data tile" as a macro kernel.
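As a hedged illustration of that pipeline view (the `Loader`/`Compute`/`Storer` callables and the `macro_kernel` name below are hypothetical, not an existing TiledCUDA interface), a fused kernel could be scheduled like this:

```cpp
// Hypothetical skeleton of the load/compute/store pipeline view of a GPU
// kernel; the stage callables are illustrative placeholders.
template <typename Loader, typename Compute, typename Storer>
__global__ void macro_kernel(Loader load, Compute compute, Storer store,
                             int num_tiles) {
    for (int k = 0; k < num_tiles; ++k) {
        load(k);          // a "copy data tile" macro kernel: global -> shared
        __syncthreads();  // the copy is not atomic; all threads must finish
        compute(k);       // e.g., a tiled MMA on the staged tile
        __syncthreads();
    }
    store();              // write accumulated results back to global memory
}
```

Under this view, the "copy data tile" macro kernel is exactly the load/store stage of the pipeline, which is why pinning down its interface matters.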