-
I'm trying to understand the problem raised in issue #5. The critical constraints on a code emitter need to be spelled out for me; otherwise I lose the important criteria for deciding what a reasonable structure for organizing the kernel implementations looks like. Looking at the implementation we have on hand (below), what trouble does it cause for creating a clean code emitter? I would also like to make the macro kernel implementation fit the code emitter.

(TiledCUDA/src/kernels/cute_gemm.cu, lines 64 to 70 at 8205e7c)

I will summarize some points about implementing "copying" between levels of the memory hierarchy (that is, data movement) as macro kernels; correct me if I am inaccurate or miss points you have in mind. Since the copy process we are discussing is not atomic, how many concepts does a programmer need in order to describe the correct logic of a copy operation that uses multiple threads to move data of different shapes between different levels of memory, while the transpiler determines the high-performance implementation? Judging from the existing implementations, at least three things must be determined:

1. the shape and layout of the source and destination data tiles;
2. the layout of the threads that cooperate on the copy.
Given the first two, a transpiler can completely infer the data that a single thread is responsible for moving, which is itself a tile with a shape and a layout. For example, a 64x64 tile copied by 16x8 threads leaves each thread responsible for a 4x8 sub-tile; a sketch of this inference follows.
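In CuTe terms, this inference is roughly what `local_partition` computes. The 64x64 tile, the 16x8 thread layout, and the `partition_example` helper below are illustrative assumptions, not code from this repository:

```cpp
#include <cute/tensor.hpp>

// Illustrative only: infer the per-thread sub-tile from (1) the data tile's
// layout and (2) the thread layout, using CuTe's local_partition.
template <typename Element>
__device__ void partition_example(Element* smem) {
    using namespace cute;
    // Thing 1: a 64x64 data tile in shared memory (column-major by default).
    Tensor tile = make_tensor(make_smem_ptr(smem),
                              make_layout(make_shape(Int<64>{}, Int<64>{})));
    // Thing 2: 16x8 cooperating threads.
    auto thr_layout = make_layout(make_shape(Int<16>{}, Int<8>{}));
    // Inferred: this thread's own 4x8 tile, with its shape and layout.
    Tensor thr_tile = local_partition(tile, thr_layout, threadIdx.x);
    (void)thr_tile;  // a real kernel would now issue the per-thread copy
}
```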
After that, the last thing must be given:

3. the temporal dimension, that is, how many copy steps a single thread performs over time.

The temporal dimension is then translated into "loop nests" (if there are multiple temporal dimensions) inside a kernel:

```cpp
// This loop comes from the temporal dimension of the data tile copied by a
// single thread.
for (int i = 0; i < N; ++i) {
    copy(src_tile_ptr, src_layout, dst_tile_ptr, dst_layout, direction,
         thread_layout);
    // Advance src_tile_ptr, dst_tile_ptr, or an iterator?
}
```

Ideally, all data transfers between different levels of the memory hierarchy should use consistent concepts, so that the code emitter does not need too many ad-hoc rules. It may look like this:
```cpp
// @tparam Element: element type of the data being copied
// @param src: pointer to the source data
// @param s_layout: layout of the source data
// @param dst: pointer to the destination data
// @param d_layout: layout of the destination data
// @param t_layout: layout of the threads performing the copy
// @param tid: thread id
// (the direction of the copy could be carried by a further template parameter)
template <typename Element, typename SrcLayout, typename DstLayout,
typename ThreadLayout>
DEVICE void copy_2d_tile(const Element* src, const SrcLayout& s_layout,
Element* dst, const DstLayout& d_layout,
const ThreadLayout& t_layout, int tid) {
// ....
}
```

In principle, I would like to separate the implementation of a macro kernel into two parts: the hardware-dependent configuration and the compute logic.
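For concreteness, here is a minimal sketch of one possible body, assuming the layouts follow CuTe's interface (`size(layout)` for the element count, `layout(i)` for the index-to-offset map) and that `DEVICE` is the repository's device-function qualifier macro; this is an illustration, not the TiledCUDA implementation:

```cpp
#include <cute/layout.hpp>

// A minimal sketch, assuming SrcLayout/DstLayout/ThreadLayout follow CuTe's
// layout interface. Vectorized or asynchronous copies, which a real emitter
// would select from the hardware-dependent configuration, are ignored here.
template <typename Element, typename SrcLayout, typename DstLayout,
          typename ThreadLayout>
DEVICE void copy_2d_tile(const Element* src, const SrcLayout& s_layout,
                         Element* dst, const DstLayout& d_layout,
                         const ThreadLayout& t_layout, int tid) {
    // All size(t_layout) threads stride over the tile's elements together.
    for (int i = tid; i < cute::size(s_layout); i += cute::size(t_layout)) {
        dst[d_layout(i)] = src[s_layout(i)];
    }
}
```

A real emitter would then specialize this loop per direction, for example mapping a global-to-shared copy onto cp.async on Ampere, while keeping the interface above unchanged.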
-
The idea is to view each GPU kernel as scheduling a load, store, and compute pipeline.
The goal is to make it easier to write complex fused kernels by providing a set of higher-level programming concepts or interfaces on top of CuTe's low-level abstractions.
I would like to start a summary and discussion about "copy data tile" as a macro kernel.
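As a hedged illustration of that pipeline view (the `Loader`/`Compute`/`Storer` callables and the `macro_kernel` name below are hypothetical, not an existing TiledCUDA interface), a fused kernel could be scheduled like this:

```cpp
// Hypothetical skeleton of the load/compute/store pipeline view of a GPU
// kernel; the stage callables are illustrative placeholders.
template <typename Loader, typename Compute, typename Storer>
__global__ void macro_kernel(Loader load, Compute compute, Storer store,
                             int num_tiles) {
    for (int k = 0; k < num_tiles; ++k) {
        load(k);          // a "copy data tile" macro kernel: global -> shared
        __syncthreads();  // the copy is not atomic; all threads must finish
        compute(k);       // e.g., a tiled MMA on the staged tile
        __syncthreads();
    }
    store();              // write accumulated results back to global memory
}
```

Under this view, the "copy data tile" macro kernel is exactly the load/store stage of the pipeline, which is why pinning down its interface matters.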