
Refactor GMEM to SMEM Loader/Storer with SwizzleLayout #43

Open
KuangjuX opened this issue Jan 18, 2025 · 0 comments
Assignees: KuangjuX
Labels: enhancement (New feature or request)

Comments

KuangjuX (Collaborator) commented Jan 18, 2025

In the current implementation, 16 bytes are used as the unit for a single access, allowing the $16 \times 16$ matrix to be tiled in contiguous memory:

    DEVICE void copy(const DType* src, DType* dst) {
        // A single memory access moves 16 bytes.
        ld_global_st_shared<16>(
            static_cast<uint32_t>(__cvta_generic_to_shared(dst)), src);
    }

To maximize memory-access coalescing, we instead need 128 bytes as the unit for a single memory access. In this case, we tile the [kTM, kTK] block in shared memory. To satisfy the requirements of Swizzle<3, 3, 3> and Tensor Core MMA, kTK must be a multiple of 64 and kTM a multiple of 16.

Within shared memory, the blocks must be tiled according to Swizzle<3, 3, 3>. When accessing a 2D index (x, y), the index must first be decomposed into an intra-tile index (in_tile_x, in_tile_y) and an inter-tile index (tile_x, tile_y) before the swizzle mapping is applied to the intra-tile part.

@KuangjuX KuangjuX added the enhancement New feature or request label Jan 18, 2025
@KuangjuX KuangjuX self-assigned this Jan 18, 2025