
Refactor GMEM to SMEM Loader/Storer with SwizzleLayout #43

Open
KuangjuX opened this issue Jan 18, 2025 · 0 comments
Assignees: KuangjuX
Labels: enhancement (New feature or request)

Comments

KuangjuX (Collaborator) commented Jan 18, 2025

In the current implementation, 16 bytes are used as the unit for a single access, allowing the $16 \times 16$ matrix to be tiled in contiguous memory:

    DEVICE void copy(const DType* src, DType* dst) {
        // A single memory access moves 16 bytes.
        ld_global_st_shared<16>(
            static_cast<uint32_t>(__cvta_generic_to_shared(dst)), src);
    }

To maximize memory-access coalescing, we instead need 128 bytes as the unit for a single memory access. In this case, we tile the [kTM, kTK] block in shared memory. To satisfy the requirements of Swizzle<3, 3, 3> and Tensor Core MMA, kTK must be a multiple of 64 and kTM a multiple of 16.

Within shared memory, the blocks must be tiled according to Swizzle<3, 3, 3>. When accessing a 2D index (x, y), the index must first be decomposed into an intra-tile index (in_tile_x, in_tile_y) and an inter-tile index (tile_x, tile_y) before the swizzle mapping is applied to the intra-tile part.

@KuangjuX KuangjuX added the enhancement New feature or request label Jan 18, 2025
@KuangjuX KuangjuX self-assigned this Jan 18, 2025