Swizzle Layout in Shared Memory. #38

Open
KuangjuX opened this issue Jan 15, 2025 · 0 comments
KuangjuX commented Jan 15, 2025

This issue aims to discuss the arrangement of Swizzle Layout in SMEM, as well as the process of copying from GMEM to SMEM and from SMEM to RMEM.

Load

Firstly, to efficiently load data from GMEM (Global Memory) into SMEM (Shared Memory), it is essential to consider memory access coalescing. Taking half precision as an example, a tile shape of 4 × 64 maximizes memory access efficiency: each of a warp's 32 threads issues one 128-bit vectorized load of 8 half elements, and together they cover 4 rows of 64 elements in a single coalesced access.
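The 4 × 64 shape follows directly from the warp's vectorized-load width; a quick sanity check in plain arithmetic (the constant names are illustrative, not from any library):

```python
# Each thread of a 32-thread warp issues one 128-bit vectorized load;
# with 2-byte half elements, one warp access covers a 4 x 64 tile.
BYTES_PER_THREAD = 16                      # one 128-bit load per thread
HALFS_PER_THREAD = BYTES_PER_THREAD // 2   # 8 half elements per thread
WARP_HALFS = 32 * HALFS_PER_THREAD         # 256 halfs per warp access
ROWS = WARP_HALFS // 64                    # rows covered at width 64
print(ROWS, "x", 64)                       # -> 4 x 64
```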

However, when loading data from SMEM to RMEM, the Tensor Core requires 16 × 16 matrix inputs, so the SMEM data must effectively be reshaped at this stage. Determining an appropriate Swizzle Layout therefore requires considering both of these loading processes together.

Swizzle<3,3,3> appears to be a favorable choice. In Swizzle<B, M, S>, the three parameters are log2 quantities: 2^B is the number of rows within a swizzle block, 2^M is the width of the mask (the data order within a mask remains unchanged), and 2^S is the number of mask tiles in a single row. Therefore, Swizzle<3, 3, 3> means a swizzle block has 8 rows, each row contains 8 × 8 = 64 elements, and every group of 8 contiguous elements keeps its internal order.
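As a sketch, a CuTe-style Swizzle<B, M, S> can be modeled as an XOR on a linear element offset: the B row bits (starting at bit M + S) are folded into the B mask-tile bits (starting at bit M). The function below is illustrative, not the library's actual API:

```python
def swizzle(offset: int, B: int = 3, M: int = 3, S: int = 3) -> int:
    """Fold the B row-index bits into the B mask-tile bits via XOR;
    the low M bits (position inside a mask) are left untouched."""
    row_bits = ((1 << B) - 1) << (M + S)        # bits selecting the row
    return offset ^ ((offset & row_bits) >> S)  # XOR rows into tile bits

# 8 x 64 half tile, offset = row * 64 + col: in row r, the 8-element
# group g is relocated to group g ^ r; row 0 stays unchanged.
for r in range(4):
    groups = [swizzle(r * 64 + g * 8) % 64 // 8 for g in range(8)]
    print(r, groups)
```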

An 8 × 64 half-precision tile, the amount of data covered by one Swizzle<3, 3, 3> block, is filled by a warp performing two vectorized loads along the row dimension, each load covering a 4 × 64 sub-tile.

[Figure: a 32 × 64 half-precision shared memory tile without Swizzle]

The above diagram illustrates a 32 × 64 half-precision shared memory tile without Swizzle, where bank conflicts occur when loading data from SMEM to RMEM, for example between T0 and T1.

[Figure: thread accesses remapped by Swizzle<3, 3, 3>]

After applying Swizzle<3, 3, 3>, the threads remap the physical memory indices they access, as shown in the figure above. It can be observed that the memory accesses of T0-T7 and T16-T23 are distributed across different banks, enabling parallel access without bank conflicts. Since bank conflicts only occur within a single transaction, we only need to check the four transactions T0-T7, T8-T15, T16-T23, and T24-T31 for conflicts; in this scenario, none arise.
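To make the conflict-free claim concrete, here is a small simulation under an assumed access pattern: in each 8-thread transaction, thread t reads 16 contiguous bytes (8 halfs) starting at row t, column 0 of the 64-half-wide tile. The pattern and helper names are illustrative assumptions, not the project's actual code:

```python
def swizzle(offset: int, B: int = 3, M: int = 3, S: int = 3) -> int:
    # XOR the B row bits (at bit M + S) into the B mask-tile bits (at bit M).
    row_bits = ((1 << B) - 1) << (M + S)
    return offset ^ ((offset & row_bits) >> S)

def worst_conflict(use_swizzle: bool) -> int:
    """Worst-case number of distinct threads hitting one bank within a
    single 8-thread transaction (1 means conflict-free)."""
    banks = {}
    for t in range(8):                  # one transaction: threads T0-T7
        for c in range(8):              # 8 halfs per 16-byte access
            off = t * 64 + c            # element offset (row t, col c)
            if use_swizzle:
                off = swizzle(off)
            bank = (off * 2 // 4) % 32  # 2 bytes/half, 4-byte-wide banks
            banks.setdefault(bank, set()).add(t)
    return max(len(ts) for ts in banks.values())

print(worst_conflict(False))  # 8: all eight threads pile onto banks 0-3
print(worst_conflict(True))   # 1: Swizzle<3, 3, 3> spreads the accesses
```

Under this model, each thread's 16-byte access lands on a distinct 4-bank group after swizzling, matching the conflict-free behavior described above.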

To execute a transfer from SMEM to RMEM using Swizzle<3, 3, 3>, it is necessary to construct at least a 16 × 64 half-precision tile, which means a warp must load data four times along the row dimension.

To assess whether Swizzle<2, 3, 3> can also avoid bank conflicts, I ran a verification; it turns out to cause a 4-way bank conflict.

@KuangjuX KuangjuX added the discussion (Something needs to be discussed) and enhancement (New feature or request) labels on Jan 15, 2025
@KuangjuX KuangjuX self-assigned this Jan 18, 2025