This issue discusses the arrangement of the Swizzle layout in SMEM, as well as the process of copying from GMEM to SMEM and from SMEM to RMEM.
Load
First, to efficiently load data from GMEM (global memory) into SMEM (shared memory), memory access coalescing must be considered. Taking half precision as an example, a 4 × 64 tile per warp maximizes memory access efficiency: 4 × 64 = 256 half elements, i.e., 32 threads each issuing one 128-bit (8-element) load.
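The sketch below shows one way such a copy could be expressed with CuTe; the choice of copy atom (plain 128-bit loads here; `cp.async` would be the SM80+ alternative) and the 4 × 8 thread / 1 × 8 value layouts are assumptions made for illustration, not taken from this issue.

```cpp
// Hypothetical CuTe sketch of the 4 x 64 coalesced GMEM -> SMEM copy:
// 32 threads arranged 4 x 8, each moving 8 contiguous half elements (128 bits),
// so one issue of the tiled copy covers a 4 x 64 half-precision tile per warp.
#include <cute/tensor.hpp>

using namespace cute;

using GmemCopyAtom  = Copy_Atom<UniversalCopy<uint128_t>, half_t>;  // 128-bit vectorized loads
using GmemTiledCopy = decltype(make_tiled_copy(
    GmemCopyAtom{},
    Layout<Shape<_4, _8>, Stride<_8, _1>>{},   // 4 x 8 threads, row-major
    Layout<Shape<_1, _8>>{}));                 // each thread copies a 1 x 8 half fragment
```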
However, during the process of loading data from SMEM to RMEM, Tensor Core requires the input of a 16 × 16 matrix. As a result, it is necessary to perform a reshape operation on the SMEM data in this stage. Determining an appropriate Swizzle Layout requires a comprehensive consideration of both of these loading processes.
`Swizzle<3,3,3>` appears to be a favorable choice. In `Swizzle<B, M, S>`, all three parameters are base-2 exponents: 2^B is the number of rows within a swizzle block, 2^M is the width of the mask (the number of consecutive elements whose internal order is preserved), and 2^S is the number of mask tiles in a single row. Therefore, `Swizzle<3, 3, 3>` means a swizzle block of 8 rows, each row containing 8 × 8 = 64 elements, with the order inside every group of 8 elements left unchanged.
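In CuTe terms, such a layout can be built by composing the swizzle functor with a row-major layout; the sketch below assumes the 8 × 64 half-precision swizzle block described above and repeats it to fill a 32 × 64 SMEM tile.

```cpp
// Hypothetical sketch: one Swizzle<3,3,3> block over a row-major 8 x 64 half layout.
// The swizzle XORs the three row bits into the bits selecting the 8-element group
// within a row, which is what permutes the 8-element groups inside each row.
#include <cute/tensor.hpp>

using namespace cute;

using SmemBlock = decltype(composition(
    Swizzle<3, 3, 3>{},
    Layout<Shape<_8, _64>, Stride<_64, _1>>{}));

// Repeat the 8 x 64 block four times along the row dimension for the 32 x 64 tile.
using SmemLayout = decltype(tile_to_shape(SmemBlock{}, Shape<_32, _64>{}));
```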
The data covered by one `Swizzle<3,3,3>` layout is obtained by a warp performing the vectorized load described above twice along the row dimension (2 × 4 × 64 = 512 half elements), resulting in an 8 × 64 half-precision tile.
The diagram above illustrates a 32 × 64 half-precision shared-memory tile without swizzling, where bank conflicts occur when loading data from SMEM to RMEM, for example between T0 and T1.
After applying `Swizzle<3, 3, 3>`, the physical memory locations accessed by each thread are remapped, as shown in the figure above. The accesses of T0-T7 and T16-T23 are now distributed across different banks, enabling parallel access without bank conflicts. Since bank conflicts can only occur within a single transaction, we only need to check the four transactions T0-T7, T8-T15, T16-T23, and T24-T31; in this scenario, none of them produces a bank conflict.
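As a sanity check, this mapping can be reproduced on the host. The snippet below is a sketch under the assumption that each 8-thread phase reads eight different rows of a 16 × 64 half tile at one 16-byte column group, as `ldmatrix.x4` does; this is not necessarily the exact thread-to-address mapping drawn in the figures.

```cpp
// Host-side sketch of the bank check described above.
#include <cstdio>
#include <set>

// Swizzle<3,3,3> on element offsets of a row-major tile with 64 half elements per row:
// XOR the three row bits (bits 6-8) into the three column-group bits (bits 3-5).
int swizzle333(int offset) { return offset ^ ((offset & (7 << 6)) >> 3); }

int main() {
  for (int phase = 0; phase < 4; ++phase) {          // T0-T7, T8-T15, T16-T23, T24-T31
    std::set<int> banks;
    for (int i = 0; i < 8; ++i) {
      int row = (phase & 1) * 8 + i;                 // which 8 rows this phase reads (assumed)
      int col = (phase >> 1) * 8;                    // 16-byte column group (assumed)
      int byte = swizzle333(row * 64 + col) * 2;     // 2 bytes per half element
      banks.insert((byte / 4) % 32);                 // bank of the first 4-byte word;
    }                                                // each 16B access spans 4 consecutive banks
    std::printf("phase %d: %zu distinct banks out of 8 accesses\n", phase, banks.size());
  }
}
```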
To execute a transfer from SMEM to RMEM using `Swizzle<3, 3, 3>`, it is necessary to construct at least a 16 × 64 half-precision tile, which means a warp must load data four times along the row dimension.
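For reference, the SMEM-to-RMEM side could be wired up in CuTe roughly as below; the specific atoms (ldmatrix.x4 feeding a half-precision m16n8k16 Tensor Core MMA) and the single-warp tiling are assumptions, since the issue does not state which MMA is used.

```cpp
// Hypothetical sketch of the SMEM -> RMEM copy setup in CuTe.
#include <cute/tensor.hpp>

using namespace cute;

using TiledMma = decltype(make_tiled_mma(
    SM80_16x8x16_F16F16F16F16_TN{},
    Layout<Shape<_1, _1, _1>>{}));               // one warp

// ldmatrix.x4: each of the 32 threads supplies the address of one 8x8 sub-tile row.
using SmemCopyAtomA = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;

// Inside a kernel, the per-thread SMEM -> RMEM copy would then be built as:
//   auto smem_tiled_copy_A = make_tiled_copy_A(SmemCopyAtomA{}, TiledMma{});
```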
To assess whether `Swizzle<2, 3, 3>` can also avoid bank conflicts, I ran a verification, which showed that it results in a 4-way bank conflict.