This issue discusses the arrangement of the Swizzle layout in SMEM, as well as the process of copying from GMEM to SMEM and from SMEM to RMEM.
Load
First, to efficiently load data from GMEM (global memory) into SMEM (shared memory), memory access coalescing must be considered. Taking half precision as an example, a 4 × 64 tile per warp maximizes memory access efficiency: 4 × 64 = 256 half elements, i.e., 32 threads each issuing one 128-bit (8-element) load.
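The sketch below shows one way such a copy could be expressed with CuTe; the choice of copy atom (plain 128-bit loads here; `cp.async` would be the SM80+ alternative) and the 4 × 8 thread / 1 × 8 value layouts are assumptions made for illustration, not taken from this issue.

```cpp
// Hypothetical CuTe sketch of the 4 x 64 coalesced GMEM -> SMEM copy:
// 32 threads arranged 4 x 8, each moving 8 contiguous half elements (128 bits),
// so one issue of the tiled copy covers a 4 x 64 half-precision tile per warp.
#include <cute/tensor.hpp>

using namespace cute;

using GmemCopyAtom  = Copy_Atom<UniversalCopy<uint128_t>, half_t>;  // 128-bit vectorized loads
using GmemTiledCopy = decltype(make_tiled_copy(
    GmemCopyAtom{},
    Layout<Shape<_4, _8>, Stride<_8, _1>>{},   // 4 x 8 threads, row-major
    Layout<Shape<_1, _8>>{}));                 // each thread copies a 1 x 8 half fragment
```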
However, during the process of loading data from SMEM to RMEM, Tensor Core requires the input of a 16 × 16 matrix. As a result, it is necessary to perform a reshape operation on the SMEM data in this stage. Determining an appropriate Swizzle Layout requires a comprehensive consideration of both of these loading processes.
`Swizzle<3,3,3>` appears to be a favorable choice. In `Swizzle<B, M, S>`, all three parameters are base-2 exponents: 2^B is the number of rows within a swizzle block, 2^M is the width of the mask (the number of consecutive elements whose internal order is preserved), and 2^S is the number of mask tiles in a single row. Therefore, `Swizzle<3, 3, 3>` means a swizzle block of 8 rows, each row containing 8 × 8 = 64 elements, with the order inside every group of 8 elements left unchanged.
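In CuTe terms, such a layout can be built by composing the swizzle functor with a row-major layout; the sketch below assumes the 8 × 64 half-precision swizzle block described above and repeats it to fill a 32 × 64 SMEM tile.

```cpp
// Hypothetical sketch: one Swizzle<3,3,3> block over a row-major 8 x 64 half layout.
// The swizzle XORs the three row bits into the bits selecting the 8-element group
// within a row, which is what permutes the 8-element groups inside each row.
#include <cute/tensor.hpp>

using namespace cute;

using SmemBlock = decltype(composition(
    Swizzle<3, 3, 3>{},
    Layout<Shape<_8, _64>, Stride<_64, _1>>{}));

// Repeat the 8 x 64 block four times along the row dimension for the 32 x 64 tile.
using SmemLayout = decltype(tile_to_shape(SmemBlock{}, Shape<_32, _64>{}));
```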
The data covered by one `Swizzle<3,3,3>` layout is obtained by a warp performing the vectorized load described above twice along the row dimension (2 × 4 × 64 = 512 half elements), resulting in an 8 × 64 half-precision tile.
The diagram above illustrates a 32 × 64 half-precision shared-memory tile without swizzling, where bank conflicts occur when loading data from SMEM to RMEM, for example between T0 and T1.
After applying `Swizzle<3, 3, 3>`, the physical memory locations accessed by each thread are remapped, as shown in the figure above. The accesses of T0-T7 and T16-T23 are now distributed across different banks, enabling parallel access without bank conflicts. Since bank conflicts can only occur within a single transaction, we only need to check the four transactions T0-T7, T8-T15, T16-T23, and T24-T31; in this scenario, none of them produces a bank conflict.
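As a sanity check, this mapping can be reproduced on the host. The snippet below is a sketch under the assumption that each 8-thread phase reads eight different rows of a 16 × 64 half tile at one 16-byte column group, as `ldmatrix.x4` does; this is not necessarily the exact thread-to-address mapping drawn in the figures.

```cpp
// Host-side sketch of the bank check described above.
#include <cstdio>
#include <set>

// Swizzle<3,3,3> on element offsets of a row-major tile with 64 half elements per row:
// XOR the three row bits (bits 6-8) into the three column-group bits (bits 3-5).
int swizzle333(int offset) { return offset ^ ((offset & (7 << 6)) >> 3); }

int main() {
  for (int phase = 0; phase < 4; ++phase) {          // T0-T7, T8-T15, T16-T23, T24-T31
    std::set<int> banks;
    for (int i = 0; i < 8; ++i) {
      int row = (phase & 1) * 8 + i;                 // which 8 rows this phase reads (assumed)
      int col = (phase >> 1) * 8;                    // 16-byte column group (assumed)
      int byte = swizzle333(row * 64 + col) * 2;     // 2 bytes per half element
      banks.insert((byte / 4) % 32);                 // bank of the first 4-byte word;
    }                                                // each 16B access spans 4 consecutive banks
    std::printf("phase %d: %zu distinct banks out of 8 accesses\n", phase, banks.size());
  }
}
```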
To execute a transfer from SMEM to RMEM using `Swizzle<3, 3, 3>`, it is necessary to construct at least a 16 × 64 half-precision tile, which means a warp must load data four times along the row dimension.
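For reference, the SMEM-to-RMEM side could be wired up in CuTe roughly as below; the specific atoms (ldmatrix.x4 feeding a half-precision m16n8k16 Tensor Core MMA) and the single-warp tiling are assumptions, since the issue does not state which MMA is used.

```cpp
// Hypothetical sketch of the SMEM -> RMEM copy setup in CuTe.
#include <cute/tensor.hpp>

using namespace cute;

using TiledMma = decltype(make_tiled_mma(
    SM80_16x8x16_F16F16F16F16_TN{},
    Layout<Shape<_1, _1, _1>>{}));               // one warp

// ldmatrix.x4: each of the 32 threads supplies the address of one 8x8 sub-tile row.
using SmemCopyAtomA = Copy_Atom<SM75_U32x4_LDSM_N, half_t>;

// Inside a kernel, the per-thread SMEM -> RMEM copy would then be built as:
//   auto smem_tiled_copy_A = make_tiled_copy_A(SmemCopyAtomA{}, TiledMma{});
```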
To assess whether `Swizzle<2, 3, 3>` can also avoid bank conflicts, I ran a verification, which showed that it results in a 4-way bank conflict.