In the current implementation, 16 bytes are used as the unit for a single access, allowing the $16 \times 16$ matrix to be tiled in contiguous memory:
DEVICE void copy(const DType* src, DType* dst) {
    // a single memory access loads 16 bytes
    ld_global_st_shared<16>(
        static_cast<uint32_t>(__cvta_generic_to_shared(dst)), src);
}
To maximize memory access coalescing, we need to use 128 bytes as the unit for a single memory access. In this case, we tile the [kTM, kTK] block in shared memory. To meet the requirements of Swizzle<3, 3, 3> and Tensor Core MMA, kTK must be a multiple of 64 and kTM must be a multiple of 16.
Within shared memory, the tiled blocks must be arranged according to Swizzle<3, 3, 3>. When accessing a 2D index (x, y), it must be converted into an intra-tile index (in_tile_x, in_tile_y) and an inter-tile index (tile_x, tile_y) before applying the swizzle mapping.