vulkan: mul_mat: fix UB with small warps #952

smeso · 2024-09-07T09:46:03Z

When the device's warp size is less than 16, loadstride_a and loadstride_b can end up set to 0.
They are calculated as the workgroup size multiplied by LOAD_VEC_* (which can be 1) and divided by 16, using integer division, and the workgroup size is set to the warp/subgroup size.
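
The arithmetic can be checked with a small host-side sketch. The helper `load_stride` and the constant `kDivisor` are illustrative names, not identifiers from the shader; only the formula (workgroup size times LOAD_VEC_*, divided by 16) comes from the description above.

```cpp
#include <cstdint>

// Hypothetical constant standing in for the "divided by 16" in the description.
constexpr std::uint32_t kDivisor = 16;

// Hypothetical helper mirroring the described formula with unsigned
// integer division, which truncates toward zero.
constexpr std::uint32_t load_stride(std::uint32_t workgroup_size,
                                    std::uint32_t load_vec) {
    return workgroup_size * load_vec / kDivisor;
}

// Warp size 32 with LOAD_VEC_* == 1: 32 * 1 / 16 == 2.
static_assert(load_stride(32, 1) == 2, "stride for a 32-wide warp");
// Warp size 8 with LOAD_VEC_* == 1: 8 * 1 / 16 truncates to 0.
static_assert(load_stride(8, 1) == 0, "stride collapses to 0 below 16");
```

With LOAD_VEC_* equal to 1, any workgroup size below 16 yields a stride of 0.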

The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.

When they are 0, the loop counters never advance, so the loops would run forever.
But infinite loops without side effects are UB, and the values of loadstride_* are known at compile time, so the compiler quietly optimizes all the loops away.
As a consequence, the buffers are never populated and the multiplication result is a matrix with every element set to 0.
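
A CPU-side sketch of a strided fill loop shows why a zero stride is fatal. This is not the shader code: `strided_fill_steps` and its `max_steps` guard are hypothetical, and the guard exists only so the sketch terminates where an unguarded loop would not.

```cpp
#include <cstdint>

// Counts the iterations a strided loop needs to cover `rows` rows.
// Returns max_steps if the loop has not finished by then, which is what
// happens when stride == 0 and rows > 0: the index never advances.
inline std::uint32_t strided_fill_steps(std::uint32_t rows,
                                        std::uint32_t stride,
                                        std::uint32_t max_steps) {
    std::uint32_t steps = 0;
    for (std::uint32_t l = 0; l < rows && steps < max_steps; l += stride) {
        ++steps;  // stands in for one load into the shared buffer
    }
    return steps;
}
```

For example, `strided_fill_steps(64, 2, 1000)` finishes after 32 iterations, while `strided_fill_steps(64, 0, 1000)` only stops because of the guard. The loops described above have no such guard, which is why the compiler is free to assume they are never reached and drop them.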

We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).
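
The idea behind the fix can be sketched as a simple clamp. The names `kMinWorkgroupSize`, `clamped_workgroup_size` and `load_stride_after_fix` are assumptions made for illustration; the actual change in the Vulkan backend may be structured differently.

```cpp
#include <cstdint>

// Assumed name for the lower bound of 16 described above.
constexpr std::uint32_t kMinWorkgroupSize = 16;

// Hypothetical helper: never let the workgroup size drop below 16,
// even when the device reports a smaller warp/subgroup size.
constexpr std::uint32_t clamped_workgroup_size(std::uint32_t subgroup_size) {
    return subgroup_size < kMinWorkgroupSize ? kMinWorkgroupSize
                                             : subgroup_size;
}

// Stride recomputed from the clamped size, using the formula from the
// description (workgroup size * LOAD_VEC_* / 16).
constexpr std::uint32_t load_stride_after_fix(std::uint32_t subgroup_size,
                                              std::uint32_t load_vec) {
    return clamped_workgroup_size(subgroup_size) * load_vec / 16;
}

// A warp size of 8 now yields a stride of 1 instead of 0.
static_assert(load_stride_after_fix(8, 1) == 1, "stride is non-zero");
```

Subgroup sizes of 16 or more pass through the clamp unchanged, so their strides stay the same.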

loadstride_a is defined at mul_mm.comp:114 and loadstride_b at mul_mm.comp:115.

Signed-off-by: Salvatore Mesoraca <[email protected]>

ggerganov · 2024-09-23T16:26:26Z

cc @0cc4m

0cc4m · 2024-09-25T19:49:39Z

It's true that the shader was written for devices with warp size of 32 or 64. It breaks for smaller values. Does it even output correct results with warp size 16 or is the result still wrong?

I don't think I have a way to test this.

smeso · 2024-09-29T17:25:25Z

When I tested it, it passed all tests (e.g. test-backend-ops). So it seems to work, but I don't know if there is any corner case in which it would return incorrect results. I can add and run more tests if you can think of anything else that is worth trying.

0cc4m

@smeso No, your confirmation that it works is enough. Looks good, this won't affect other devices.

smeso force-pushed the smeso/vulkan/mulmm branch from 9a2a139 to 5147f74 on September 7, 2024 09:46

0cc4m approved these changes Sep 30, 2024

ggerganov merged commit d57505a into ggerganov:master Sep 30, 2024
4 checks passed
