Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vulkan: mul_mat: fix UB with small warps #952

Merged
merged 1 commit into from
Sep 30, 2024

Conversation

smeso
Copy link
Contributor

@smeso smeso commented Sep 7, 2024

When the device's warp size is less than 16,
it is possible for loadstride_a and loadstride_b to be set to 0.
Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size.

The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.

When they are 0, they cause an infinite loop.
But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.

We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).

When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <[email protected]>
@ggerganov
Copy link
Owner

cc @0cc4m

@0cc4m
Copy link
Collaborator

0cc4m commented Sep 25, 2024

It's true that the shader was written for devices with warp size of 32 or 64. It breaks for smaller values. Does it even output correct results with warp size 16 or is the result still wrong?

I don't think I have a way to test this.

@smeso
Copy link
Contributor Author

smeso commented Sep 29, 2024

When I tested it, it passed all tests (e.g. test-backend-ops). So it seems to work, but I don't know if there is any corner case in which it would return incorrect results. I can add and run more tests if you can think of anything else that is worth trying.

Copy link
Collaborator

@0cc4m 0cc4m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smeso No, your confirmation that it works is enough. Looks good, this won't affect other devices.

@ggerganov ggerganov merged commit d57505a into ggerganov:master Sep 30, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants