Communicate blocksize constraints to kernels that take blocksize as a runtime argument #1317
Hi @mm04926412, we discussed briefly on the CUDA MODE Discord. Happy to look at any PRs for this, and ideally some data to show the impact would be appreciated! Feel free to reach out with any questions.
@mm04926412 thanks for raising this. Honestly, I'm not sure if there's a deeper reason for this. I'll raise it with Tim this Friday and let you know what he says. Cool that you're looking into our kernels! Feel free to ping us about stuff; we'd be happy to draw in more enthusiasts and continue making BNB even more of a community project.
Very good catch. I think this would be a good contribution. In general, the latency overhead of the dequantization operation is currently the biggest slowdown for these kernels. I think the best solution is to replace the division with a bitshift and add a software constraint, as an assert/error message, that the block size must be a power of 2. I think I wrote these kernels this way to allow for faster prototyping and for debugging performance issues across different GPUs with different block sizes. As such, this is an artifact of prototyping, and replacing the division with a bitshift should be a strict improvement.
Great. I don't have a testing environment for the library or anything to test with, so I can do a PR once I get that set up. For this change, would it be best for me to make a PR, or would you rather I leave it with you? What tests/benchmarks would you like for a PR?
Unfortunately, it looks like I need to reinstall Linux (I'm on an old Linux Mint workstation) in order to properly compile the repo. So unfortunately I'm not in a position to do the PR until I can put aside time to do that and get a proper CMake and CUDA environment. I've provided a suggested form of the solution but sadly can't do much more.
@mm04926412 No problem! Your suggested form is more or less what I was expecting to see (except in my head I was using
Feature request
Looking through the codebase, I've noticed in places like kgemm_4bit_inference_naive that there is integer division by block_size on the GPU, where block_size is a runtime argument rather than a template argument. On the Python front end there is a constraint that the blocksize be a power of 2, but that constraint isn't communicated to the kernel. Integer division that cannot be simplified to a bitshift has poor performance on GPU. These kernels should be rewritten so that the integer divisions can be replaced with bitshifts.
Motivation
Integer division is slow on GPU, but division by a power of two reduces to a bitshift. The kernels can't exploit this because the power-of-two constraint is only enforced on the Python front end.
Your contribution
I'd be happy to submit a PR to resolve this if there isn't some deeper reason why things are written this way.