Communicate blocksize constraints to kernels that take blocksize as a runtime argument #1317
Hi @mm04926412, we discussed briefly on the CUDA MODE Discord. Happy to look at any PRs for this, and ideally some data to show the impact would be appreciated! Feel free to reach out with any questions.
@mm04926412 thanks for raising this. Honestly, I'm not sure if there's a deeper reason for this. I'll raise it with Tim this Friday and let you know what he says. Cool that you're looking into our kernels! Feel free to ping us about stuff; we'd be happy to draw in more enthusiasts and continue making BNB even more of a community project.
Very good catch. I think this would be a good contribution. In general, the latency overhead of the dequantization operation is currently the biggest slowdown for these kernels. I think the best solution is to replace the division with a bitshift and add a software constraint, as an assert/error message, that the block size must be a power of 2. I think I wrote these kernels this way to allow for faster prototyping and for debugging performance issues across different GPUs with different block sizes. As such, this is an artifact of prototyping, and replacing the division with a bitshift should be a strict improvement.
Great. I don't have a testing environment for the library or anything to test with, so I can do a PR once I get that set up. For this change, would it be best for me to make a PR, or would you rather I leave it with you? What tests/benchmarks would you like for a PR?
Unfortunately, it looks like I need to reinstall Linux (I'm on an old Linux Mint workstation) in order to properly compile the repo. So unfortunately I'm not in a position to do the PR until I can put aside time to do that and get a proper CMake and CUDA environment. I've provided a suggested form of the solution but sadly can't do much more.
@mm04926412 No problem! Your suggested form is more or less what I was expecting to see (except in my head I was using
Feature request
Looking through the codebase, I've noticed in places like kgemm_4bit_inference_naive that there is integer division by block_size on the GPU, where block_size is a runtime argument rather than a template argument. On the Python front end there is a constraint that the blocksize be a power of 2, but that constraint isn't communicated to the kernel. Integer division that cannot be simplified to a bitshift has poor performance on GPU. These kernels should be rewritten so that the integer divisions can be replaced with bitshifts.
Motivation
Integer division is slow on GPU, but division by a power of two reduces to a bitshift. The kernels can't exploit this because the power-of-two constraint is only enforced on the Python front end.
Your contribution
I'd be happy to submit a PR to resolve this if there isn't some deeper reason why things are written this way.