Any plan to support block size 32? #1329

Open · lllyasviel opened this issue Aug 20, 2024 · 4 comments
Labels: enhancement (New feature or request)

@lllyasviel

Feature request

support block size 32

Motivation

Recently, many models quantize better with block size 32, and many benchmarks are run with block size 32.

Several types of models, like vision models and image generators, are also more sensitive to block size, and a block size of 32 (or even 16) can be better suited for those tasks.

Your contribution

If someone points out where I should look, I can also open a PR. But I am not sure about compiling against different versions.

@matthewdouglas
Member

We haven't had any immediate plans for this, but if it's useful to the community then we can consider it. I'm making the assumption here that we're talking about 4-bit. If we added a naive implementation based on the existing kernels, it probably wouldn't be ideal from an occupancy standpoint, but I think it could work as a first step.

In csrc/ops.cu, we'd be looking at quantizeBlockwise, dequantizeBlockwise, and gemm_4bit_inference_naive. On the Python side, there are assertions around the blocksize in functional.py, in quantize_4bit() and dequantize_4bit().
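
Roughly speaking, the Python-side change is just extending the blocksize whitelist. The sketch below is paraphrased, not the exact current source, and `_check_blocksize` is a hypothetical helper name used only for illustration:

```python
# Paraphrased sketch, not the actual bitsandbytes source: the blocksize
# whitelist used by quantize_4bit()/dequantize_4bit() would need 32 (and
# possibly 16) added once the CUDA kernels accept those sizes.
SUPPORTED_4BIT_BLOCKSIZES = [4096, 2048, 1024, 512, 256, 128, 64, 32]  # 32 is the proposed addition

def _check_blocksize(blocksize: int) -> None:  # hypothetical helper for illustration
    if blocksize not in SUPPORTED_4BIT_BLOCKSIZES:
        raise ValueError(f"Blocksize {blocksize} is not supported; expected one of {SUPPORTED_4BIT_BLOCKSIZES}")
```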

Gentle ping @TimDettmers - any thoughts/concerns?

@lllyasviel
Author

Thanks for the comments.

Recently, image generators like Flux have been entering the large-model era and need low-bit computation.

The influence of block size is much more salient in image models than in LLMs. As a result, many people are currently using slower PyTorch workarounds like the one below to get a smaller block size for those image models:

```python
# Dequantize 4-bit blocks in plain PyTorch: each block stores a float16 scale
# in its first 2 bytes, followed by packed 4-bit values.
d = blocks[:, :2].view(torch.float16)
qs = blocks[:, 2:]
# Unpack the low and high nibbles of each byte, then shift to the signed range [-8, 7].
qs = qs.reshape((n_blocks, -1, 1, block_size // 2)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 1, 2, 1))
qs = (qs & 0x0F).reshape((n_blocks, -1)).to(torch.int8) - 8
```

It would be a great advancement for low-bit image generation models if bnb could support smaller block sizes natively, like 32 (or even 16).
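
For concreteness, this is the kind of call we'd like to be able to make. It is a hypothetical sketch: blocksize=32 is not accepted today, and I'm assuming the existing quantize_4bit/dequantize_4bit signatures stay unchanged.

```python
import torch
import bitsandbytes.functional as F

# Hypothetical usage once small block sizes land (blocksize=32 is NOT supported yet).
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
q, state = F.quantize_4bit(w, blocksize=32, quant_type="nf4")
w_hat = F.dequantize_4bit(q, state)
```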

@matthewdouglas added the enhancement (New feature or request) label on Aug 23, 2024
@TimDettmers
Collaborator

Instead of replying, I quickly tried to implement it, but I failed. Despite this, it might be a good starting point for implementing this. You can find my changes on the small_blocksizes branch: https://github.com/bitsandbytes-foundation/bitsandbytes/tree/small_blocksizes

I tried this before, but there is one main problem with the kernels: they operate at the warp level, which assumes 32 values in total, but for 4-bit there are only 16 values to process when quantized as packed char values. This causes problems. I fixed some of these, but the kernel currently has a bug.
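
To make the mismatch concrete, here is a small PyTorch illustration (not the actual kernel code): with a block size of 32, the 4-bit output packs into only 16 bytes, so a 32-thread warp has just 16 packed chars to store.

```python
import torch

# Illustration only (not the bitsandbytes kernel): a 32-value block quantized
# to 4 bits packs two codes per byte, leaving just 16 bytes to store.
blocksize = 32
codes = torch.randint(0, 16, (blocksize,), dtype=torch.uint8)  # stand-in 4-bit codes
packed = (codes[0::2] << 4) | codes[1::2]                      # high nibble | low nibble
assert packed.numel() == blocksize // 2                        # 16 packed bytes, not 32
```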

The bug is likely related to my change: instead of storing 16 values, I store 32 values. I thought valid_items would ensure that the right number of values is stored, but this does not seem to be the case. As such, the bug is likely in one of these locations:

https://github.com/bitsandbytes-foundation/bitsandbytes/blob/small_blocksizes/csrc/kernels.cu#L738
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/small_blocksizes/csrc/kernels.cu#L825

I will not have time to look into this in more detail, but I hope this draft helps you develop a working PR. I already added tests for these block sizes, which you can run via:

pytest -vsk 32-nested

I think this is an important contribution and it would be awesome if you could work on this PR!

@lllyasviel
Author

Thanks a lot for the draft! I took a look, and I may be wrong, but it seems that all the changes are on the quantization side? Does this mean that once I have an already quantized model, I can just run inference with block size 32 using the existing code?
