
[Quantization] Add quantization support for bitsandbytes #9213

Merged: 119 commits from quantization-config into main on Oct 21, 2024

Conversation

@sayakpaul (Member) commented Aug 19, 2024

What does this PR do?

Come back later.

  • Quantization config class (base and bitsandbytes)
  • Quantizer class (base and bitsandbytes)
  • Utilities related to bitsandbytes
  • from_pretrained() at the ModelMixin level and related changes
  • save_pretrained()
  • NF4 tests
  • INT8 (llm.int8()) tests
  • Docs

Notes

  • Even though I alluded to having a separate QuantizationLoaderMixin in [Quantization] bring quantization to diffusers core #9174, I realized that it is not an approach we can take, because loading and saving a quantized model is baked deeply into the arguments of ModelMixin.save_pretrained() and ModelMixin.from_pretrained(); the two are tightly entangled.
  • For the initial quantization support, I think it's okay to not allow passing device_map, because for a pipeline, multiple device_maps can get ugly. This will be dealt with in a follow-up PR by @SunMarc and myself.
  • For the point above, for checkpoints that are found to be sharded (Flux, for example), I have decided to merge them on CPU to simplify the implementation. This will be dealt with in a follow-up PR by @SunMarc.
  • The PR has an extensive testing suite covering training, too. However, I have decided not to add it to our CI yet. We should first let this feature flow into the community and then add the tests to our nightly CI.

No-frills code snippets

Serialization
import torch 
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from accelerate.utils import compute_module_sizes

model_id = "black-forest-labs/FLUX.1-dev"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer", quantization_config=nf4_config, torch_dtype=torch.bfloat16
)
assert model_nf4.dtype == torch.uint8, model_nf4.dtype
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)
print(compute_module_sizes(model_nf4)[""] / 1024 / 1024)

push_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4.push_to_hub(push_id)

Serialized checkpoint: https://huggingface.co/sayakpaul/flux.1-dev-nf4-with-bnb-integration.

NF4 checkpoints of Flux transformer and T5: https://huggingface.co/sayakpaul/flux.1-dev-nf4-pkg (has Colab Notebooks, too).

Inference
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc (Member) left a comment:

Thanks for adding this! I see that you used a lot of things from transformers. Do you think it is possible to import these (or inherit) from transformers? This would help reduce maintenance. I'm also fine doing that, since there are not too many follow-up PRs after a quantizer has been added. About the HfQuantizer class: a lot of its methods were created to fit the transformers structure, and I'm not sure we will need every one of them in diffusers. Of course, we can still do a follow-up PR to clean up.

Resolved review thread (outdated): src/diffusers/quantizers/base.py
@sayakpaul (Member, Author) commented Aug 20, 2024

@SunMarc I am guilty as charged, but we don't have transformers as a hard dependency for loading models in Diffusers. Pinging @DN6 to seek his opinion.

Update: Chatted with @DN6 as well. We think it's better to redefine these inside diffusers without the transformers-specific bits, which we can clean up in this PR.

@sayakpaul (Member, Author) commented:

@SunMarc I think this PR is ready for another review.

@SunMarc (Member) left a comment:

Thanks for adding this @sayakpaul !

Resolved review thread: src/diffusers/quantizers/base.py
@yiyixuxu (Collaborator) left a comment:

I don't think it makes sense to have a separate PR just to add a base class, because it's hard to understand which methods are needed; we should only introduce a minimal base class and gradually add functionality as needed.

Can we have a PR with a minimal working example?
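
For context, the kind of minimal surface being asked for might look roughly like the sketch below; the class name and wiring here are illustrative rather than the exact API that eventually landed in src/diffusers/quantizers/base.py.

# Illustrative sketch only: a pared-down quantizer base class in the spirit of
# the discussion above. Names are hypothetical, not the merged implementation.
from abc import ABC, abstractmethod


class MinimalDiffusersQuantizer(ABC):
    """Bare minimum needed to hook a quantization backend into from_pretrained()."""

    def __init__(self, quantization_config):
        self.quantization_config = quantization_config

    @abstractmethod
    def validate_environment(self, *args, **kwargs):
        """Raise if the backend (e.g. bitsandbytes) is missing or unusable."""

    @abstractmethod
    def check_quantized_param(self, model, param, param_name, state_dict, **kwargs):
        """Return True if this parameter should be handled by the backend."""

    @abstractmethod
    def create_quantized_param(self, model, param, param_name, target_device, state_dict, unexpected_keys):
        """Create and register the quantized parameter on the (meta) model."""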

@sayakpaul (Member, Author) commented Aug 22, 2024

Okay, so, do you want me to add everything needed for bitsandbytes integration in this PR? But do note that this won’t be very different from what we have in transformers.

@yiyixuxu (Collaborator) commented Aug 22, 2024

@sayakpaul
I think so because:

  1. it is better to review that way
  2. we don't need this class in diffusers on its own because it cannot be used yet, no?

@bghira (Contributor) commented Aug 22, 2024

Sometimes we can make a feature branch that a bunch of PRs get merged into before one big honkin' PR is pushed to main at the end; the pieces are all individually reviewed and can be tested. Is this a viable approach for including quantisation?

@sayakpaul (Member, Author) commented:

Okay, I will update this branch, @yiyixuxu.

@SunMarc (Member) commented Aug 23, 2024

cc @MekkCyber for visibility

@DN6 (Collaborator) commented Aug 28, 2024

Just a few considerations for the quantization design.

I would say the initial design should start with loading/inference at just the model level and then proceed to add functionality (pipeline-level loading, etc.).

The feature needs to perform the following functions:

  1. Perform on the fly quantization of large models so that they can be loaded in a low-memory dtype
    1. with from_pretrained
    2. with from_single_file
  2. Dynamically upcast to the appropriate compute dtype when running inference
  3. Save/Load already quantized versions of these large models (FP8, NF4)
  4. Allow loading/inference with LoRAs in these quantized models. (This we have to figure out in more detail)

At the moment, the most common ask seems to be the ability to load models into GPU using the FP8 dtype and run inference in a supported dtype by dynamically upcasting the necessary layers. NF4 is another format that's gaining attention.

So perhaps we should focus on this first. This mostly applies to the DiT models, but large models like CogVideo might also benefit from this approach.

Some example quantized versions of models that have been doing the rounds

To cover these initial cases, we can rely on Quanto (FP8) and BitsandBytes (NF4).

Example API:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, DiffusersQuantoConfig

# Load model in FP8 with Quanto and perform compute in the configured dtype.
quantization_config = DiffusersQuantoConfig(weights="float8", compute_dtype=torch.bfloat16)

transformer = FluxTransformer2DModel.from_pretrained(
    "<either diffusers format or quanto format weights>", quantization_config=quantization_config
)

pipe = FluxPipeline.from_pretrained("...", transformer=transformer)

The quantization config should probably take the following arguments

DiffusersQuantoConfig(
	weights_dtype="", # dtype to store weights
	compute_dtype="", # dtype to perform inference
	skip_quantize_modules=["ResBlock"]
)

I think we can initially rely on the dynamic upcasting operations performed by Quanto and BnB under the hood, and then expand on them if needed.

Some other considerations

  1. Since we have transformers models in diffusers that can also benefit from quantized loading, we might want to consider adding a Diffusers prefix to the quantization configs, e.g. DiffusersQuantoConfig, so that when we import quantization configs from transformers there aren't any conflicts.
  2. For saving and loading models we can start with models saved in Quanto/BnB format.
  3. One possible challenge with pipeline-level quantized loading is that we have a mix of transformers/diffusers models, so a single config to quantize/load both types might not be possible.
  4. Single-file loading has its own set of issues, such as dealing with checkpoints that have been naively quantized, e.g. safetensors.torch.save_file(model.to(torch.float8_e4m3fn), "model-fp8.safetensors") (see the sketch below), as well as loading full-pipeline single-file checkpoints. This applies to some of the Flux single-file checkpoints. But we can address these later.
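
For reference, the "naively quantized" single-file case in point 4 amounts to a plain dtype cast at save time with no quantization metadata attached. A minimal sketch of what produces such a checkpoint (assuming a loaded model object is available):

import torch
from safetensors.torch import save_file

# Naive "quantization": cast every tensor to float8 and dump the raw state dict.
# No quantization config or packing is saved, so a loader has to guess how to upcast.
fp8_state_dict = {name: tensor.to(torch.float8_e4m3fn) for name, tensor in model.state_dict().items()}
save_file(fp8_state_dict, "model-fp8.safetensors")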

@sayakpaul (Member, Author) commented Aug 28, 2024

This PR will be at the model-level itself. And we should not add multiple backends in a single PR. This PR aims to add bitsandbytes. We can do other backends taking this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.

Concretely, I would like to stick to the outline of the changes laid out in #9174 (along with anything related) for this PR.

The feature needs to perform the following functions

I won't advocate doing all of that in a single PR because it makes things very hard to review. We would rather move faster with something more minimal and confirm its effectiveness.

Allow loading/inference with LoRAs in these quantized models. (This we have to figure out in more detail)

Well, note that if the underlying LoRA wasn't trained with the base quantization precision, it might not perform as expected.

So perhaps we should focus on this first. This mostly applies to the DiT models but large models like CogVideo might also benefit with this approach.

Please note that bitsandbytes quantization mostly applies to nn.Linear, whereas Quanto is broader in scope (i.e., Quanto can be applied to nn.Conv2d as well).
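
As a quick way to see that in practice, you can inspect the module types of a bnb-quantized model: only the nn.Linear layers are swapped for bitsandbytes modules, while other layer types keep their original classes. A small sketch, reusing model_nf4 from the serialization snippet above:

import bitsandbytes as bnb
import torch.nn as nn

# model_nf4 is the NF4-quantized FluxTransformer2DModel loaded earlier.
bnb_linear, other = [], []
for name, module in model_nf4.named_modules():
    if isinstance(module, (bnb.nn.Linear4bit, bnb.nn.Linear8bitLt)):
        bnb_linear.append(name)
    elif isinstance(module, (nn.Conv2d, nn.LayerNorm, nn.GroupNorm)):
        other.append(name)

print(f"{len(bnb_linear)} Linear layers replaced by bnb modules; {len(other)} conv/norm layers untouched")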

@DN6 (Collaborator) commented Aug 28, 2024

This PR will be at the model-level itself. And we should not add multiple backends in a single PR. This PR aims to add bitsandbytes. We can do other backends taking this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.

Sounds good to me.

For this PR, let's do:

  1. from_pretrained only
  2. bnb quantization.

@sayakpaul (Member, Author) commented:

Very insightful comments, @yiyixuxu! I think I have resolved them all. LMK.

Inline review comment on:

class DiffusersAutoQuantizationConfig:
@DN6 (Collaborator) commented Oct 16, 2024:

I see this is similar to transformers, but I think the DiffusersAutoQuantConfig class is probably not needed.

This is just a simple mapping to a specific quantization config object. The from_pretrained method in the AutoQuantizer is just wrapping the AutoConfig from_pretrained.

I think we can just move these methods/logic directly into the AutoQuantizer.

@sayakpaul (Member, Author) replied:

If this is not a must-have, we could do this in a follow-up PR.

Resolved review threads:
  • src/diffusers/pipelines/pipeline_utils.py (outdated)
  • src/diffusers/pipelines/pipeline_utils.py (outdated)
  • src/diffusers/models/modeling_utils.py
  • src/diffusers/models/modeling_utils.py (outdated)
@ariG23498 (Contributor) commented:

Hi folks!

Thanks for working on this. I was able to run the following script on this branch and generate images on my 8 GB VRAM laptop.


from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch
import gc


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024


flush()

ckpt_id = "black-forest-labs/FLUX.1-dev"
ckpt_4bit_id = "sayakpaul/flux.1-dev-nf4-pkg"
prompt = "a billboard on highway with 'FLUX under 8' written on it"

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    ckpt_4bit_id,
    subfolder="text_encoder_2",
)

pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder_2=text_encoder_2_4bit,
    transformer=None,
    vae=None,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()


with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=256
    )


pipeline = pipeline.to("cpu")
del pipeline

flush()


transformer_4bit = FluxTransformer2DModel.from_pretrained(ckpt_4bit_id, subfolder="transformer")
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

print("Running denoising.")
height, width = 512, 768
images = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=5.5,
    height=height,
    width=width,
    output_type="pil",
).images
images[0].save("output.png")
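
As a side note, the script defines bytes_to_giga_bytes but never calls it; to print the peak VRAM used, something like the following at the end would do it (a sketch):

# Report peak GPU memory after the denoising run, using the helper defined above.
print(f"Peak GPU memory: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated()):.2f} GB")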


@yiyixuxu (Collaborator) left a comment:

let's merge this!

I asked @DN6 to open a follow-up PR for this: #9213 (comment).

@sayakpaul (Member, Author) commented:

PR merge contingent on #9720.

@sayakpaul merged commit b821f00 into main on Oct 21, 2024; 18 checks passed.
@sayakpaul deleted the quantization-config branch on October 21, 2024 04:42.


@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
Collaborator comment:

Something to consider. Let's assume you want to use a quantized transformer model in your code. With this naming, you would always need to set up imports in the following way.

from transformers import BitsAndBytesConfig
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig

Not a huge issue. Just giving a heads-up in case you want to consider renaming the config to something like DiffusersBitsAndBytesConfig.
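
For illustration, a script that quantizes both the T5 text encoder (through transformers) and the Flux transformer (through diffusers) ends up juggling the two classes like this; a sketch using the same model id as the snippets above:

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel

model_id = "black-forest-labs/FLUX.1-dev"

# transformers config for the T5 text encoder
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

# diffusers config (aliased to avoid the name clash) for the Flux transformer
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)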

set_module_kwargs["dtype"] = dtype

# bnb params are flattened.
if not is_quant_method_bnb and empty_state_dict[param_name].shape != param.shape:
Collaborator comment:

In this situation, aren't we skipping parameter shape checks for bnb loaded weights entirely? What happens when one attempts to load bnb weights but the flattened shape is incorrect?

Perhaps we could add a check_quantized_param_shape method to the DiffusersQuantizer base class, and in the BnBQuantizer we could check whether the shape matches the rule here:
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L816
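
A rough sketch of what such a check might look like, assuming the 4-bit packing rule the linked line encodes (two 4-bit values per byte, serialized as a flattened (ceil(n / 2), 1) tensor); the method name and wiring are hypothetical, not the merged implementation:

import math


def check_quantized_param_shape(self, param_name, current_param_shape, loaded_param_shape):
    # Hypothetical hook on the BnB quantizer: bnb packs two 4-bit values per byte,
    # so a weight with n elements is expected as a (ceil(n / 2), 1) tensor on disk.
    n_elements = math.prod(current_param_shape)
    expected_shape = ((n_elements + 1) // 2, 1)
    if tuple(loaded_param_shape) != expected_shape:
        raise ValueError(
            f"{param_name}: expected flattened 4-bit shape {expected_shape} "
            f"for an unquantized shape of {tuple(current_param_shape)}, got {tuple(loaded_param_shape)}."
        )
    return True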

Comment on lines +220 to +229
if not is_quantized or (
    not hf_quantizer.check_quantized_param(model, param, param_name, state_dict, param_device=device)
):
    if accepts_dtype:
        set_module_tensor_to_device(model, param_name, device, value=param, **set_module_kwargs)
    else:
        set_module_tensor_to_device(model, param_name, device, value=param)
else:
    hf_quantizer.create_quantized_param(model, param, param_name, device, state_dict, unexpected_keys)

Collaborator comment:

Small nit: IMO this is a bit more readable:

        if is_quantized and hf_quantizer.check_quantized_param(
            model, param, param_name, state_dict, param_device=device
        ):
            hf_quantizer.create_quantized_param(model, param, param_name, device, state_dict, unexpected_keys)
        else:
            if accepts_dtype:
                set_module_tensor_to_device(model, param_name, device, value=param, **set_module_kwargs)
            else:
                set_module_tensor_to_device(model, param_name, device, value=param)

"""adjust max_memory argument for infer_auto_device_map() if extra memory is needed for quantization"""
return max_memory

def check_quantized_param(
Collaborator comment:

IMO, check_is_quantized_param or check_if_quantized_param would convey more explicitly what this method does.



class BnB4BitBasicTests(Base4bitTests):
    def setUp(self):
Collaborator comment:

Would clear cache on setup as well.
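
i.e., something along these lines at the top of setUp (a sketch, assuming gc and torch are already imported in the test module):

    def setUp(self):
        # Free cached allocations before each test, not only in tearDown.
        gc.collect()
        torch.cuda.empty_cache()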

@Ednaordinary commented:

It would be useful to rename llm_int8_skip_modules or otherwise make it clearer that it is respected in both 4-bit and 8-bit mode; currently the docs sound like skipped modules are only respected in 8-bit mode, while the actual implementation suggests otherwise:

if self.quantization_config.llm_int8_skip_modules is not None:
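
To illustrate the point: despite the llm_int8_ prefix, passing the argument together with a 4-bit config also keeps the listed modules un-quantized (a sketch; the module name is just an example):

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["proj_out"],  # example module; skipped even in 4-bit mode
)
model = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=nf4_config
)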


@sayakpaul (Member, Author) commented:

Yeah, I think the documentation should reflect this. I guess this is safe to do, @SunMarc?

@SunMarc (Member) commented Nov 4, 2024:

Yeah, we should do that. Would you like to update this, @Ednaordinary? We should also do it in transformers when it gets merged.

@Ednaordinary commented:

Sure, @SunMarc. I'll make a PR when I'm able. Should I refactor the parameter name and include a deprecation notice, or just include a note in the docs?
