[Major Update] BitsandBytes Guidelines and Flux #981
164 comments · 328 replies
-
By the way, for old models, Forge perfectly reproduces A1111 1.10 for all advanced features like prompt grammar, BREAK, embeddings, LoRAs, DoRAs, etc.: Forge | Automatic1111 original 1.10
-
I tested this sanity check on 6GB of VRAM (3050) and it took 2.5 minutes.
-
You're ridiculously brilliant. Been following your coding run the past few days... just updated. Really nothing to say - just otherworldly skill/passion. One thing I did want to ask (mention?) is that Ostris released a new LoRA training tool and I think there is some variation in the formatting of Flux LoRAs. He used a format that required an update by Comfy earlier today. My first Flux LoRA works on Comfy atm but not on Forge 2.0. See his tweet here, which links to the PR he submitted to ComfyUI: https://x.com/ostrisai/status/1822367393555030487 Error is:
If this is an easy fix I'd really appreciate it. (And if it happens to be user error on my part, please forgive me.)
-
Please convert the FLUX.1-schnell model to NF4 too... thanks
-
Hi there, many thanks for the work! Any chance of supporting FP16 Flux?
-
Could this be implemented in ComfyUI?
-
How would I convert an SDXL checkpoint to NF4 to avoid the 60 seconds spent quantizing the model from FP16?
-
Is this using T5?
-
Hi guys, any clue why I'm getting a different result in my sanity check? 4090 in Stability Matrix here, xformers disabled.
-
I see GOAT!
-
Is there a way to check if a GPU supports NF4? (2070 Super, to be exact.)
-
Hi everyone, after testing with more devices, the speed-ups from bnb are more random than I thought. I will post more reliable numbers later. Thanks, and please do not attack this!
-
Great update, thanks a lot! Inpainting/soft-inpainting with Flux doesn't work for me yet.
-
img2img does not work :(
-
Thanks a lot for your hard work. Would it be possible for you to convert this to NF4 as well? https://huggingface.co/drbaph/FLUX.1-schnell-dev-merged/tree/main Please - it can do decent quality, better than schnell, in 4 steps.
-
Has anyone gotten a LoRA to work with the GGUF Q8 text encoder and Flux Dev on a 24GB GPU? It crashes whatever settings I use. I'm on Debian. I tried setting GPU Weights to 21GB, then 16GB, then 15GB; it always gives an out-of-memory error. I flushed my VRAM and my system RAM (32GB) before launching Forge. Any ideas? I'm running Forge in a Conda env. By the way, does flux1-dev-bnb-nf4-v2 outperform the full Dev model + Q8 text encoder?
-
I'm happy with the results, considering that my hardware is far from optimal: RTX 2060 6GB, 32GB RAM, Ryzen 7 2700.
-
I'm getting exactly the same image using the original (not quantized) flux1-dev, and I'm puzzled by it. Isn't it supposed to perform differently? How is this possible? Astronaut in a jungle, cold color palette, muted colors, very detailed, sharp focus. Update: I was too quick; zooming in showed some divergences, and subtracting one from the other confirmed it. But anyway, the level of matching is incredible. It might be a strong argument for using the quantized version.
-
When I decided to change the flux1-dev-bnb-nf4-v2.safetensors model to another one, I got the error AssertionError: You do not have CLIP state dict! Please tell me how to fix it.
-
Hello everyone, I am new to this topic of generating images and I am using webui_forge_cu121_torch231. I can generate images with the models and files that come by default, but when I download models from Civitai or Tensor.Art, they do appear in the list; when I try to generate with them, however, I get this error: ValueError: Failed to recognize model type! In the console it looks like this (please, someone help me - I've spent months trying to fix it, even resorted to ChatGPT, and have already reinstalled and reconfigured the 3 versions of webui_forge hundreds of times): Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] To create a public link, set During handling of the above exception, another exception occurred: Traceback (most recent call last):
-
Hello, I have an Apple M3 Pro and I can't get Forge to run properly. I have 36GB of RAM. Do you have any suggestions on how I can run Flux?
-
Try Pinokio, it has a one-click install of Forge.
-
Could someone confirm that these 2 models do not need T5, CLIP, and VAE added?
-
For those whose astronaut is different from the one shown: most likely you have a newer driver or something like that. I installed the Studio driver version and my result became similar.
-
I hope someone still reads this and can help. I'm using this model: flux1-dev-bnb-nf4-v2.safetensors ("Full flux-dev checkpoint with main model in NF4" <- Recommended), with a fresh install from the one-click installer.
-
I use 8GB VRAM and 32GB RAM; do I need to activate system swap (automatically manage paging file size)?
-
(Before we start, Forge now supports UI presets like this. Click this GIF to zoom in.)
(Again, before we start: to the best of my knowledge, I am the first one who made BitsandBytes low-bit acceleration actually work in real software for image diffusion. You can cite this page if you are writing a paper/survey and want to run some nf4/fp4 experiments for image diffusion models.) -> Update Aug 12: It seems that @sayakpaul is the real first one -> they made so many contributions 😹

(BitsandBytes is a standard low-bit accelerator that is already used by most Large Language Models like LLaMA, Phi, etc.)
Flux Checkpoints
The currently supported Flux checkpoints are:
Alternative: If you are looking to run the original raw Flux, or GGUF, or any checkpoint that needs separately loaded modules like clip, t5, ae, etc., go to this post instead. <- If you want to run some 6GB Flux UNets, also use this link.
Some info about NF4:
(i) NF4 is significantly faster than FP8. To run Flux, NF4 is significantly faster than FP8 on 6GB/8GB/12GB devices and slightly faster on >16GB VRAM devices. For GPUs with 6GB/8GB VRAM, the speed-up is about 1.3x to 2.5x (PyTorch 2.4, CUDA 12.4) or about 1.3x to 4x (PyTorch 2.1, CUDA 12.1). I just tested a 3070 Ti laptop (8GB VRAM): FP8 is 8.3 seconds per iteration; NF4 is 2.15 seconds per iteration (in my case, 3.86x faster). This is because of less swapping, and partially because NF4 natively uses `bnb.matmul_4bit` rather than `torch.nn.functional.linear`: casts are avoided and the computation is done with many low-bit CUDA tricks. (Update 1: bnb's speed-up is less salient on PyTorch 2.4, CUDA 12.4. Newer PyTorch may use an improved FP8 cast.) (Update 2: the above numbers are not a benchmark - I only tested very few devices. Other devices may perform differently.) (Update 3: I have now tested more devices and the speed-up is somewhat random, but I always see speed-ups - I will give more reliable numbers later!)

(ii) NF4 weights are about half the size of FP8.
(iii) NF4 may outperform FP8 (e4m3fn/e5m2) in numerical precision, and it does outperform e4m3fn/e5m2 in many (in fact, most) cases/benchmarks.
(iv) NF4 is technically guaranteed to outperform FP8 (e4m3fn/e5m2) in dynamic range in 100% of cases.
This is because FP8 just converts each tensor to FP8, while NF4 is a sophisticated method to convert each tensor to a combination of multiple tensors with float32, float16, uint8, int4 formats to achieve maximized approximation.
(Do not confuse FP8 with bnb-int8! In large language models, when people say "8 bits is better than 4 bits", they are (mostly) talking about bnb's 8-bit implementation, which is a more sophisticated method that also involves storing chunked float32 min/max norms. The FP8 here refers to the naked e4m3fn/e5m2 without extra norms.) <- You can say that bnb-8bit is more precise than nf4. But e4m3fn/e5m2 may not be.
For example, this is the Flux FP8 state dict (you can see that each weight only has one FP8 tensor):
This is the NF4 state dict; you can see that each weight is stored as 6 different tensors, and these tensors are in different precisions including float32, uint8, etc.:
So, do not be surprised if you find out that NF4 is actually more precise than FP8 despite its smaller size. And, do not argue too much if you still find FP8 more precise in some other cases ...
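If you want to see this for yourself, here is a minimal sketch of how you might inspect a checkpoint with the `safetensors` library. The filename and the layer-name filter below are placeholders - substitute whatever checkpoint and keys you actually have:

```python
# Minimal sketch: list the tensors stored for one (hypothetical) layer name.
from safetensors import safe_open

with safe_open("flux1-dev-bnb-nf4.safetensors", framework="pt") as f:
    for key in f.keys():
        if "double_blocks.0.img_attn.qkv" in key:  # substitute a real key from your file
            t = f.get_tensor(key)
            print(key, t.dtype, tuple(t.shape))

# In an NF4 checkpoint you should see several tensors per weight
# (the packed uint8 data plus absmax / quant-map / quant-state entries),
# while an FP8 checkpoint stores a single float8 tensor per weight.
```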
(The short version of the real method is that NF4 sorts the values in a tensor, splits them into chunks, and computes the absolute norm of each chunk. These norms are stored in higher precision. The "N" in NF4 means "Nested" (Update: "NormalFloat"), which means this compression happens twice: the first pass processes the tensor weights, and the second pass processes the extracted norms.) <- correct me if I am wrong

(The long version of the method is here)
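To make the idea concrete, here is a toy sketch of chunk-wise absmax quantization applied in that nested/double fashion. This is my own illustration, not bitsandbytes' actual implementation; the real NF4 codebook is spaced for normally distributed weights rather than evenly:

```python
import torch

def absmax_quantize_4bit(w: torch.Tensor, chunk: int = 64):
    """Toy chunk-wise 4-bit quantization: one higher-precision absmax per chunk."""
    flat = w.flatten()
    pad = (-flat.numel()) % chunk
    flat = torch.nn.functional.pad(flat, (0, pad)).reshape(-1, chunk)
    absmax = flat.abs().amax(dim=1, keepdim=True)    # kept in higher precision
    codebook = torch.linspace(-1.0, 1.0, 16)         # stand-in for the NF4 codebook
    idx = (flat / absmax).unsqueeze(-1).sub(codebook).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax, codebook

def dequantize(idx, absmax, codebook):
    return codebook[idx.long()] * absmax

w = torch.randn(4096)
idx, absmax, codebook = absmax_quantize_4bit(w)
# A second ("nested") pass would quantize `absmax` itself the same way,
# so that only the norms-of-norms stay in full precision.
print(w[:4])
print(dequantize(idx, absmax, codebook).flatten()[:4])
```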
Using Flux
If your device supports CUDA newer than 11.7, then you can use NF4. (Most RTX 3XXX/4XXX GPUs support this.) Congratulations. Enjoy the speed. In this case, you only need to download `flux1-dev-bnb-nf4.safetensors`.

If your device is a GTX 10XX/20XX GPU, then it may not support NF4; please download `flux1-dev-fp8.safetensors` instead.
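If you are unsure which camp your GPU falls into, a quick probe like the one below can help. This is my own heuristic, not an official check - it just tries a tiny bitsandbytes 4-bit layer and reports whether it runs:

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA device - use the FP8 checkpoint.")
else:
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability {major}.{minor}, CUDA {torch.version.cuda}")
    try:
        import bitsandbytes as bnb
        # Build a tiny NF4 linear layer; moving it to CUDA quantizes the weight.
        layer = bnb.nn.Linear4bit(16, 16, compute_dtype=torch.float16,
                                  quant_type="nf4").cuda()
        layer(torch.randn(1, 16, device="cuda", dtype=torch.float16))
        print("bitsandbytes NF4 looks usable on this GPU.")
    except Exception as e:
        print(f"NF4 test failed ({e!r}) - fall back to the FP8 checkpoint.")
```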
.In the UI, Forge gives you an option to force the loading weight type:
However, in most cases, you can just set it to `Auto` and it will use the default precision of your downloaded checkpoint.

Using this option, you can even try SDXL in NF4 and see what happens - in my case SDXL now really runs as fast as SD1.5 and images just spill out!
Important: Do not load the FP8 checkpoint `flux1-dev-fp8.safetensors` with the NF4 option! If you do, you will still get images eventually with no errors. However, you will waste 30 seconds de-quantizing the weights to FP16 and then another 60 seconds quantizing them again to NF4. And finally, your results will be worse than `flux1-dev-bnb-nf4.safetensors`, because the model has been quantized twice and quality is lost.

Getting Theoretical Upper Bound of Inference Speed
Forge's default preset should already be very fast (on my several devices it is already among the fastest, within about +/- 5% speed variation even when using FP8, compared to all SD-related software that does not use special packages like TensorRT, when timed externally with my phone rather than the on-screen numbers).
But you can make it even faster and really get the theoretical upper bound of inference speed for every different GPU by tuning the UI.
To begin with, this is your Flux model; let's use the FP8 model as an example, where the base diffusion model is about 11 or 12GB.

Then let's say this is your system: you have 8GB of VRAM, 32GB of CPU memory, and a block called "shared GPU memory" of 16GB.

Then you want to move the model to the GPU. But the model is 12GB, bigger than your 8GB of GPU memory. So what do you do?

The answer is to split the model into two parts: one part goes to the GPU, the other to a "swap" location.

If you select CPU as the swap location, your model will be split between CPU memory and the GPU:

If you select "Shared" as the swap location, your model will be split between the GPU and shared memory:

On newer devices, the "Shared" offload/swap is about 15% faster than CPU swap. However, some other devices report crashes when using "Shared" swap.
Then you can select the maximum amount of model weights to load onto the GPU.

Below is an example using Flux-dev in diffusion:

Another example:

A larger GPU Weights value means faster speed. However, if the value is too large, you will run into GPU problems and the speed will drop to something like 10x slower.

A smaller GPU Weights value means slower speed. However, you will be able to diffuse larger images, because you now have more free VRAM.
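As a rough mental model of what the slider does (my own toy numbers, not Forge's internals): the GPU Weights value decides how the checkpoint is split between VRAM and the swap location, and what is left over for activations:

```python
# Toy illustration of the GPU Weights split; the numbers are placeholders.
model_size_gb = 12.0     # e.g. the FP8 Flux-dev diffusion model
gpu_weights_gb = 6.0     # the value you set in the UI
vram_gb = 8.0            # your card

on_gpu = min(gpu_weights_gb, model_size_gb)
swapped = model_size_gb - on_gpu          # lives in CPU RAM or shared memory
free_for_activations = vram_gb - on_gpu   # left for the actual diffusion work

print(f"on GPU: {on_gpu} GB, swapped: {swapped} GB, "
      f"free for activations: {free_for_activations} GB")
# Raising gpu_weights_gb reduces swapping (faster) until free_for_activations
# gets too small, at which point speed collapses - exactly the trade-off above.
```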
Okay, then the last thing: the Swap Method.

Queue: you load a layer to the GPU, then compute, then load another layer, then compute, ... just like a queue.

ASYNC: you have two workers. One worker always computes layers; the other always loads layers to the GPU. They work at the same time.

Some experiments report that the ASYNC method is 30% faster than Queue. However, there is a drawback: one worker may mistakenly move too many layers to the GPU, so that the other worker no longer has enough GPU memory to compute. In that case, the speed suddenly becomes about 10x slower. (See the sketch below for the idea.)
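Here is a conceptual sketch of the two methods - not Forge's actual code; `load` and `compute` are hypothetical callbacks standing in for the host-to-device copy and the layer forward pass:

```python
import queue
import threading

def run_queue(layers, load, compute):
    # Queue: strictly alternate "load one layer" and "compute one layer".
    for layer in layers:
        load(layer)
        compute(layer)

def run_async(layers, load, compute, prefetch_limit=2):
    # ASYNC: a loader thread prefetches layers while the main thread computes.
    # Bounding the queue is what keeps the loader from flooding GPU memory.
    ready = queue.Queue(maxsize=prefetch_limit)

    def loader():
        for layer in layers:
            load(layer)
            ready.put(layer)
        ready.put(None)  # sentinel: no more layers

    threading.Thread(target=loader, daemon=True).start()
    while (layer := ready.get()) is not None:
        compute(layer)
```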
Based on the above info, you can now tune the best config for your device:
Once you are done, you are at the theoretical upper bound of inference speed for your device (without using methods like TensorRT that degrade compatibility with things like some control models).

Finally, you may need to turn on system swap if it crashes:
Distilled CFG Guidance
Flux-dev is a distilled model. It is recommended to set CFG=1 and not use negative prompts; use "Distilled CFG Guidance" instead. The default value is 3.5.

Note that when CFG=1, the negative prompt UI will be greyed out.
Sanity Check
Finally, use this as a sanity check! (thanks for the prompt)
Astronaut in a jungle, cold color palette, muted colors, very detailed, sharp focus
Steps: 20, Sampler: Euler, Schedule type: Simple, CFG scale: 1, Distilled CFG Scale: 3.5, Seed: 12345, Size: 896x1152, Model: flux1-dev-bnb-nf4-v2
Make sure that you get a similar image:
Update 1:
Hi everyone, after testing with more devices, the speed-ups from bnb are more random than I thought. I will post more reliable numbers later. Thanks, and please do not attack this!