[GGUF and Flux full fp16 Model] loading T5, CLIP + new VAE UI #1050
42 comments · 91 replies
I think Forge UI needs some dropdown menus like the ones I circled in red. They are choices where only one can be displayed at a time, so dropdowns would help take up less space on the screen.
-
Yeah, I believe the VAE / Text Encoder entries should form a double row when stacked. Also, a folder standard should be agreed on: Forge is storing clip-l and t5 in models\text_encoder while ComfyUI is storing these files in models\clip, and this will lead to doubling up models.
-
I believe this broke the …
Thank you for all your work.
-
How do I open the VAE / Text Encoder section? I can't find it.
-
I only heard about the GGUF of flux1-dev this morning and it's already here in Forge... I also read this morning that LoRAs now work in NF4. Many thanks @lllyasviel
-
Do the clip and vae paths respect the args --vae-dir and --clip-models-path? It seems not...
-
Now it would be great if the XYZ grid function were working, to make comparisons :)
-
GGUF Q4_0 inference speed is faster than FP8 for me, though unfortunately it takes 100+ seconds to move the model/transformer each time, making the speed increase moot since a minimum of 100 seconds is added to each generation. I don't know why: when loading an FP8 Flux model, model moving for CLIP+T5/Transformer/VAE all takes ~0 seconds, but when introducing the Q4_0 quantization of the transformer, it takes 100-300 seconds to move the model/transformer and begin inference. This is without LoRAs. I'm going to assume part of the reason is being on a low VRAM/RAM system and relying on a swap file, though I figured loading an even smaller transformer would have been less prone to RAM/swap related issues.
-
Has anyone done a video about GGUF quants with Flux? Or is it because this stuff is moving too fast?
-
I have an RTX 3090 and 32GB of RAM. ForgeUI crashes when I try to use fp16, and I see the message "Using Default T5 Data Type: torch.float16" in the console. I can use full precision in ComfyUI without a hitch.
-
Can anyone please tell me where to download the GGUFs for Flux? Are they the same as the ones I've seen on Hugging Face, or are there special ones for Forge? Thanks in advance <3
-
I'm facing an issue where the generated image becomes totally black at the last step when using a GGUF checkpoint. I suspect it's caused by the wrong VAE. Which VAE should be used for GGUF? Any hints?
-
I have an RTX 3060 12GB and 32GB of RAM. ForgeUI crashes when I try to use the full flux-dev model (23GB) with fp16, and I see the message "Using Default T5 Data Type: torch.float16" in the console. I can use full precision in ComfyUI without a hitch.
-
From the topmost message:
The advantage of supporting a T5 GGUF in Forge is that I hope Forge will not crash during inference. Currently it crashes on a laptop with 32GB RAM and an RTX 3070 when I try to generate an image with Flux.dev GGUF Q8_0.
-
stable-diffusion.cpp has already implemented Flux using ggml: https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/flux.md
-
If you happen to have 12GB of VRAM and want to know which GGUF model is the best, I recommend Q5_K_S. It's about 11 seconds slower than NF4 on my PC, but it's way more accurate; it generates results very close to the Q8 version. I have a 4070 Super and 32GB of DDR5 RAM.
-
Has anyone successfully made the full Flux Dev model + FP16 text encoder work at decent speed? The first generation seems promising (1.77 s/it); however, when I try to do another generation without changing much, speed drops dramatically to 13.38 s/it. I have a 24GB RTX 3090 and 32GB of CPU RAM, and I set GPU Weights to 22100MB because when I launch SD Forge I have 23782.01 MB of VRAM free. Thanks for your advice! Here are my logs: Begin to load 1 model
-
SwarmUI (ComfyUI based) is ~30% faster for me with a GGUF model than Forge (30 vs 42 seconds, 2nd run) with the following settings:
Prompt: graffiti, the text: "Flux1 Dev Q8_0 gguf" on a white wall
Log: To load target model JointTextEncoder
version: f2.0.1v1.10.1-previous-421-g59dd981f • python: 3.10.6 • torch: 2.3.1+cu121 • xformers: N/A • gradio: 4.40.0
Windows 10 / 64GB RAM / 3090 (24GB VRAM)
What can I do to get the speed of SwarmUI?
-
What torch/CUDA version should we use to take advantage of your optimizations for the most speed?
-
How are you getting so many options under VAE / Text Encoder? I just installed Forge today, so I'm completely new to this.
-
Attempting to generate using the given instructions. I set up my VAE and text encoders, but when I press Generate my computer restarts on a blue screen. Anybody got an idea? The blue screen reads STOP CODE: Memory Management.
-
Oh, so clip doesn't go in the clip folder, it goes in the text_encoder folder... why not name text_encoder "clip" then?
-
I have an 8GB VRAM + 16GB RAM PC. Is there anything I can do?
-
Thanks for this, I've finally got past the "you do not have clip state dict!" situation. Now I'm wondering if there's some incompatibility between t5xxl_fp16.safetensors, ae.safetensors, and the checkpoints I tried to use, namely flux1-schnell.safetensors and realflux10b_10bTransformerDev.safetensors. I have them all in the right folders according to the instructions here and the errors are gone, but the results are garish colours and a blurry mess that's slow to generate on my GeForce 2060 (total VRAM 6144 MB, total RAM 65457 MB). I tried altering things like CFG to no avail.
When using TAESD I can see the image when it first appears, looking pretty good at the first sampling step. Then with progressive sampling steps it deteriorates, until it finishes all blurred. I need guidance on whether different versions of the other files mentioned above are required to ensure compatibility with the apparently many different versions of FLUX checkpoints.
-
Thank you Symbomatrix, I'll look into this later. Right now I just want this site to stop bombarding me with emails containing other people's conversations, and I can't find the setting due to its hideously messy, inaccessible appearance.
-
So, I'm not very technical in these matters, but what is meant by "Now you can even load clip-l for sd1.5 separately", as mentioned by @lllyasviel in the original post? What would clip-l do if I load it with SD 1.5? I added it to the VAE / Text Encoder field when using SD 1.5 and it made no difference. In simple words, what kind of prompt can I use with clip-l and SD 1.5 that would make a difference? Can someone give an example please?
-
I’m currently trying out the Flux model, specifically
Does anyone have hints or suggestions about where to look or how to debug this further? Thank you!
-
I am trying to run an SDXL GGUF model in the hope of getting faster speed compared to the full fp16 model, but I got this error:
-
The old Automatic1111 user interface for VAE selection is not powerful enough for modern models.
Forge makes minor modifications so that the UI stays as close as possible to A1111 while also meeting the demands of newer models.
New UI
For example, Stable Diffusion 1.5
(Before / after UI screenshots.)
Support All Flux Models for Ablative Experiments
Download base model and vae (raw float16) from Flux official here and here.
Download clip-l and t5-xxl from here or our mirror
Put base model in models\Stable-diffusion.
Put vae in models\VAE.
Put clip-l and t5 in models\text_encoder.
Possible options
You can load them in nearly arbitrary combinations:
etc ...
Fun fact
Now you can even load clip-l for sd1.5 separately
GGUF
Download vae (raw float16, 'ae.safetensors') from Flux official here or here.
Download clip-l and t5-xxl from here or our mirror
Download GGUF models here or here.
Put base model in models\Stable-diffusion.
Put vae in models\VAE.
Put clip-l and t5 in models\text_encoder (a quick path-check sketch follows below).
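The folder layout is the same as for the raw float16 workflow above; only the checkpoint file differs. If you want to double-check it, here is a minimal sketch in Python; the install path and the filenames are examples of commonly seen names, not requirements, so adjust them to whatever you actually downloaded:

```python
# Sanity-check sketch for the Forge folder layout described above.
# The install path and the filenames below are examples only.
from pathlib import Path

forge = Path(r"C:\stable-diffusion-webui-forge")  # adjust to your install

expected = {
    r"models\Stable-diffusion": ["flux1-dev.safetensors"],   # or e.g. a flux1-dev-Q4_0.gguf quant
    r"models\VAE":              ["ae.safetensors"],
    r"models\text_encoder":     ["clip_l.safetensors", "t5xxl_fp16.safetensors"],
}

for folder, names in expected.items():
    for name in names:
        p = forge / folder / name
        print(("OK      " if p.exists() else "MISSING ") + str(p))
```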
Below are some comments copied from elsewhere
Also people need to notice that GGUF is a pure compression tech, which means it is smaller but also slower because it has extra steps to decompress tensors and computation is still pytorch. (unless someone is crazy enough to port llama.cpp compilers) (UPDATE Aug 24: Someone did it!! Congratulations to leejet for porting it to stable-diffusion.cpp here. Now people need to take a look at more possibilities for a cpp backend...)
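As a rough illustration of that point, here is a minimal sketch assuming a simplified Q4_0-style layout (one fp16 scale plus 32 four-bit codes per block; the real format additionally packs two codes per byte, which is skipped here). The dequantization is the extra work; the matmul afterwards is ordinary pytorch:

```python
# Minimal sketch of "decompress, then compute in pytorch" for a Q4_0-style tensor.
# Simplified: real GGUF packs two 4-bit codes per byte and stores blocks in C structs.
import torch

BLOCK = 32  # Q4_0 block (chunk) size

def dequantize_q4_0(scales: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    """scales: (n_blocks,) fp16 per-block scales; codes: (n_blocks, BLOCK) ints in [0, 15]."""
    # Q4_0 stores offset codes: weight = d * (q - 8)
    return (scales[:, None].float() * (codes.float() - 8.0)).reshape(-1)

# Fake quantized weights for a 64x64 linear layer (64*64 / 32 = 128 blocks).
n_blocks = 64 * 64 // BLOCK
scales = (torch.rand(n_blocks) * 0.02).half()
codes = torch.randint(0, 16, (n_blocks, BLOCK))

w = dequantize_q4_0(scales, codes).reshape(64, 64)  # extra step: decompression
x = torch.randn(1, 64)
y = x @ w.T                                         # the compute itself is still plain pytorch
print(y.shape)  # torch.Size([1, 64])
```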
BNB (NF4) is a computational acceleration library that replaces pytorch ops with native low-bit cuda kernels, so the computation itself is faster.
NF4 and Q4_0 should be very similar, with the difference that Q4_0 has a smaller chunk size and NF4 has more gaussian-distributed quants. I do not recommend trusting comparisons of one or two images. I would also like a smaller chunk size in NF4, but it seems that bnb hard-coded some thread numbers and changing that is non-trivial.
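A toy way to see the two design choices side by side; this is my own sketch under stated assumptions (absmax scaling per block, 16 evenly spaced levels with block size 32 for the Q4_0-like case, 16 levels at normal-distribution quantiles with block size 64 for the NF4-like case), not bitsandbytes' exact codebook or llama.cpp's exact rounding:

```python
# Sketch: Q4_0-like = 16 evenly spaced levels per 32-value block,
# NF4-like = 16 levels placed at quantiles of a normal distribution per 64-value block.
# (Illustrative only; not the exact bitsandbytes NF4 codebook.)
import torch

def quantize_roundtrip(w: torch.Tensor, levels: torch.Tensor, block: int) -> torch.Tensor:
    w = w.reshape(-1, block)
    scale = w.abs().amax(dim=1, keepdim=True)            # absmax per block
    normed = w / scale
    idx = (normed[..., None] - levels).abs().argmin(-1)  # nearest of the 16 levels
    return (levels[idx] * scale).reshape(-1)

q4_levels = (torch.arange(16) - 8) / 8.0                 # uniform grid, as in Q4_0
probs = (torch.arange(16) + 0.5) / 16
nf4_levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
nf4_levels = nf4_levels / nf4_levels.abs().max()         # normal quantiles scaled to [-1, 1]

w = torch.randn(4096)                                    # weights are roughly gaussian
for name, levels, block in [("Q4_0-like", q4_levels, 32), ("NF4-like", nf4_levels, 64)]:
    err = (quantize_roundtrip(w, levels, block) - w).pow(2).mean().item()
    print(f"{name:10s} block={block:2d}  mse={err:.6f}")
```

Run it a few times with different seeds; as noted above, single comparisons are not very trustworthy.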
However, Q4_1 and Q4_K are technically guaranteed to be more precise than NF4, but with even more computation overhead, and such overhead may be more costly than simply moving higher-precision weights from CPU to GPU. If that happens, the quant loses its point.
And Q8 is always more precise than FP8 (and a bit slower than fp8).
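A rough way to check that is to compare round-trip error on weight-like values. The sketch below uses my simplified take on Q8_0 (int8 codes plus an fp16 absmax scale per 32-value block) against a plain cast to fp8 e4m3, and it assumes a PyTorch build that ships the float8 dtypes:

```python
# Rough check of "Q8 is more precise than fp8": round-trip error of a simplified
# Q8_0 (per-block int8 codes + fp16 absmax scale) vs a plain cast to fp8 e4m3.
# Requires a PyTorch build with float8 dtypes (2.1 or newer).
import torch

def q8_0_roundtrip(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    w = w.reshape(-1, block)
    d = (w.abs().amax(dim=1, keepdim=True) / 127.0).half().float()  # per-block scale
    q = torch.clamp(torch.round(w / d), -127, 127)                  # int8 codes
    return (q * d).reshape(-1)

def fp8_roundtrip(w: torch.Tensor) -> torch.Tensor:
    return w.to(torch.float8_e4m3fn).to(torch.float32)

w = torch.randn(1 << 16) * 0.02   # weight-like values
for name, fn in [("Q8_0-like", q8_0_roundtrip), ("fp8 e4m3", fp8_roundtrip)]:
    err = (fn(w) - w).pow(2).mean().item()
    print(f"{name:10s} mse={err:.3e}")
```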
Precision: fp16 >> Q8 > Q4
Precision for Q8: Q8_K (not available) > Q8_1 (not available) > Q8_0 >> fp8
Precision for Q4: Q4_K_S >> Q4_1 > Q4_0
Precision for NF4: between Q4_1 and Q4_0; may be slightly better or worse since they are in different metric systems
Speed (if not offloading, e.g., 80GB VRAM H100), from fast to slow: fp16 ≈ NF4 > fp8 >> Q8 > Q4_0 >> Q4_1 > Q4_K_S > others
Speed (if offloading, e.g., 8GB VRAM), from fast to slow: NF4 > Q4_0 > Q4_1 ≈ fp8 > Q4_K_S > Q8_0 > Q8_1 > others ≈ fp16