[GGUF and Flux full fp16 Model] loading T5, CLIP + new VAE UI #1050
42 comments · 91 replies
I think Forge UI needs some dropdown menus like the ones I circled in red. They are choices where only one can be displayed at a time, so dropdowns would help take up less space on the screen.
-
Yeah, I believe the VAE / Text Encoder entries should form a double row when stacked. Also, a folder standard should be agreed on: Forge is storing clip-l and t5 in models\text_encoder while ComfyUI is storing these files in models\clip, and this will lead to doubling up models.
-
I believe this broke the …
Thank you for all your work.
-
How do I open the VAE / Text Encoder section? I can't find it.
-
I only heard about the GGUF of flux1-dev this morning and it's already here in Forge... I also read this morning that LoRAs now work in NF4. Many thanks @lllyasviel
-
Do the clip and vae paths respect the args --vae-dir and --clip-models-path? It seems not...
-
Now it would be great if the XYZ grid function were working, to make comparisons :)
-
GGUF Q4_0 inference speed is faster than FP8 for me, though unfortunately it takes 100+ seconds to move the model/transformer each time, making the speed increase moot since a minimum of 100 seconds is added to each generation. I don't know why: when loading an FP8 Flux model, model moving for CLIP+T5/Transformer/VAE all takes ~0 seconds, but when introducing the Q4_0 quantization of the transformer, it takes 100-300 seconds to move the model/transformer and begin inference. This is without LoRAs. I'm going to assume part of the reason is being on a low VRAM/RAM system and relying on a swap file, though I figured loading an even smaller transformer would have been less prone to RAM/swap related issues.
-
Has anyone done a video about GGUF quants with Flux? Or is it because this stuff is moving too fast?
-
I have an RTX 3090 and 32GB of RAM. ForgeUI crashes when I try to use fp16, and I see the message "Using Default T5 Data Type: torch.float16" in the console. I can use full precision in ComfyUI without a hitch.
-
Can anyone please tell me where to download the GGUFs for Flux? Are they the same as the ones I've seen on Hugging Face, or are there special ones for Forge? Thanks in advance <3
-
I'm facing an issue where the generated image becomes totally black at the last step when using a GGUF checkpoint. I suspect it's caused by the wrong VAE. Which VAE should be used for GGUF? Any hints?
-
I have an RTX 3060 12GB and 32GB of RAM. ForgeUI crashes when I try to use the full flux-dev model (23GB) with fp16, and I see the message "Using Default T5 Data Type: torch.float16" in the console. I can use full precision in ComfyUI without a hitch.
-
From the topmost message:
The advantage of supporting a T5 GGUF in Forge is that I hope Forge will not crash during inference. Currently it crashes on a laptop with 32GB RAM and an RTX 3070 when I try to generate an image with Flux.dev GGUF Q8_0.
-
stable-diffusion.cpp has already implemented Flux using ggml: https://github.com/leejet/stable-diffusion.cpp/blob/master/docs/flux.md
-
If you happen to have 12GB of VRAM and want to know which GGUF model is the best, I recommend Q5_K_S. It's about 11 seconds slower than NF4 on my PC, but it's way more accurate; it generates results very close to the Q8 version. I have a 4070 Super and 32GB of DDR5 RAM.
-
Has anyone successfully made the full Flux Dev model + FP16 text encoder work at decent speed? The first generation seems promising (1.77 s/it); however, when I try to do another generation without changing much, speed drops dramatically to 13.38 s/it. I have a 24GB RTX 3090 and 32GB of CPU RAM, and I set GPU Weights to 22100MB because when I launch SD Forge I have 23782.01 MB of VRAM free. Thanks for your advice! Here are my logs: Begin to load 1 model
-
SwarmUI (ComfyUI based) is ~30% faster for me with a GGUF model than Forge (30 vs 42 seconds, 2nd run) with the following settings:
Prompt: graffiti, the text: "Flux1 Dev Q8_0 gguf" on a white wall
Log: To load target model JointTextEncoder
version: f2.0.1v1.10.1-previous-421-g59dd981f • python: 3.10.6 • torch: 2.3.1+cu121 • xformers: N/A • gradio: 4.40.0
Windows 10 / 64GB RAM / 3090 (24GB VRAM)
What can I do to get the speed of SwarmUI?
-
What torch/CUDA version should we use to take advantage of your optimizations for the most speed?
-
How are you getting so many options under VAE / Text Encoder? I just installed Forge today, so I'm completely new to this.
-
Attempting to generate using the given instructions. I set up my VAE and text encoders, but when I press Generate my computer restarts on a blue screen. Anybody got an idea? The blue screen reads STOP CODE: Memory Management.
-
Oh, so clip doesn't go in the clip folder, it goes in the text_encoder folder... why not name text_encoder "clip" then?
-
I have an 8GB VRAM + 16GB RAM PC. Is there anything I can do?
-
Thanks for this, I've finally got past the "you do not have clip state dict!" situation. Now I'm wondering if there's some incompatibility between t5xxl_fp16.safetensors, ae.safetensors, and the checkpoints I tried to use, namely flux1-schnell.safetensors and realflux10b_10bTransformerDev.safetensors. I have them all in the right folders according to the instructions here and the errors are gone, but the results are garish colours and a blurry mess that's slow to generate on my GeForce 2060 (total VRAM 6144 MB, total RAM 65457 MB). I tried altering things like CFG to no avail.
When using TAESD I can see the image when it first appears, looking pretty good at the first sampling step. Then with progressive sampling steps it deteriorates, until it finishes all blurred. I need guidance on whether different versions of the other files mentioned above are required to ensure compatibility with the apparently many different versions of FLUX checkpoints.
-
Thank you Symbomatrix, I'll look into this later. Right now I just want this site to stop bombarding me with emails containing other people's conversations, and I can't find the setting due to its hideously messy, inaccessible appearance.
-
So, I'm not very technical in these matters, but what is meant by "Now you can even load clip-l for sd1.5 separately", as mentioned by @lllyasviel in the original post? What would clip-l do if I load it with SD 1.5? I added it to the VAE / Text Encoder field when using SD 1.5 and it made no difference. In simple words, what kind of prompt can I use with clip-l and SD 1.5 that would make a difference? Can someone give an example please?
-
I’m currently trying out the Flux model, specifically
Does anyone have hints or suggestions about where to look or how to debug this further? Thank you!
-
I am trying to run an SDXL GGUF model in the hope of getting faster speed compared to the full fp16 model, but I got this error:
-
The old Automatic1111 user interface for VAE selection is not powerful enough for modern models.
Forge makes minor modifications so that the UI stays as close as possible to A1111 while also meeting the demands of newer models.
New UI
For example, Stable Diffusion 1.5
(Before / after UI screenshots.)
Support All Flux Models for Ablative Experiments
Download base model and vae (raw float16) from Flux official here and here.
Download clip-l and t5-xxl from here or our mirror
Put base model in models\Stable-diffusion.
Put vae in models\VAE.
Put clip-l and t5 in models\text_encoder.
Possible options
You can load them in nearly arbitrary combinations:
etc ...
Fun fact
Now you can even load clip-l for sd1.5 separately
GGUF
Download vae (raw float16, 'ae.safetensors') from Flux official here or here.
Download clip-l and t5-xxl from here or our mirror
Download GGUF models here or here.
Put base model in models\Stable-diffusion.
Put vae in models\VAE.
Put clip-l and t5 in models\text_encoder (a quick path-check sketch follows below).
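The folder layout is the same as for the raw float16 workflow above; only the checkpoint file differs. If you want to double-check it, here is a minimal sketch in Python; the install path and the filenames are examples of commonly seen names, not requirements, so adjust them to whatever you actually downloaded:

```python
# Sanity-check sketch for the Forge folder layout described above.
# The install path and the filenames below are examples only.
from pathlib import Path

forge = Path(r"C:\stable-diffusion-webui-forge")  # adjust to your install

expected = {
    r"models\Stable-diffusion": ["flux1-dev.safetensors"],   # or e.g. a flux1-dev-Q4_0.gguf quant
    r"models\VAE":              ["ae.safetensors"],
    r"models\text_encoder":     ["clip_l.safetensors", "t5xxl_fp16.safetensors"],
}

for folder, names in expected.items():
    for name in names:
        p = forge / folder / name
        print(("OK      " if p.exists() else "MISSING ") + str(p))
```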
Below are some comments copied from elsewhere
Also people need to notice that GGUF is a pure compression tech, which means it is smaller but also slower because it has extra steps to decompress tensors and computation is still pytorch. (unless someone is crazy enough to port llama.cpp compilers) (UPDATE Aug 24: Someone did it!! Congratulations to leejet for porting it to stable-diffusion.cpp here. Now people need to take a look at more possibilities for a cpp backend...)
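As a rough illustration of that point, here is a minimal sketch assuming a simplified Q4_0-style layout (one fp16 scale plus 32 four-bit codes per block; the real format additionally packs two codes per byte, which is skipped here). The dequantization is the extra work; the matmul afterwards is ordinary pytorch:

```python
# Minimal sketch of "decompress, then compute in pytorch" for a Q4_0-style tensor.
# Simplified: real GGUF packs two 4-bit codes per byte and stores blocks in C structs.
import torch

BLOCK = 32  # Q4_0 block (chunk) size

def dequantize_q4_0(scales: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    """scales: (n_blocks,) fp16 per-block scales; codes: (n_blocks, BLOCK) ints in [0, 15]."""
    # Q4_0 stores offset codes: weight = d * (q - 8)
    return (scales[:, None].float() * (codes.float() - 8.0)).reshape(-1)

# Fake quantized weights for a 64x64 linear layer (64*64 / 32 = 128 blocks).
n_blocks = 64 * 64 // BLOCK
scales = (torch.rand(n_blocks) * 0.02).half()
codes = torch.randint(0, 16, (n_blocks, BLOCK))

w = dequantize_q4_0(scales, codes).reshape(64, 64)  # extra step: decompression
x = torch.randn(1, 64)
y = x @ w.T                                         # the compute itself is still plain pytorch
print(y.shape)  # torch.Size([1, 64])
```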
BNB (NF4) is a computational acceleration library that replaces pytorch ops with native low-bit cuda kernels, so the computation itself is faster.
NF4 and Q4_0 should be very similar, with the difference that Q4_0 has a smaller chunk size and NF4 has more gaussian-distributed quants. I do not recommend trusting comparisons of one or two images. I would also like a smaller chunk size in NF4, but it seems that bnb hard-coded some thread numbers and changing that is non-trivial.
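A toy way to see the two design choices side by side; this is my own sketch under stated assumptions (absmax scaling per block, 16 evenly spaced levels with block size 32 for the Q4_0-like case, 16 levels at normal-distribution quantiles with block size 64 for the NF4-like case), not bitsandbytes' exact codebook or llama.cpp's exact rounding:

```python
# Sketch: Q4_0-like = 16 evenly spaced levels per 32-value block,
# NF4-like = 16 levels placed at quantiles of a normal distribution per 64-value block.
# (Illustrative only; not the exact bitsandbytes NF4 codebook.)
import torch

def quantize_roundtrip(w: torch.Tensor, levels: torch.Tensor, block: int) -> torch.Tensor:
    w = w.reshape(-1, block)
    scale = w.abs().amax(dim=1, keepdim=True)            # absmax per block
    normed = w / scale
    idx = (normed[..., None] - levels).abs().argmin(-1)  # nearest of the 16 levels
    return (levels[idx] * scale).reshape(-1)

q4_levels = (torch.arange(16) - 8) / 8.0                 # uniform grid, as in Q4_0
probs = (torch.arange(16) + 0.5) / 16
nf4_levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
nf4_levels = nf4_levels / nf4_levels.abs().max()         # normal quantiles scaled to [-1, 1]

w = torch.randn(4096)                                    # weights are roughly gaussian
for name, levels, block in [("Q4_0-like", q4_levels, 32), ("NF4-like", nf4_levels, 64)]:
    err = (quantize_roundtrip(w, levels, block) - w).pow(2).mean().item()
    print(f"{name:10s} block={block:2d}  mse={err:.6f}")
```

Run it a few times with different seeds; as noted above, single comparisons are not very trustworthy.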
However, Q4_1 and Q4_K are technically guaranteed to be more precise than NF4, but with even more computation overhead, and such overhead may be more costly than simply moving higher-precision weights from CPU to GPU. If that happens, the quant loses its point.
And Q8 is always more precise than FP8 (and a bit slower than fp8).
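A rough way to check that is to compare round-trip error on weight-like values. The sketch below uses my simplified take on Q8_0 (int8 codes plus an fp16 absmax scale per 32-value block) against a plain cast to fp8 e4m3, and it assumes a PyTorch build that ships the float8 dtypes:

```python
# Rough check of "Q8 is more precise than fp8": round-trip error of a simplified
# Q8_0 (per-block int8 codes + fp16 absmax scale) vs a plain cast to fp8 e4m3.
# Requires a PyTorch build with float8 dtypes (2.1 or newer).
import torch

def q8_0_roundtrip(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    w = w.reshape(-1, block)
    d = (w.abs().amax(dim=1, keepdim=True) / 127.0).half().float()  # per-block scale
    q = torch.clamp(torch.round(w / d), -127, 127)                  # int8 codes
    return (q * d).reshape(-1)

def fp8_roundtrip(w: torch.Tensor) -> torch.Tensor:
    return w.to(torch.float8_e4m3fn).to(torch.float32)

w = torch.randn(1 << 16) * 0.02   # weight-like values
for name, fn in [("Q8_0-like", q8_0_roundtrip), ("fp8 e4m3", fp8_roundtrip)]:
    err = (fn(w) - w).pow(2).mean().item()
    print(f"{name:10s} mse={err:.3e}")
```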
Precision: fp16 >> Q8 > Q4
Precision for Q8: Q8_K (not available) > Q8_1 (not available) > Q8_0 >> fp8
Precision for Q4: Q4_K_S >> Q4_1 > Q4_0
Precision for NF4: between Q4_1 and Q4_0; may be slightly better or worse since they are in different metric systems
Speed (if not offloading, e.g., 80GB VRAM H100), from fast to slow: fp16 ≈ NF4 > fp8 >> Q8 > Q4_0 >> Q4_1 > Q4_K_S > others
Speed (if offloading, e.g., 8GB VRAM), from fast to slow: NF4 > Q4_0 > Q4_1 ≈ fp8 > Q4_K_S > Q8_0 > Q8_1 > others ≈ fp16