[Feat]: Long Clip support #624

Open
miasik opened this issue Jan 2, 2025 · 8 comments
Labels
enhancement New feature or request

Comments

@miasik

miasik commented Jan 2, 2025

Describe your use-case.

I'm asking for support for training models with an integrated LongCLIP-L (246 effective tokens vs. 75):
https://arxiv.org/abs/2403.15378
I asked and was told that integrating LongCLIP-L is possible:
https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/6

What would you like to see as a solution?

As I see it, this could be a checkbox in "Text Encoder 1" with a description like "This is LongCLIP-L". When it's checked, OneTrainer cuts captions after 246 tokens instead of 75.
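
For illustration only (not part of the request itself), a rough sketch of what that switch could mean on the tokenizer side, using the standard openai/clip-vit-large-patch14 tokenizer and a hypothetical is_long_clip flag; 246/75 usable tokens plus the BOS/EOS tokens give 248/77 total positions:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical flag corresponding to the proposed "Text Encoder 1" checkbox
is_long_clip = True
max_length = 248 if is_long_clip else 77  # 246 vs. 75 usable tokens + BOS/EOS

tokens = tokenizer("a very long caption " * 100, truncation=True,
                   max_length=max_length, return_tensors="pt")
print(tokens.input_ids.shape)  # torch.Size([1, 248]) with the flag on, [1, 77] without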

Have you considered alternatives? List them here.

No response

@miasik miasik added the enhancement New feature or request label Jan 2, 2025
@Heasterian
Contributor

I have it working locally, but not in an upstreamable way, so I'll write down what I figured out along the way.

The LongCLIP files from this repo come as a whole CLIPModel by default, while OneTrainer uses CLIPTextModel (equivalent to CLIPModel.text_model).

If my commit https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/commit/59dd3e4d98acf93ef5093091981fe447e947ae1c is accepted, it will be easier to differentiate between CLIP and LongCLIP just from config.json and to set the proper max_length in modules/model for models using CLIP-L. For now, the pipeline won't run without changing the config or setting

text_encoder.max_position_embeddings = 248

or

text_encoder.text_config.max_position_embeddings = 248

depending on implementation.
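
For illustration, something like this (a rough sketch, not the actual OneTrainer code) overrides the limit before the text encoder is built; the second path applies when you hold a full CLIPModel/CLIPConfig instead of a CLIPTextModel:

from transformers import CLIPTextConfig, CLIPTextModel

# CLIPTextModel path: the limit sits directly on the text config
config = CLIPTextConfig.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
config.max_position_embeddings = 248
text_encoder = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14", config=config)

# Full CLIPModel path: the limit sits on the nested text config instead
# clip_config.text_config.max_position_embeddings = 248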

I have no idea how to cleanly differentiate between LongCLIP and CLIP when using a single-file safetensors checkpoint instead of the diffusers format.
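
The only hint in the file itself would be the shape of the stored position embeddings ([248, 768] for LongCLIP-L vs. [77, 768] for standard CLIP-L). An untested sketch, with the key name assumed for an SD 1.x single-file checkpoint:

from safetensors import safe_open

# Assumed key for an SD 1.x checkpoint; the exact name depends on the layout
KEY = "cond_stage_model.transformer.text_model.embeddings.position_embedding.weight"

with safe_open("model.safetensors", framework="pt") as f:
    max_positions = f.get_tensor(KEY).shape[0]

print("LongCLIP" if max_positions > 77 else "CLIP")  # 248 vs. 77 positions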

@Heasterian
Contributor

You can download LongCLIP in a form that should work out of the box using this Python code:

from transformers import CLIPTextModel, CLIPTokenizer

# Load only the text tower and tokenizer from the LongCLIP repo
tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
model = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")

# Save them as diffusers-style directories
tokenizer.save_pretrained("./tokenizer")
model.save_pretrained("./text_encoder")

Just download an SD 1.5 or Flux model in diffusers format and overwrite the model's text_encoder and tokenizer directories with the ones saved by the script.
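
For illustration, the overwrite step could look like this (my sketch; assumes the base model sits in diffusers format under a hypothetical ./sd15-diffusers directory and the LongCLIP files were saved by the script above):

import shutil

# Replace the base model's text encoder and tokenizer with the LongCLIP ones
for name in ("text_encoder", "tokenizer"):
    shutil.rmtree(f"./sd15-diffusers/{name}")
    shutil.copytree(f"./{name}", f"./sd15-diffusers/{name}")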

Then use the branch from the link below to get full 248-token-limit support.

https://github.com/Heasterian/OneTrainer/blob/LongClip/

I don't have Flux downloaded and tested, so let me know if it works as it should; the implementation there is a little different because of the two different types of encoders.

@miasik
Author

miasik commented Jan 4, 2025

I used Comfy to replace CLIP with LongCLIP in one of my models. The combined checkpoint was successfully saved and then loaded, but I got an error trying to render images with it.

[2025-01-04 00:11:27.017] CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[2025-01-04 00:11:27.122] !!! Exception during processing !!! Error(s) in loading state_dict for SD1ClipModel:
	size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
[2025-01-04 00:11:27.125] Traceback (most recent call last):
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 327, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 202, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 174, in _map_node_over_list
    process_inputs(input_dict, i)
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 163, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\nodes.py", line 568, in load_checkpoint
    out = comfy.sd.load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, embedding_directory=folder_paths.get_folder_paths("embeddings"))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 826, in load_checkpoint_guess_config
    out = load_state_dict_guess_config(sd, output_vae, output_clip, output_clipvision, embedding_directory, output_model, model_options, te_model_options=te_model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 881, in load_state_dict_guess_config
    m, u = clip.load_sd(clip_sd, full_model=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 228, in load_sd
    return self.cond_stage_model.load_state_dict(sd, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for SD1ClipModel:
	size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).

@Heasterian
Contributor

Heasterian commented Jan 4, 2025

Well, you are loading the model from a single file, not the diffusers format I mentioned. With safetensors the code falls back to 77 tokens, as the config does not include any info about max position embeddings.

Does Comfy save a .yaml file alongside the safetensors? If yes, send it here.

@miasik
Author

miasik commented Jan 4, 2025

Yeah, I know the word "diffusers", but I'm not sure I'm able to work with it.
No, Comfy doesn't save a yaml.
I've just found that Comfy has an extension to work with diffusers, but I can't try it right now:
https://github.com/Limitex/ComfyUI-Diffusers?tab=readme-ov-file
:-(

@Heasterian
Contributor

You can convert a model to the diffusers format using the tool from the Tools tab in OneTrainer.

@Heasterian
Contributor

Just overwrite text_encoder and tokenizer in the resulting directory, as I said here: #624 (comment)

@Heasterian
Contributor

> I used Comfy to replace CLIP with LongCLIP in one of my models. The combined checkpoint was successfully saved and then loaded, but I got an error trying to render images with it. [...]

I just saw that this is about Comfy not loading this model, not OneTrainer. You should open an issue on the Comfy repo about this.
