[Feat]: Long Clip support #624

Open
miasik opened this issue Jan 2, 2025 · 8 comments
Labels
enhancement New feature or request

Comments

@miasik

miasik commented Jan 2, 2025

Describe your use-case.

I'm asking for support for training models with an integrated LongCLIP-L (246 effective tokens vs. 75):
https://arxiv.org/abs/2403.15378
I asked and was told that integrating LongCLIP-L is possible:
https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/6

What would you like to see as a solution?

As I see it, this could be a checkbox in "Text Encoder 1" with a description like "This is LongCLIP-L". When it's checked, OneTrainer cuts captions after 246 tokens instead of 75.
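
For illustration only (not part of the request itself), a rough sketch of what that switch could mean on the tokenizer side, using the standard openai/clip-vit-large-patch14 tokenizer and a hypothetical is_long_clip flag; 246/75 usable tokens plus the BOS/EOS tokens give 248/77 total positions:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical flag corresponding to the proposed "Text Encoder 1" checkbox
is_long_clip = True
max_length = 248 if is_long_clip else 77  # 246 vs. 75 usable tokens + BOS/EOS

tokens = tokenizer("a very long caption " * 100, truncation=True,
                   max_length=max_length, return_tensors="pt")
print(tokens.input_ids.shape)  # torch.Size([1, 248]) with the flag on, [1, 77] without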

Have you considered alternatives? List them here.

No response

@miasik miasik added the enhancement New feature or request label Jan 2, 2025
@Heasterian
Contributor

I have it working locally, but not in an upstreamable way, so I'll write down what I figured out along the way.

The LongCLIP files from this repo come as a whole CLIPModel by default, while OneTrainer uses CLIPTextModel (equivalent to CLIPModel.text_model).

If my commit https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/commit/59dd3e4d98acf93ef5093091981fe447e947ae1c is accepted, it will be easier to differentiate between CLIP and LongCLIP just from config.json and to set the proper max_length in modules/model for models using CLIP-L. For now, the pipeline won't run without changing the config or setting

text_encoder.max_position_embeddings = 248

or

text_encoder.text_config.max_position_embeddings = 248

depending on implementation.
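
For illustration, something like this (a rough sketch, not the actual OneTrainer code) overrides the limit before the text encoder is built; the second path applies when you hold a full CLIPModel/CLIPConfig instead of a CLIPTextModel:

from transformers import CLIPTextConfig, CLIPTextModel

# CLIPTextModel path: the limit sits directly on the text config
config = CLIPTextConfig.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
config.max_position_embeddings = 248
text_encoder = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14", config=config)

# Full CLIPModel path: the limit sits on the nested text config instead
# clip_config.text_config.max_position_embeddings = 248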

I have no idea how to cleanly differentiate between LongCLIP and CLIP when using a single-file safetensors checkpoint instead of the diffusers format.
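
The only hint in the file itself would be the shape of the stored position embeddings ([248, 768] for LongCLIP-L vs. [77, 768] for standard CLIP-L). An untested sketch, with the key name assumed for an SD 1.x single-file checkpoint:

from safetensors import safe_open

# Assumed key for an SD 1.x checkpoint; the exact name depends on the layout
KEY = "cond_stage_model.transformer.text_model.embeddings.position_embedding.weight"

with safe_open("model.safetensors", framework="pt") as f:
    max_positions = f.get_tensor(KEY).shape[0]

print("LongCLIP" if max_positions > 77 else "CLIP")  # 248 vs. 77 positions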

@Heasterian
Contributor

You can download LongCLIP in a form that should work out of the box using this Python code:

from transformers import CLIPTextModel, CLIPTokenizer

# Load only the text tower and tokenizer from the LongCLIP repo
tokenizer = CLIPTokenizer.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")
model = CLIPTextModel.from_pretrained("zer0int/LongCLIP-GmP-ViT-L-14")

# Save them as diffusers-style directories
tokenizer.save_pretrained("./tokenizer")
model.save_pretrained("./text_encoder")

Just download an SD 1.5 or Flux model in diffusers format and overwrite the model's text_encoder and tokenizer directories with the ones saved by the script.
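
For illustration, the overwrite step could look like this (my sketch; assumes the base model sits in diffusers format under a hypothetical ./sd15-diffusers directory and the LongCLIP files were saved by the script above):

import shutil

# Replace the base model's text encoder and tokenizer with the LongCLIP ones
for name in ("text_encoder", "tokenizer"):
    shutil.rmtree(f"./sd15-diffusers/{name}")
    shutil.copytree(f"./{name}", f"./sd15-diffusers/{name}")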

Then use the branch from the link below to get full 248-token-limit support.

https://github.com/Heasterian/OneTrainer/blob/LongClip/

I don't have Flux downloaded and tested, so let me know if it works as it should; the implementation there is a little different because of the two different types of encoders.

@miasik
Author

miasik commented Jan 4, 2025

I used Comfy to replace CLIP with LongCLIP in one of my models. The combined checkpoint was successfully saved and then loaded, but I got an error trying to render images with it.

[2025-01-04 00:11:27.017] CLIP model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[2025-01-04 00:11:27.122] !!! Exception during processing !!! Error(s) in loading state_dict for SD1ClipModel:
	size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).
[2025-01-04 00:11:27.125] Traceback (most recent call last):
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 327, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 202, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 174, in _map_node_over_list
    process_inputs(input_dict, i)
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\execution.py", line 163, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\nodes.py", line 568, in load_checkpoint
    out = comfy.sd.load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, embedding_directory=folder_paths.get_folder_paths("embeddings"))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 826, in load_checkpoint_guess_config
    out = load_state_dict_guess_config(sd, output_vae, output_clip, output_clipvision, embedding_directory, output_model, model_options, te_model_options=te_model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 881, in load_state_dict_guess_config
    m, u = clip.load_sd(clip_sd, full_model=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 228, in load_sd
    return self.cond_stage_model.load_state_dict(sd, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "g:\StableDiffusion\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for SD1ClipModel:
	size mismatch for clip_l.transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).

@Heasterian
Contributor

Heasterian commented Jan 4, 2025

Well, you are loading the model from a single file, not the diffusers format I mentioned. With safetensors the code falls back to 77 tokens, as the config does not include any info about max position embeddings.

Does Comfy save a .yaml file alongside the safetensors? If yes, send it here.

@miasik
Author

miasik commented Jan 4, 2025

Yeah, I know the word "diffusers", but I'm not sure I'm able to work with it.
No, Comfy doesn't save a yaml.
I've just found that Comfy has an extension to work with diffusers, but I can't try it right now:
https://github.com/Limitex/ComfyUI-Diffusers?tab=readme-ov-file
:-(

@Heasterian
Contributor

You can convert a model to the diffusers format using the tool from the Tools tab in OneTrainer.

@Heasterian
Contributor

Just overwrite text_encoder and tokenizer in the resulting directory, as I said here: #624 (comment)

@Heasterian
Contributor

> I used Comfy to replace CLIP with LongCLIP in one of my models. The combined checkpoint was successfully saved and then loaded, but I got an error trying to render images with it. [...]

I just saw that this is about Comfy not loading this model, not OneTrainer. You should open an issue on the Comfy repo about this.
