
[training] CogVideoX Lora #9302

Merged: 68 commits merged into main on Sep 19, 2024

Conversation

@a-r-r-o-w (Member) commented Aug 28, 2024

What does this PR do?

Adds LoRA training and loading support for CogVideoX.

This is a rough draft and incomplete conversion from CogVideoX SAT.

#!/bin/bash

export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1

GPU_IDS="3"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir <CACHE_DIR> \
  --instance_data_root <DATASET_ROOT_DIR> \
  --caption_column <CAPTION_COLUMN> \
  --video_column <VIDEO_COLUMN> \
  --id_token <ID_TOKEN> \
  --validation_prompt "<ID_TOKEN> A black and white animated scene unfolds, featuring a bulldog in overalls and a hat, standing on a ship's deck. The bulldog assumes various poses, then walks towards a dockside with two ducks and a cow. A wooden platform reads 'PODUNK LANDING,' while a building marked 'BOAT TICKETS' and scattered barrels hint at a destination. The bulldog and ducks move purposefully, possibly heading towards a food stand or boating services, amidst a monochromatic backdrop with no noticeable changes in environment or lighting:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 40 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb

The above assumes a 50-video dataset (2000 training steps in total).
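
As a quick sanity check on that number, a minimal sketch of the step arithmetic (values taken from the launch flags above):

num_videos = 50
train_batch_size = 1
gradient_accumulation_steps = 1
num_train_epochs = 40

steps_per_epoch = num_videos // (train_batch_size * gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 2000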

TODO:

  • Implement tiled encoding (currently OOMs for Cog-5B but works for Cog-2B; see the VAE sketch after this list)
  • Test with Prodigy optimizer
  • Determine best data preparation format and make the process more clean
  • Prepare dummy test data repository for others to test (Edit: Available internally on our org. No public release from diffusers team on this at the moment)
  • Remove unnecessary parameters
  • Verify outputs against the SAT implementation (they don't match 1:1, possibly for many reasons)
  • Add lora tests
  • Docs
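
For reference, the --enable_slicing and --enable_tiling flags in the launch command map to the CogVideoX VAE's memory-saving switches. A minimal sketch of toggling them directly (assuming the released THUDM/CogVideoX-2b checkpoint layout; not part of this PR's diff):

from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae")
vae.enable_slicing()  # encode/decode one batch element at a time
vae.enable_tiling()   # process frames in spatial tiles to reduce peak memory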

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul @yiyixuxu @linoytsaban

cc @zRzRzRzRzRzRzR @bghira

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@G-U-N (Contributor) commented Sep 2, 2024

Hi @a-r-r-o-w, did you achieve any satisfactory results in the current version? I tried the code on my machine but got broken generation results after just dozens of iterations.

@G-U-N (Contributor) commented Sep 2, 2024

The first issue I noticed is that the re-parameterization was wrong.
After checking the official repo, I think it should be

target = model_input

# Recover the sample prediction from the v-prediction:
# x0 = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * v
alphas_cumprod = scheduler.alphas_cumprod.to(model_pred.device, model_pred.dtype)
alphas_cumprod_sqrt = alphas_cumprod[timesteps] ** 0.5
c_skip = alphas_cumprod_sqrt
c_out = -((1 - alphas_cumprod_sqrt**2) ** 0.5)
while len(c_skip.shape) < len(model_pred.shape):
    c_skip = c_skip.unsqueeze(-1)
while len(c_out.shape) < len(model_pred.shape):
    c_out = c_out.unsqueeze(-1)

weights = 1 / (1 - alphas_cumprod_sqrt**2)
while len(weights.shape) < len(model_pred.shape):
    weights = weights.unsqueeze(-1)

# (Per the follow-up below, model_input here was a typo for noisy_model_input.)
model_pred = c_out * model_pred + c_skip * model_input

But after fixing it, I still got broken results after around 200 iterations. Any advice would be appreciated.
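
For reference, a minimal numerical check (a sketch; tensor shapes simplified to 1D) that the c_skip/c_out form above agrees with the sqrt-based form quoted later in this thread:

import torch

a = torch.rand(4)      # alphas_cumprod[timesteps]
x_t = torch.randn(4)   # noisy_model_input
v = torch.randn(4)     # model_output (v-prediction)

c_skip = a.sqrt()
c_out = -((1 - a).sqrt())
lhs = c_out * v + c_skip * x_t             # this comment's form
rhs = x_t * a.sqrt() - v * (1 - a).sqrt()  # the form in the reply below
assert torch.allclose(lhs, rhs)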

@yiyixuxu (Collaborator) commented Sep 2, 2024

cc @bghira here in case you have interest and time to help a little bit with CogVideoX lora! (no worries if not!)

@bghira (Contributor) commented Sep 2, 2024

Can you plot some of the values during inference that work, and then compare them to training?

@G-U-N (Contributor) commented Sep 3, 2024

Great @yiyixuxu @bghira, I am willing to assist if there is any need.

@a-r-r-o-w (Member Author) commented

The first issue I noticed is the re-parameterization was wrong.

Hey, thanks a lot for noticing this! I haven't been able to generate any good results either. I actually have the following locally:

model_output = transformer(
    hidden_states=noisy_model_input,
    encoder_hidden_states=prompt_embeds,
    timestep=timesteps,
    image_rotary_emb=image_rotary_emb,
    return_dict=False,
)[0]
alphas_cumprod = scheduler.alphas_cumprod[timesteps]
alphas_cumprod_sqrt = alphas_cumprod ** 0.5
one_minus_alphas_cumprod_sqrt = (1 - alphas_cumprod) ** 0.5
model_pred = noisy_model_input * alphas_cumprod_sqrt - model_output * one_minus_alphas_cumprod_sqrt

Should it be model_input * alphas_cumprod_sqrt - ... here instead of noisy_model_input? From the original codebase, I think noisy_model_input is correct here, but it doesn't work yet, possibly due to a different bug.

@G-U-N (Contributor) commented Sep 3, 2024

Oops, sorry. I made a typo. It should be noisy_model_input. @a-r-r-o-w
I am going to test it in my code and report my training results to you.

@sayakpaul (Member) left a comment

Left some comments. Let's maybe also add a test case first to quickly identify potential suspects?

@sayakpaul (Member) left a comment

Left some questions here and there (some of them are clarification questions, so bear with me).

[8 resolved review threads on examples/cogvideo/train_cogvideox_lora.py]
@G-U-N (Contributor) commented Sep 3, 2024

Here's a quick test. I used a single-frame video (video_length = 1) for tuning with batch size 1 and a learning rate of 1e-3. After training for 500 iterations, the model can reproduce the trained frame, yet the generation results break down at intermediate iterations. I also trained the LoRA on the same but longer video and observed similar results. When I load the trained LoRA and generate new videos, it can still follow the prompt, but the outputs suffer from quality degradation.

Validation outputs
0_validation_video_0_The_video_features_a_man_.mp4
40_validation_video_0_The_video_features_a_man_.mp4
80_validation_video_0_The_video_features_a_man_.mp4
100_validation_video_0_The_video_features_a_man_.mp4
120_validation_video_0_The_video_features_a_man_.mp4
160_validation_video_0_The_video_features_a_man_.mp4
200_validation_video_0_The_video_features_a_man_.mp4
240_validation_video_0_The_video_features_a_man_.mp4
280_validation_video_0_The_video_features_a_man_.mp4
320_validation_video_0_The_video_features_a_man_.mp4
360_validation_video_0_The_video_features_a_man_.mp4
400_validation_video_0_The_video_features_a_man_.mp4
440_validation_video_0_The_video_features_a_man_.mp4

@a-r-r-o-w
Copy link
Member Author

I am facing similar issues too when overfitting on a single example, even after 1000 steps. Will try to take another deep look sometime soon, but AFAICT there don't seem to be any more differences.

I've left a comment asking some questions. From the different discussions, I gather that ~100 videos and 4000+ steps seem to be ideal for finetuning. This seems very different from normal Dreambooth-like finetuning, tbh, where just a few examples would be enough to teach new concepts.

Maybe @zRzRzRzRzRzRzR @tengjiayan20 can hopefully take a look and help here.

@G-U-N (Contributor) commented Sep 3, 2024

(Quoting @a-r-r-o-w's reply above.)

Very insightful comment, @a-r-r-o-w. Thanks for the reply.

@FDInSky commented Sep 4, 2024

(Quoting @a-r-r-o-w's reparameterization reply above.)

I encountered an error here; is there a solution for it? Thanks:
RuntimeError: The size of tensor a (90) must match the size of tensor b (2) at non-singleton dimension 4

@a-r-r-o-w (Member Author) commented

(Quoting the RuntimeError above.)

Neither I nor G-U-N seem to get this error. Could you provide more context? What flags are you using when launching the script? Which specific line does it fail on? Have you modified the script in any way?

@FDInSky commented Sep 4, 2024


What is the shape of the tensor alphas_cumprod? Thanks.

@a-r-r-o-w (Member Author) commented


scheduler.alphas_cumprod has shape (1000,). alphas_cumprod in the training script is indexed using timesteps, which has shape (train_batch_size,), so that should be its shape too. I've only experimented with train_batch_size=1. Are you using a higher value by any chance?

@FDInSky commented Sep 4, 2024


I use batch_size = 2; that may be the problem. Thanks.
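
For anyone hitting the same error, a minimal sketch of the usual PyTorch fix (an assumption about the cause: the (train_batch_size,)-shaped scalars do not broadcast against the 5D latents when train_batch_size > 1):

# Append singleton dims so the per-sample scalars broadcast over (B, F, C, H, W).
alphas_cumprod = scheduler.alphas_cumprod.to(noisy_model_input.device)[timesteps]
alphas_cumprod_sqrt = alphas_cumprod ** 0.5  # shape: (B,)
while alphas_cumprod_sqrt.dim() < noisy_model_input.dim():
    alphas_cumprod_sqrt = alphas_cumprod_sqrt.unsqueeze(-1)  # -> (B, 1, 1, 1, 1)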

help="Number of frames to skip from the end of each input video. Useful if training data contains outro sequences.",
)
parser.add_argument(
"--random_flip",
@a-r-r-o-w (Member Author):

I think it's not used yet; TODO to support it in a follow-up PR.


# Downloading and loading a dataset from the hub. See more about loading custom images at
# https://huggingface.co/docs/datasets/v2.0.0/en/dataset_script
dataset = load_dataset(
@a-r-r-o-w (Member Author):
The load_dataset method is not too good here due to its lack of support for video data. Similar to what I did in the lora testing script, I think supporting snapshot_download from the Hub would be nice to have and easier.
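
A minimal sketch of that alternative (the repo id is a placeholder, not a real dataset):

from huggingface_hub import snapshot_download

dataset_dir = snapshot_download(repo_id="<DATASET_REPO_ID>", repo_type="dataset")
# dataset_dir now points to a local folder of videos and captions that the
# training script can index directly.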

return parser.parse_args()


class VideoDataset(Dataset):
@a-r-r-o-w (Member Author):
TODOs:

  • Support loading latents directly instead of videos (see the sketch after this list)
  • Create a prepare_dataset.py for preprocessing data, and possibly having captioning utilities
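
A rough sketch of the first TODO (a hypothetical helper, not part of this PR; assumes the CogVideoX VAE's (B, C, F, H, W) input layout and its scaling_factor config):

import torch

@torch.no_grad()
def encode_video_to_latents(vae, video):
    # video: (F, C, H, W), values in [-1, 1]
    video = video.unsqueeze(0).permute(0, 2, 1, 3, 4)  # -> (B, C, F, H, W)
    latents = vae.encode(video).latent_dist.sample() * vae.config.scaling_factor
    return latents.squeeze(0)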


def _preprocess_data(self):
try:
import decord
@a-r-r-o-w (Member Author):
TODO: Maybe better to add as a backend for load_video in the future.
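
A minimal sketch of decord as a frame-reading backend (assumed installed via `pip install decord`; <VIDEO_PATH> is a placeholder):

import decord

reader = decord.VideoReader("<VIDEO_PATH>")
frames = reader.get_batch(list(range(len(reader)))).asnumpy()  # (F, H, W, C), uint8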

[2 resolved review threads on examples/cogvideo/train_cogvideox_lora.py]
[`CogVideoX`].
"""

_lora_loadable_modules = ["transformer", "text_encoder"]
@a-r-r-o-w (Member Author):
Suggested change:
- _lora_loadable_modules = ["transformer", "text_encoder"]
+ _lora_loadable_modules = ["transformer"]

For now, since we removed text-encoder training, we need to remove everything related to it in the lora loader.

Collaborator:

OK, so we should remove text_encoder from lora in this PR?

@a-r-r-o-w (Member Author):

So, I tried removing it, but this causes test failures in ~10 different places, so I chose not to do it for now since it would require significant modification of many tests. I think it's okay to leave it here in case someone manages to fine-tune the text encoder and wants to use it (eventually we can add support too).

Collaborator:

I don't think we should add this code unless it is needed, though. I just went through all the LoraLoaderMixins here; I think we currently do not support T5 at all. cc @sayakpaul to confirm whether that is the case.

@a-r-r-o-w (Member Author):

Yes, I came to the same conclusion. I'm actually working on removing the text encoder parts at the moment, so I will update in a bit.

Need to fight a few more tests 👊

Member:

Yeah, the community doesn't really do T5 at the moment, so we don't support it. No extremely popular LoRA has T5 (at least as far as @apolinario and I know). But supporting it is no big deal, really.

[2 resolved review threads on tests/lora/utils.py]
@@ -690,14 +708,21 @@ def test_simple_inference_with_text_denoiser_lora_and_scale(self):
scheduler_classes = (
[FlowMatchEulerDiscreteScheduler] if self.uses_flow_matching else [DDIMScheduler, LCMScheduler]
)
call_signature_keys = inspect.signature(self.pipeline_class.__call__).parameters.keys()
Collaborator:

This can be a property (a child class can override it too). OK to keep it as is in the test here.

[2 resolved review threads on tests/lora/utils.py]
@yiyixuxu (Collaborator) left a comment

Thanks! And congrats on winning the fights against the lora test!

[1 resolved review thread on tests/lora/utils.py]
@a-r-r-o-w a-r-r-o-w merged commit 2b443a5 into main Sep 19, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the cogvideox-lora-and-training branch September 19, 2024 09:08
@963658029 (Contributor) commented

Why doesn't the code run the following two lines after calculating the loss?
avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
train_loss += avg_loss.item() / args.gradient_accumulation_steps
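
For context, a commented sketch of what those two lines do in other diffusers example scripts (assuming the same accelerate setup as this script):

# Gather the per-process losses so the logged value is the mean across all
# devices rather than just the local process.
avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
# Accumulate for logging, scaled so the sum over an accumulation window
# matches the loss of one effective optimizer step.
train_loss += avg_loss.item() / args.gradient_accumulation_steps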

leisuzz pushed a commit to leisuzz/diffusers that referenced this pull request Oct 11, 2024
* cogvideox lora training draft

* update

* update

* update

* update

* update

* make fix-copies

* update

* update

* apply suggestions from review

* apply suggestions from reveiw

* fix typo

* Update examples/cogvideo/train_cogvideox_lora.py

Co-authored-by: YiYi Xu <[email protected]>

* fix lora alpha

* use correct lora scaling for final test pipeline

* Update examples/cogvideo/train_cogvideox_lora.py

Co-authored-by: YiYi Xu <[email protected]>

* apply suggestions from review; prodigy optimizer

Co-authored-by: YiYi Xu <[email protected]>

* add tests

* make style

* add README

* update

* update

* make style

* fix

* update

* add test skeleton

* revert lora utils changes

* add cleaner modifications to lora testing utils

* update lora tests

* deepspeed stuff

* add requirements.txt

* deepspeed refactor

* add lora stuff to img2vid pipeline to fix tests

* fight tests

* add co-authors

Co-Authored-By: Fu-Yun Wang <[email protected]>

Co-Authored-By: zR <[email protected]>

* fight lora runner tests

* import Dummy optim and scheduler only wheh required

* update docs

* add coauthors

Co-Authored-By: Fu-Yun Wang <[email protected]>

* remove option to train text encoder

Co-Authored-By: bghira <[email protected]>

* update tests

* fight more tests

* update

* fix vid2vid

* fix typo

* remove lora tests; todo in follow-up PR

* undo img2vid changes

* remove text encoder related changes in lora loader mixin

* Revert "remove text encoder related changes in lora loader mixin"

This reverts commit f8a8444.

* update

* round 1 of fighting tests

* round 2 of fighting tests

* fix copied from comment

* fix typo in lora test

* update styling

Co-Authored-By: YiYi Xu <[email protected]>

---------

Co-authored-by: YiYi Xu <[email protected]>
Co-authored-by: zR <[email protected]>
Co-authored-by: Fu-Yun Wang <[email protected]>
Co-authored-by: bghira <[email protected]>