Free guidance: IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0 #117

Open
pearbender opened this issue May 13, 2024 · 3 comments

@pearbender (Contributor)

Line 180 here (in models/mutual_self_attention.py) fails in stage 1 training.

if do_classifier_free_guidance:
    hidden_states_c = hidden_states_uc.clone()
    _uc_mask = uc_mask.clone()
    if hidden_states.shape[0] != _uc_mask.shape[0]:
        _uc_mask = (
            torch.Tensor(
                [1] * (hidden_states.shape[0] // 2)
                + [0] * (hidden_states.shape[0] // 2)
            )
            .to(device)
            .bool()
        )
    hidden_states_c[_uc_mask] = (
        self.attn1(
            norm_hidden_states[_uc_mask],
            encoder_hidden_states=norm_hidden_states[_uc_mask],
            attention_mask=attention_mask,
        )
        + hidden_states[_uc_mask]
    )
    hidden_states = hidden_states_c.clone()

I had to set

do_classifier_free_guidance = False

to prevent this error, but I do not know how this will affect the result.
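
For context: when the batch size is 1, the fallback mask built above is empty, since [1] * (1 // 2) + [0] * (1 // 2) == [], which matches the shapes in the error message. A minimal standalone sketch of the failure mode (hypothetical example, not project code):

import torch

hidden_states = torch.randn(1, 9216, 320)  # batch of 1, as in the traceback

# [1] * (1 // 2) + [0] * (1 // 2) == [], so the mask ends up with shape [0]
_uc_mask = torch.Tensor(
    [1] * (hidden_states.shape[0] // 2) + [0] * (hidden_states.shape[0] // 2)
).bool()

print(_uc_mask.shape)    # torch.Size([0])
hidden_states[_uc_mask]  # IndexError: mask [0] does not match tensor [1, 9216, 320] at index 0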

Here is my terminal log.

(env) C:\Users\user\code\champ>accelerate launch train_s1.py --config configs/train/stage1.yaml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
05/13/2024 12:50:07 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'force_upcast', 'scaling_factor'} was not found in config. Values will be initialized to default values.
{'mid_block_only_cross_attention', 'addition_time_embed_dim', 'cross_attention_norm', 'class_embeddings_concat', 'reverse_transformer_layers_per_block', 'encoder_hid_dim', 'class_embed_type', 'num_attention_heads', 'encoder_hid_dim_type', 
'projection_class_embeddings_input_dim', 'addition_embed_type_num_heads', 'addition_embed_type', 'dropout', 'resnet_time_scale_shift', 'time_cond_proj_dim', 'time_embedding_act_fn', 'resnet_out_scale_factor', 'dual_cross_attention', 'only_cross_attention', 'resnet_skip_time_act', 'conv_out_kernel', 'transformer_layers_per_block', 'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'conv_in_kernel', 'timestep_post_act', 'time_embedding_type', 'attention_type', 'mid_block_type', 'time_embedding_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
05/13/2024 12:50:17 - INFO - models.unet_3d - loaded temporal unet's pretrained weights from pretrained_models\stable-diffusion-v1-5\unet ...
{'motion_module_mid_block', 'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'class_embed_type', 'motion_module_type', 'dual_cross_attention', 'only_cross_attention', 'motion_module_decoder_only', 'motion_module_kwargs', 'resnet_time_scale_shift', 'motion_module_resolutions'} was not found in config. Values will be initialized to default values.
05/13/2024 12:50:20 - INFO - models.unet_3d - Loaded 0.0M-parameter motion module
05/13/2024 12:50:25 - INFO - __main__ - Start training ...
05/13/2024 12:50:25 - INFO - __main__ - Num Samples: 1
05/13/2024 12:50:25 - INFO - __main__ - Train Batchsize: 1
05/13/2024 12:50:25 - INFO - __main__ - Num Epochs: 100000
05/13/2024 12:50:25 - INFO - __main__ - Total Steps: 100000
Steps:   0%|                                                                                                                                                                                           | 1/100000 [00:33<940:12:12, 33.85s/it]05/13/2024 12:51:00 - INFO - __main__ - Running validation ...
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [18:59<00:00, 57.00s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.56s/it]
Steps:   0%|                                                                                                                                                         | 1/100000 [20:21<940:12:12, 33.85s/it, lr=1e-5, stage=1, step_loss=1.48]Traceback (most recent call last):
  File "C:\Users\user\code\champ\train_s1.py", line 675, in <module>
    main(config)
  File "C:\Users\user\code\champ\train_s1.py", line 495, in main
    model_pred = model(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\utils\operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\utils\operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\user\code\champ\models\champ_model.py", line 63, in forward
    model_pred = self.denoising_unet(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\unet_3d.py", line 493, in forward
    sample, res_samples = downsample_block(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\unet_3d_blocks.py", line 442, in forward
    hidden_states = attn(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\transformer_3d.py", line 141, in forward
    hidden_states = block(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\mutual_self_attention.py", line 181, in hacked_basic_transformer_inner_forward
    norm_hidden_states[_uc_mask],
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0
Steps:   0%|                                                                                                                                                     | 1/100000 [20:36<34353:38:21, 1236.74s/it, lr=1e-5, stage=1, step_loss=1.48]
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\code\champ\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\launch.py", line 979, in launch_command
    simple_launcher(args)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\user\\code\\champ\\env\\Scripts\\python.exe', 'train_s1.py', '--config', 'configs/train/stage1.yaml']' returned non-zero exit status 1.

Here is my stage1.yaml.

exp_name: 'stage1'
output_dir: './exp_output'
seed: 42
resume_from_checkpoint: ''

checkpointing_steps: 2000
save_model_epoch_interval: 20

data:
  train_bs: 1
  video_folder: './training_data' # Your data root folder
  guids: 
    - 'depth'
    - 'normal'
    - 'semantic_map'
    - 'dwpose'
  image_size: 768
  bbox_crop: false
  bbox_resize_ratio: [0.9, 1.5]
  aug_type: "Resize"
  data_parts:
    - "all"
  sample_margin: 30

validation:
  validation_steps: 1000
  ref_images:
    - ./reference_imgs/images/ref-01.png
  guidance_folders:
    - ./training_data/1feec204f03a1a779085107b375df72a
  guidance_indexes: [0, 30, 60, 90, 120]            

solver:
  gradient_accumulation_steps: 1
  mixed_precision: 'fp16'
  enable_xformers_memory_efficient_attention: True 
  gradient_checkpointing: False 
  max_train_steps: 100000  # 50000
  max_grad_norm: 1.0
  # lr
  learning_rate: 1.0e-5
  scale_lr: False 
  lr_warmup_steps: 1
  lr_scheduler: 'constant'

  # optimizer
  use_8bit_adam: False 
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_weight_decay:  1.0e-2
  adam_epsilon: 1.0e-8

noise_scheduler_kwargs:
  num_train_timesteps: 1000
  beta_start:          0.00085
  beta_end:            0.012
  beta_schedule:       "scaled_linear"
  steps_offset:        1
  clip_sample:         false

guidance_encoder_kwargs:
  guidance_embedding_channels: 320
  guidance_input_channels: 3
  block_out_channels: [16, 32, 96, 256]

base_model_path: 'pretrained_models/stable-diffusion-v1-5'
vae_model_path: 'pretrained_models/sd-vae-ft-mse'
image_encoder_path: 'pretrained_models/image_encoder'

weight_dtype: 'fp16'  # [fp16, fp32]
uncond_ratio: 0.1
noise_offset: 0.05
snr_gamma: 5.0
enable_zero_snr: True 
@Leoooo333 (Member)

Hi @pearbender, you actually don't need to set do_classifier_free_guidance to true during training, even if you want to enable CFG.

During training, classifier-free guidance works by randomly dropping the conditional input at the rate set by uncond_ratio: 0.1. You can set the ratio to 0 if you want to disable CFG training.

At inference time, set do_classifier_free_guidance=True to enable CFG. You may also find cfg_scale helpful.
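
For illustration, a minimal sketch of what training-time conditioning dropout at uncond_ratio could look like (names are hypothetical; the actual logic lives in the training script):

import random

import torch

uncond_ratio = 0.1  # from stage1.yaml

def maybe_drop_condition(image_embeds: torch.Tensor) -> torch.Tensor:
    # With probability uncond_ratio, zero out the conditioning embedding so the
    # model also learns the unconditional branch that CFG needs at inference.
    if random.random() < uncond_ratio:
        return torch.zeros_like(image_embeds)
    return image_embeds

At inference, the two branches are then combined in the usual CFG form, pred_uncond + cfg_scale * (pred_cond - pred_uncond).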

@pearbender (Contributor, Author)

@Leoooo333 Currently, during stage 1 training, do_classifier_free_guidance is True by default, which causes the error I posted. If it is OK to set it to false during stage 1 training, then the code should be changed, right?
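
For example, one hypothetical guard (not an actual patch to the repo) would be to take the CFG branch only when the batch really holds conditional/unconditional pairs, so the mask can never be empty:

# Hypothetical guard in the hacked forward; names follow the snippet above.
use_cfg_branch = do_classifier_free_guidance and hidden_states.shape[0] >= 2
if use_cfg_branch:
    hidden_states_c = hidden_states_uc.clone()
    ...  # rest of the CFG branch unchanged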

@Beijia11 commented Jul 1, 2024

Hi, I have also hit this error in stage 1 training. Is it OK to set do_classifier_free_guidance to false during stage 1 training? @pearbender
