Training is very slow, and even slower when using more GPUs? #107

huangjch526 · 2024-09-04T05:25:53Z

I use the default train.sh as follows:
accelerate launch --mixed_precision='bf16' scripts/train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATASET_NAME \
  --train_data_meta=$DATASET_META_NAME \
  --config_path "config/easyanimate_video_slicevae_multi_text_encoder_v4.yaml" \
  --image_sample_size=512 \
  --video_sample_size=512 \
  --token_sample_size=512 \
  --video_sample_stride=1 \
  --video_sample_n_frames=144 \
  --train_batch_size=1 \
  --video_repeat=1 \
  --gradient_accumulation_steps=1 \
  --dataloader_num_workers=8 \
  --num_train_epochs=100 \
  --checkpointing_steps=500 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --seed=42 \
  --output_dir="output_dir/ft_0.1Mv" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --adam_epsilon=1e-10 \
  --vae_mini_batch=1 \
  --max_grad_norm=0.05 \
  --random_hw_adapt \
  --training_with_video_token_length \
  --motion_sub_loss \
  --not_sigma_loss \
  --random_frame_crop \
  --enable_bucket \
  --train_mode="inpaint" \
  --trainable_modules "."

With eight A100 80G GPUs and a batch size of 1 per GPU, training runs at 15 s/it, which feels very slow.
With a single A100 80G and a batch size of 1, it runs at 4 s/it, which feels very fast.

Using eight A100 80G: backward loss time: 5.38s, max_grad_norm time: 4.36s
Using one A100 80G: backward loss time: 1.27s, max_grad_norm time: 0.04s

Is this normal? Is there any way to make it faster?

bubbliiiing · 2024-09-04T06:20:31Z

Training uses random cropping, and the dataset mixes long and short videos. On a single GPU, a sampled clip may be very short, for example 512x512x16, in which case that step is quite fast. With multiple GPUs, it is likely that in any given step at least one GPU samples a full 512x512x144 clip. Because the GPUs synchronize gradients at every step, the whole step then runs at the speed of the slowest GPU, which makes the average iteration slower.
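The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of EasyAnimate's actual sampler: the candidate frame counts in `frame_choices`, the uniform sampling, and the idea that step cost scales with the longest clip are all hypothetical simplifications, and the function names are made up for illustration.

```python
import random


def prob_any_longest(num_gpus, p_longest):
    """Probability that at least one GPU draws the longest clip in a step,
    assuming each GPU draws independently with probability p_longest."""
    return 1.0 - (1.0 - p_longest) ** num_gpus


def expected_step_frames(num_gpus, frame_choices, trials=10_000, seed=0):
    """Average, over many simulated steps, of the longest clip drawn by any GPU.

    Assumption: a synchronized step waits for the slowest GPU, so its cost
    tracks the maximum frame count across GPUs.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += max(rng.choice(frame_choices) for _ in range(num_gpus))
    return total / trials


# Hypothetical clip lengths between the 16 and 144 frames mentioned in the thread.
frame_choices = [16, 48, 80, 112, 144]

single = expected_step_frames(1, frame_choices)
eight = expected_step_frames(8, frame_choices)
print(f"mean longest clip per step, 1 GPU:  {single:.1f} frames")
print(f"mean longest clip per step, 8 GPUs: {eight:.1f} frames")
print(f"chance a step includes a 144-frame clip on 8 GPUs: "
      f"{prob_any_longest(8, 1 / len(frame_choices)):.2f}")
```

Under these assumptions, the per-step maximum on eight GPUs sits much closer to the longest clip than the single-GPU average does, which is consistent with the slowdown reported in the question. Throughput per step can still be higher on eight GPUs, since each step processes eight samples.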
