Replies: 1 comment
After working with it more, I realize it is not an issue. No problem!
-
I was experimenting with two different GPUs using the same settings, aside from increasing the batch size to make use of the larger VRAM.
My assumption was that increasing the batch size would allow more work to be done in parallel, potentially reducing training time.
What I found is that the results differ: the higher batch size reduces the number of optimization steps, and the two models behave very differently when used.
Is there a way to get the same training results across two different batch sizes?
The reason I think this is important is for guides.
UserA and UserB might follow the same guide and use the same settings, but UserB has more VRAM and increases the batch size, thinking it works the same as when generating images (doing work in parallel).
Instead they will end up with different, possibly worse, results, and with the potential of increased training time depending on max_train_epochs or max_train_steps.
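To make the step-count effect concrete, here is a rough sketch in Python; the dataset size is a made-up illustrative number, not my actual dataset:

```python
import math

# Illustrative numbers only; not the actual dataset.
dataset_size = 400        # images x repeats seen per epoch
max_train_epochs = 8

for batch_size in (8, 16, 32):
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    total_steps = steps_per_epoch * max_train_epochs
    print(f"batch_size={batch_size}: {steps_per_epoch} steps/epoch, "
          f"{total_steps} optimizer steps total")
```

For a fixed number of epochs, doubling the batch size roughly halves the number of optimizer updates the model receives.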
The command used for reference (only the batch size was changed):
accelerate launch --num_cpu_threads_per_process=2 train_network.py \
  --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
  --train_data_dir=/workspace/train/ --output_dir=/workspace/ --output_name=test \
  --caption_extension=.txt --save_model_as=safetensors --network_module=networks.lora \
  --resolution=512,512 --text_encoder_lr=3e-5 --unet_lr=1.6e-5 --learning_rate=1.6e-5 \
  --network_dim=256 --network_alpha=128 --lr_scheduler_num_cycles=3 \
  --lr_scheduler=cosine_with_restarts --lr_warmup_steps=0 --train_batch_size=34 \
  --max_train_epochs=8 --save_every_n_epochs=2 --mixed_precision=fp16 \
  --save_precision=fp16 --optimizer_type=AdamW --max_token_length=150 \
  --mem_eff_attn --xformers --enable_bucket --bucket_reso_steps=64 \
  --bucket_no_upscale --random_crop --noise_offset=0.0
Run 1: train_batch_size 12, max_train_epochs 8
Run 2: train_batch_size 34, max_train_epochs 8
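If the goal is roughly comparable results at the larger batch size, one common heuristic is the linear scaling rule (Goyal et al., 2017): scale the learning rate by the batch-size ratio. As far as I know, train_network.py does not do this automatically; the sketch below is only an approximation, not a guarantee of identical results:

```python
# Linear scaling rule heuristic: scale the learning rate with the batch size.
# This is an approximation; it will not exactly reproduce the batch-size-12
# run, especially for LoRA fine-tuning.
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * new_batch / base_batch

base_batch, new_batch = 12, 34
print(scale_lr(1.6e-5, base_batch, new_batch))  # unet_lr / learning_rate -> ~4.5e-5
print(scale_lr(3e-5, base_batch, new_batch))    # text_encoder_lr         -> ~8.5e-5
```

Even with the learning rates scaled, exact parity is unlikely, since the gradient noise statistics and the number of scheduler steps still differ between the two batch sizes.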