Running on 4 V100s, but epoch stays at 0 #23

Open
slerman12 opened this issue Aug 26, 2021 · 6 comments
@slerman12

I'm running the model on 4 V100 GPUs using SLURM and the following sbatch script:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH -p reserved --reservation=slerman-20210821 -t 2-00:00:00 
#SBATCH -t 5-00:00:00 -o ./vgpt.log -J vgpt
#SBATCH --mem=50gb 
#SBATCH -C V100
source /scratch/slerman/miniconda/bin/activate vid
python3 train_videogpt.py --max_steps 200000 --vqvae ucf101_stride4x4x4 --data_path ./datasets/ucf101/ --gpus 4

However, after a full day, the logs still show the model stuck at epoch 0.

Do you know what's going wrong?

Thank you.

@wilson1yan
Owner

wilson1yan commented Aug 26, 2021

Do you know how many iterations it's trained for?

One epoch consists of all possible video clips in the dataset. For UCF101, there are 9537 videos of length ~240 frames each, which comes to roughly 2 million possible clips, so with a batch size of 32 an epoch is around 60k iterations. If you're using a lower batch size, it'd be even more iterations.
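
As a rough sketch of that arithmetic (the one-clip-per-start-frame assumption is an approximation from the numbers above, not an exact dataset statistic):

# Back-of-envelope estimate of iterations per epoch
num_videos = 9537                    # UCF101 training videos
frames_per_video = 240               # approximate average length
clips_per_video = frames_per_video   # roughly one clip per start frame
total_clips = num_videos * clips_per_video   # ~2.3 million clips
batch_size = 32
iters_per_epoch = total_clips // batch_size
print(iters_per_epoch)  # ~71,500: same ballpark as the ~60k above,
                        # and close to the 71360 in the progress bar later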

I recommend training with the flags --val_check_interval 5000 (the number of training iterations between validation checks) and --limit_val_batches 500 (or some lower but still reasonable number, since running validation over the whole set can take a long time).
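
For concreteness, the earlier command with those two flags added would look something like this (same arguments as in the sbatch script above):

python3 train_videogpt.py --max_steps 200000 --vqvae ucf101_stride4x4x4 --data_path ./datasets/ucf101/ --gpus 4 --val_check_interval 5000 --limit_val_batches 500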

@slerman12
Author

slerman12 commented Aug 26, 2021

By iterations, do you mean how far into the epoch it is? When I checked, it had reached 100% and completed the whole epoch, but no more than that. A whole day for one epoch seems extreme. How long does the model normally take to train? 200000 steps is a lot, but I thought 4 V100s would suffice. Unless I did something wrong.

@wilson1yan
Owner

4 V100s should suffice. By iterations, I mean training steps, so training for 200k iterations is only ~2-3 epochs. The total training time should be ~2 days.
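
A quick check of that estimate against the numbers in this thread (the per-epoch iteration count is taken from the Epoch 0 progress bar quoted below):

# How many epochs 200k steps covers
max_steps = 200_000
iters_per_epoch = 71_360   # from the Epoch 0 progress bar below
print(max_steps / iters_per_epoch)  # ~2.8 epochs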

@slerman12
Author

I'll try adding those --val_check_interval and --limit_val_batches flags. It's strange, though: a single epoch actually took >50 hours:

Epoch 0: 100%|██████████| 71360/71360 [50:55:33<00:00, 2.57s/it, loss=3.64, v_num=6635713, val/loss=3.690]
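
(For what it's worth, the bar's numbers are self-consistent; a quick check using only what it reports:)

# Wall-clock time implied by the progress bar
iters = 71_360
sec_per_iter = 2.57
print(iters * sec_per_iter / 3600)  # ~50.9 hours, matching 50:55:33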

@slerman12
Author

This is on 4 V100s

@wilson1yan
Owner

wilson1yan commented Sep 27, 2021

Hmm, the 2.57s/it actually doesn't seem too far off. You can get a good speed-up (~2x) and roughly halve GPU memory usage if you train with sparse attention and/or mixed precision: --attn_type sparse --amp_level O1 --precision 16 (sparse attention needs to be installed before you can use it). Though I've found that mixed precision is sometimes unstable depending on what kind of model is being trained (i.e. it usually works for unconditional training, but seems to be more unstable when training class-conditional models).
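
Assuming the same base command as earlier in this thread (and that --attn_type is the flag form of the attn_type option mentioned above), the full invocation would be roughly:

python3 train_videogpt.py --max_steps 200000 --vqvae ucf101_stride4x4x4 --data_path ./datasets/ucf101/ --gpus 4 --attn_type sparse --amp_level O1 --precision 16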
