Running on 4 V100s, but epoch stays at 0 #23
I'm running the model on 4 V100 GPUs using SLURM and an sbatch script. However, after a full day, the logs still show the model stuck at epoch 0. Do you know what's going wrong? Thank you.

Comments
Do you know how many iterations it's been trained for? One epoch consists of all possible video clips in the dataset. For UCF101, there are 9,537 training videos of length ~240 frames, which comes out to roughly 2 million possible clips, so with a batch size of 32 one epoch is around 60k iterations. With a lower batch size it would be even more iterations. I recommend training with flags that limit how often and how much validation runs (the val_check and limit_val flags mentioned below).
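As a quick sanity check on those numbers, a minimal sketch (the 16-frame clip length is an assumption; the thread doesn't state the window size used):

```python
# Back-of-the-envelope iterations-per-epoch estimate for UCF101.
num_videos = 9537        # UCF101 training videos
frames_per_video = 240   # approximate length, per the comment above
clip_len = 16            # assumed training window (not stated in the thread)
batch_size = 32

clips_per_video = frames_per_video - clip_len + 1  # every start frame -> one clip (~225)
total_clips = num_videos * clips_per_video         # ~2.1M, matching "~2 million" above
iters_per_epoch = total_clips // batch_size        # ~67k, in line with "around 60k"
print(clips_per_video, total_clips, iters_per_epoch)
```

The log further down (71,360 iterations per epoch) corresponds to ~2.28M clips at batch size 32, consistent with this estimate; it also means 200k iterations is about 200,000 / 71,360 ≈ 2.8 epochs, matching the "~2-3 epochs" below.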
By iterations, do you mean how much of the epoch? When I checked, it had reached 100% and completed the whole epoch, but no more than that. A whole day for 1 epoch seems extreme. How long does the model normally take to run? 100000 epochs is a lot, but I thought 4 V100s would suffice, unless I did something wrong.
4 V100s should suffice. By iterations, I mean training steps, so training for 200k iterations is only ~2-3 epochs. The total training time should be ~2 days.
I'll try adding those val_check and limit_val flags. It's so strange; a single epoch actually took >50 hrs:

Epoch 0: 100%|██████████| 71360/71360 [50:55:33<00:00, 2.57s/it, loss=3.64, v_num=6635713, val/loss=3.690]
This is on 4 V100s |
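For reference, those flags map onto PyTorch Lightning Trainer arguments. A minimal sketch, assuming the repo's training script builds a standard Lightning Trainer (argument spellings follow Lightning 1.x; newer versions rename some of these):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,                  # the 4 V100s discussed above (newer Lightning: accelerator="gpu", devices=4)
    val_check_interval=0.5,  # run validation twice per training epoch (or pass an int for "every N steps")
    limit_val_batches=0.25,  # validate on only 25% of the val set, cutting per-epoch wall-clock time
)
```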
Hmm, the 2.57s/it actually doesn't seem too far off. You can get a good amount of speed-up (~2x) and lower GPU memory costs (~0.5x) if you train with sparse attention and/or mixed precision:
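(The exact command here wasn't preserved; below is a minimal sketch of both options in Lightning terms. precision=16 is standard Lightning mixed precision, while the attn_type argument for sparse attention is an assumption about this repo's API, used only for illustration.)

```python
import pytorch_lightning as pl

# Mixed precision: roughly halves activation memory and speeds up matmuls
# on V100 tensor cores.
trainer = pl.Trainer(
    gpus=4,
    precision=16,  # Lightning 1.x native AMP (newer versions: precision="16-mixed")
)

# Sparse attention is a model-level switch; the argument name below is an
# assumption for illustration, not confirmed by this thread.
# model = VideoGPT(..., attn_type="sparse")
```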