While training the DIT model on the FreeSound (I chose 250k audio clips) and FMA datasets, training hangs after processing approximately 198,400 samples (8 GPUs, batch size 8 per GPU, i.e. around step 3,100). After some time, an NCCL communication timeout occurs. I tried lowering the batch size to 6, but the same issue appeared after about 198,600 samples (~4,100 steps). Interestingly, when I reduce the FreeSound dataset to 150k samples in total, training proceeds without issues. Could this be related to the dataset size or to NCCL synchronization across GPUs?
During the NCCL communication wait, half of the GPUs show 0% utilization and the other half show 100%, but in reality none of them are doing any work (power draw is the same as at idle).
Here are my training logs from WandB. As shown, the training loss stopped updating at step 3,000, but memory usage continued to be logged. I've already ruled out dataset issues and CUDA out-of-memory errors.
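In case it helps to narrow this down: a common first step for silent NCCL hangs (generic PyTorch/NCCL debugging, not specific to this repo's launcher) is to turn on NCCL's own logging and make collectives fail loudly instead of hanging. A sketch, assuming a standard `torchrun` launch; `train.py` stands in for whatever your actual entry point is, and the exact async-error-handling variable name varies across PyTorch versions:

```shell
# Hedged sketch: generic NCCL/PyTorch debugging env vars, not this repo's config.
export NCCL_DEBUG=INFO                    # log NCCL init and transport selection
export NCCL_DEBUG_SUBSYS=COLL             # (optional) focus logs on collectives
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # abort with a stack trace on timeout
                                          # (older PyTorch: NCCL_ASYNC_ERROR_HANDLING)
torchrun --nproc_per_node=8 train.py      # your usual launch command
```

With these set, the rank that stalls usually prints which collective it was stuck in, which tells you whether the hang is in the loss all-reduce or somewhere in data loading.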
shwj114514 changed the title from "Training hangs after processing 198,400 samples on FreeSound and FMA datasets with DIT model" to "Training hangs after processing 200,000 samples on FreeSound and FMA datasets with DIT model" on Sep 27, 2024
What kind of dataset are you using? Local samples, local webdataset, or s3 webdataset?
Can you send along the GPU memory utilization charts as well?
Are you using the default multi-GPU strategy, or deepspeed?
Anything different about your conditioning signals?
I've noticed memory leaks in the loader quite a bit, especially when using custom metadata modules. That's usually solved by reducing num_workers, but your num_workers already appears to be set reasonably low.
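One more thing worth checking, given that half the ranks sit at 0% while the others spin: if the data isn't split evenly across ranks (e.g. uneven webdataset shards without drop_last), some ranks exhaust their data first and the rest block forever in the next all-reduce, which produces exactly this kind of NCCL timeout near the end of an epoch. A minimal, framework-free sketch of the arithmetic; the shard sizes below are hypothetical, not taken from your run:

```python
def steps_per_rank(samples_per_rank, batch_size):
    """Number of optimizer steps each rank will attempt."""
    return [n // batch_size for n in samples_per_rank]

def ranks_are_balanced(samples_per_rank, batch_size):
    """True only if every rank performs the same number of steps.

    In DDP-style training, any mismatch means the longer-running ranks
    block in an all-reduce that the short ranks never join -> NCCL timeout.
    """
    steps = steps_per_rank(samples_per_rank, batch_size)
    return len(set(steps)) == 1

# Hypothetical example: 8 ranks, batch size 8, one shard slightly short.
counts = [31250] * 7 + [31242]        # ~250k samples total, uneven split
print(ranks_are_balanced(counts, 8))  # -> False: this setup would hang
```

This would also fit the observation that shrinking the dataset to 150k "fixes" it, if the smaller subset happens to shard evenly.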