Describe the bug
A clear and concise description of what the bug is.
I'm fine-tuning some pretrained models on the task ReCoRD.
During training, the progress bar shows that it gets stuck at regular intervals.
By "stuck" I mean that training pauses for a while: the process still occupies GPU memory, but GPU utilization is at 0%.
For example, when I fine-tune roberta-base on ReCoRD, training pauses for a few minutes every 313 iterations.
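To pin down the stall period more precisely, one option is to time how long each step waits for its next batch and flag the outliers. The snippet below is only a minimal sketch and is not part of jiant: `time_steps`, `find_stalls`, and the `factor` threshold are hypothetical names chosen for illustration, and it assumes the training batches come from an ordinary Python iterable.

```python
import time
from statistics import median


def time_steps(iterable):
    """Consume an iterable and return the seconds spent waiting for each item."""
    durations = []
    it = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        durations.append(time.perf_counter() - start)
    return durations


def find_stalls(durations, factor=10.0):
    """Return indices of steps that took more than `factor` times the median."""
    if not durations:
        return []
    threshold = factor * median(durations)
    return [i for i, d in enumerate(durations) if d > threshold]
```

Passing the training data iterable to `time_steps` and the result to `find_stalls` would list the slow steps. If the same indices show up when the batches are iterated without running the model, the pause is probably on the data-loading side rather than in the GPU computation, although that is a guess and not something confirmed here.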
To Reproduce
Tell us which version of jiant you're using
The current master branch of jiant.
Describe the environment where you're using jiant, e.g., "2 P40 GPUs"
1 RTX 3090 GPU. The problem also occurs when I use 2 or more RTX 3090 GPUs.
Provide the experiment config artifact (e.g., defaults.conf)
Script used to train:
The corresponding log file is shown below. In this test, the training process got stuck at exactly 313 iterations.
Expected behavior
A clear and concise description of what you expected to happen.
Training runs without interruption, or at least the reason for the stall is reported.
Screenshots
If applicable, add screenshots to help explain your problem.
The only output I got was the log file shown above.
Additional context
Add any other context about the problem here.