Loss of qav and vaq came to nan quickly. #23
Comments
I changed the code near `vqa_loss, vaq_loss, qav_loss = model(data)` and added `print(f"vqa_loss: {vqa_loss}, vaq_loss: {vaq_loss}, qav_loss: {qav_loss}")`. Here is the log:
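As a rough illustration of that kind of per-step check, here is a minimal sketch, assuming a standard PyTorch training loop; `model` and `data` follow the names used in this thread, while `data_loader` and the early-stop logic are purely illustrative:

```python
import torch

# Hypothetical loop sketch: print the three loss terms every step and stop
# at the first step where any of them becomes non-finite.
for step, data in enumerate(data_loader):
    vqa_loss, vaq_loss, qav_loss = model(data)
    print(f"step {step}: vqa_loss: {vqa_loss}, vaq_loss: {vaq_loss}, qav_loss: {qav_loss}")

    for name, loss in (("vqa", vqa_loss), ("vaq", vaq_loss), ("qav", qav_loss)):
        if not torch.isfinite(loss):
            raise RuntimeError(f"{name}_loss became non-finite at step {step}")
```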
> qav and vaq came to nan quickly.
If you are using one GPU rather than 8 GPUs, then I recommend using …
I saw your comment last night and changed … The problem still exists.
Here is the command:

```
python train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 10 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 1e-4 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 32 --sub --vaq --qav
```
If I run the code with anomaly detection enabled:

```python
with torch.autograd.detect_anomaly():
    vqa_loss, vaq_loss, qav_loss = model(data)
```

it can't detect where the nan comes from. 🤔
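One thing worth noting, as a hedged aside: `torch.autograd.detect_anomaly()` only reports NaN/Inf values produced while gradients are being computed, so the `backward()` call has to run inside the context manager as well; wrapping only the forward pass gives it nothing to check. A minimal sketch (the way `total_loss` combines the three terms here is illustrative, not necessarily how this repo weights them):

```python
import torch

# Anomaly detection records forward tracebacks and checks for NaN/Inf
# during backward(); it raises only if a backward op produces a bad value.
with torch.autograd.detect_anomaly():
    vqa_loss, vaq_loss, qav_loss = model(data)
    total_loss = vqa_loss + vaq_loss + qav_loss  # illustrative combination
    total_loss.backward()
```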
But according to the printed output, the loss is not nan. The command is the training command from the README with the distributed-training arguments removed (the one shown above).
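Since the printed losses look finite, another place to look, as a purely illustrative suggestion, is the gradients right after `backward()` and before the optimizer step; the helper below is hypothetical, not part of this repo:

```python
import torch

# Hypothetical helper: return the name of the first parameter whose gradient
# contains a NaN/Inf, or None if all gradients are finite.
def first_nonfinite_grad(model: torch.nn.Module):
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            return name
    return None

# Intended usage (sketch): call it between loss.backward() and optimizer.step():
#     bad = first_nonfinite_grad(model)
#     if bad is not None:
#         print(f"non-finite gradient in parameter {bad}")
```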