
Loss of qav and vaq came to nan quickly. #23

Open

chuanwise opened this issue Aug 1, 2024 · 6 comments

Comments

chuanwise commented Aug 1, 2024

Not using distributed mode
[18:17:54.955532] job dir: /home/23031212503/projects/Flipped-VQA
[18:17:54.955618] Namespace(batch_size=1,
epochs=5,
accum_iter=4,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.07,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[18:18:16.740925] Num train data: 122039
[18:18:24.026051] Num val data: 15253
[18:18:24.039350] Using model: 7B
[18:18:24.041255] loading from pretrained/llama/7B/consolidated.00.pth
[18:19:13.553202] base lr: 7.00e-02
[18:19:13.553243] actual lr: 1.09e-03
[18:19:13.553254] accumulate grad iterations: 4
[18:19:13.553258] effective batch size: 4
[18:19:13.554187] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.02
)
[18:19:13.554305] Start training for 5 epochs
[18:19:17.576096] Epoch: [0]  [     0/122039]  eta: 5 days, 16:15:56  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 4.0197  data: 0.7782  max mem: 37679
[18:19:23.617162] Loss is nan, stopping training

But according to the printed values, the loss is not NaN.

The command is the training command from the README, with the distributed-training arguments removed:

python train.py --model 7B --max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa --blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
chuanwise (Author) commented

I changed the code near engine.py:25 to the following (a slightly more defensive variant of this probe is sketched after the log below):

vqa_loss, vaq_loss, qav_loss = model(data)  # forward pass returns the three loss terms
print(f"vqa_loss: {vqa_loss}, vaq_loss: {vaq_loss}, qav_loss: {qav_loss}")

And here is the log:

[19:52:42.854386] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[19:52:43.300642] Epoch: [0]  [     0/122039]  eta: 4 days, 4:45:00  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 2.9720  data: 0.8503  max mem: 37679
[19:52:43.623051] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[19:52:44.408110] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[19:52:45.173799] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[19:52:45.960194] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[19:52:46.725876] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[19:52:47.497957] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[19:52:48.268652] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[19:52:49.053853] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[19:52:49.822032] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[19:52:49.826290] Loss is nan, stopping training
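
A slightly more defensive version of that probe (a sketch; it assumes model(data) returns the three scalar losses exactly as in the snippet above) stops at the first non-finite term and names it:

import torch

vqa_loss, vaq_loss, qav_loss = model(data)

# Flag the first loss term that goes non-finite (NaN or inf) instead of
# waiting for the check on the summed loss to trip.
for name, loss in [("vqa", vqa_loss), ("vaq", vaq_loss), ("qav", qav_loss)]:
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"{name}_loss is non-finite: {loss}")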

chuanwise changed the title from "Loss is nan, but it isn't nan." to "Loss of qav and vaq came to nan quickly." on Aug 1, 2024
ikodoh (Contributor) commented Aug 1, 2024

If you are using one GPU rather than 8 GPUs, I recommend using --accum_iter 32, since the effective batch size is 8 times smaller. Alternatively, you can use a lower blr. The current loss seems to diverge because the blr is large while the batch size is small.
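
For reference: judging from the logged values (blr 7e-2 giving actual lr 1.09e-3 at effective batch size 4, and blr 1e-4 giving 1.25e-5 at effective batch size 32), the actual learning rate appears to follow the common linear scaling rule lr = blr * effective_batch_size / 256. The base of 256 is inferred from those logs rather than quoted from the code, so treat this as a sketch of the scaling, not the repository's implementation:

def actual_lr(blr, batch_size, accum_iter, world_size=1):
    # Linear lr scaling as implied by the logged values: lr = blr * eff_batch / 256.
    eff_batch = batch_size * accum_iter * world_size
    return blr * eff_batch / 256

print(actual_lr(7e-2, batch_size=1, accum_iter=4))                # 0.00109375 (single GPU, original command)
print(actual_lr(1e-4, batch_size=1, accum_iter=32))               # 1.25e-05   (single GPU, --accum_iter 32)
print(actual_lr(7e-2, batch_size=1, accum_iter=4, world_size=8))  # 0.00875    (the 8-GPU setting implied above)

Matching the 8-GPU step size on a single GPU therefore means either raising accum_iter or lowering blr, consistent with the advice above.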

chuanwise (Author) commented

I saw your comment last night and changed blr to 1e-4, but the problem still exists.
Now I'm trying to use --accum_iter 32. 🤔

chuanwise (Author) commented

The problem still exists:

Not using distributed mode
[09:30:04.319382] job dir: /home/23031212503/projects/Flipped-VQA
[09:30:04.319483] Namespace(batch_size=1,
epochs=10,
accum_iter=32,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.0001,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[09:30:29.629968] Num train data: 122039
[09:30:37.490100] Num val data: 15253
[09:30:37.506190] Using model: 7B
[09:30:37.514400] loading from pretrained/llama/7B/consolidated.00.pth
[09:31:27.752191] base lr: 1.00e-04
[09:31:27.752239] actual lr: 1.25e-05
[09:31:27.752249] accumulate grad iterations: 32
[09:31:27.752253] effective batch size: 32
[09:31:27.753421] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 1.25e-05
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 1.25e-05
    maximize: False
    weight_decay: 0.02
)
[09:31:27.753595] Start training for 10 epochs
[09:31:31.378744] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[09:31:31.814755] Epoch: [0]  [     0/122039]  eta: 5 days, 17:34:47  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 4.0584  data: 0.8839  max mem: 37679
[09:31:32.117008] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:31:32.972378] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:31:33.859110] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:31:34.591741] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:31:35.327070] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:31:36.051778] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:31:36.778938] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:31:37.510163] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:31:38.236347] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:31:38.242391] Loss is nan, stopping training

Here is the command:

python train.py --model 7B \
    --max_seq_len 650 --batch_size 1 --epochs 10 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
    --blr 1e-4 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 32 --sub --vaq --qav

chuanwise (Author) commented

If I run the code with anomaly detection enabled:

with torch.autograd.detect_anomaly():
    vqa_loss, vaq_loss, qav_loss = model(data)

It can't detect where the NaN comes from.

(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail err.90258*
==> err.902582.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():

==> err.902583.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():

==> err.902584.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail out.90258*
==> out.902582.log <==
[09:43:20.634369] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:43:21.808103] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:43:22.990304] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:43:24.292305] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:43:25.474479] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:43:26.666428] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:43:27.858821] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:43:29.167579] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:43:30.356480] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:43:30.367149] Loss is nan, stopping training

==> out.902583.log <==
[09:48:09.904082] vqa_loss: 2.009765625, vaq_loss: 2.48828125, qav_loss: 2.3202672004699707
[09:48:11.400204] vqa_loss: 1.90234375, vaq_loss: 2.208984375, qav_loss: 2.3291122913360596
[09:48:12.899157] vqa_loss: 2.033203125, vaq_loss: 2.5859375, qav_loss: 2.3255414962768555
[09:48:14.365934] vqa_loss: 1.8984375, vaq_loss: 2.486328125, qav_loss: 2.328566789627075
[09:48:15.935122] vqa_loss: 1.8720703125, vaq_loss: 2.34375, qav_loss: 2.3454670906066895
[09:48:17.412503] vqa_loss: 1.9130859375, vaq_loss: 2.47265625, qav_loss: 2.3222508430480957
[09:48:18.888214] vqa_loss: 2.033203125, vaq_loss: 2.328125, qav_loss: 2.3213603496551514
[09:48:20.486417] vqa_loss: 1.91796875, vaq_loss: 2.65625, qav_loss: 2.34771728515625
[09:48:21.978990] vqa_loss: 1.939453125, vaq_loss: 2.6015625, qav_loss: nan
[09:48:21.999707] Loss is nan, stopping training

==> out.902584.log <==
[09:44:14.089460] Start training for 5 epochs
[09:44:16.338710] vqa_loss: 1.9150390625, vaq_loss: 1.8759765625, qav_loss: 2.407137393951416
[09:44:16.696111] Epoch: [0]  [   0/9233]  eta: 6:40:47  lr: 0.000000  loss: 6.1982 (6.1982)  vqa_loss: 1.9150 (1.9150)  vaq_loss: 1.8760 (1.8760)  qav_loss: 2.4071 (2.4071)  time: 2.6045  data: 0.1838  max mem: 32770
[09:44:17.516800] vqa_loss: 1.984375, vaq_loss: 1.0400390625, qav_loss: 2.2512660026550293
[09:44:18.640382] vqa_loss: 1.8154296875, vaq_loss: 1.4052734375, qav_loss: 2.313685655593872
[09:44:19.772095] vqa_loss: 1.72265625, vaq_loss: 1.5458984375, qav_loss: 2.22432017326355
[09:44:21.041153] vqa_loss: 1.9140625, vaq_loss: 1.4482421875, qav_loss: 2.2694244384765625
[09:44:22.171438] vqa_loss: 1.8427734375, vaq_loss: 2.1640625, qav_loss: 2.3169543743133545
[09:44:23.298976] vqa_loss: 1.81640625, vaq_loss: 1.9326171875, qav_loss: nan
[09:44:23.310836] Loss is nan, stopping training

🤔
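
One caveat here: torch.autograd.detect_anomaly mainly reports NaNs produced during the backward pass, and if training stops on the finite-loss check before backward is run on the bad step (which the "Loss is nan, stopping training" message suggests), there is nothing for it to trace. A forward-hook probe can instead localize the first submodule whose output goes non-finite. This is a generic sketch, not code from the repository; model is whatever module engine.py trains:

import torch

def install_nan_hooks(model):
    # Register forward hooks that raise on the first submodule whose output
    # contains NaN or inf, naming the module so the source is easy to find.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and out.is_floating_point() and not torch.isfinite(out).all():
                    raise RuntimeError(f"non-finite output from module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Call once before the training loop, e.g. install_nan_hooks(model)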

ikodoh (Contributor) commented Aug 2, 2024

[Screenshot: training log, 2024-08-01 11:13 PM]

In my environment, the model trains well with the command below:
python train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav

Please make sure the environment on your machine matches the one described in the README.md.
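
Since the same command diverges on one machine and trains cleanly on another, a quick first step is to print the versions that most affect numerics and compare them against the README. A minimal sketch (it only reports the local versions; it does not know which versions the README pins):

import torch

# Versions most likely to affect numerical behaviour; compare against the README.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")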
