buffer.grad is not None: how can this error be fixed? #34

Closed
thunder95 opened this issue Sep 9, 2024 · 6 comments
@thunder95

Running the two-GPU training command: python -m torch.distributed.run --nproc_per_node=2 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2

File "/data/workspace/MPP-LLaVA/train_pipeline.py", line 228, in main
loss = engine.train_batch(data_iter=train_iter)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 388, in train_batch
self._exec_schedule(sched)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1422, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1102, in _exec_send_grads
assert buffer.grad is not None
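
For context, this assertion sits in the step of DeepSpeed's pipeline engine that sends activation gradients back to the previous stage: after backward on this stage, every tensor in the received input buffer is expected to carry a .grad. A simplified sketch of that code path (illustrative only, not DeepSpeed's exact implementation; names are placeholders):

# Simplified sketch of the grad-send step in pipeline parallelism (illustrative).
# After backward, the gradients of the inputs received from the previous stage
# are shipped back. Any input tensor that the autograd graph never reaches
# (e.g. an attention mask) has buffer.grad == None, and the assert below is
# the one shown in the traceback.
def send_input_grads(inputs, send_to_prev_stage):
    for buffer in inputs:
        assert buffer.grad is not None   # fails for tensors that received no gradient
        send_to_prev_stage(buffer.grad)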

@Coobiw (Owner) commented Sep 9, 2024

I haven't run into this problem. Could you share a screenshot of your pipelayer distribution?

@thunder95 (Author)

Thanks for looking into it. The code has not been modified. There was a warning (see the note after the partition listing below):
/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

The pipelayer has not been changed:

[2024-09-10 09:18:35,522] [INFO] [module.py:396:_partition_layers] Partitioning pipeline stages with method uniform
stage=0 layers=25
0: TokenizerPipeLayer
1: IndentityPipeLayer
2: IndentityPipeLayer
3: IndentityPipeLayer
4: IndentityPipeLayer
5: QwenBlockPipeLayer
6: QwenBlockPipeLayer
7: QwenBlockPipeLayer
8: QwenBlockPipeLayer
9: QwenBlockPipeLayer
10: QwenBlockPipeLayer
11: QwenBlockPipeLayer
12: QwenBlockPipeLayer
13: QwenBlockPipeLayer
14: QwenBlockPipeLayer
15: QwenBlockPipeLayer
16: QwenBlockPipeLayer
17: QwenBlockPipeLayer
18: QwenBlockPipeLayer
19: QwenBlockPipeLayer
20: QwenBlockPipeLayer
21: QwenBlockPipeLayer
22: QwenBlockPipeLayer
23: QwenBlockPipeLayer
24: QwenBlockPipeLayer
stage=1 layers=24
25: QwenBlockPipeLayer
26: QwenBlockPipeLayer
27: QwenBlockPipeLayer
28: QwenBlockPipeLayer
29: QwenBlockPipeLayer
30: QwenBlockPipeLayer
31: QwenBlockPipeLayer
32: QwenBlockPipeLayer
33: QwenBlockPipeLayer
34: QwenBlockPipeLayer
35: QwenBlockPipeLayer
36: QwenBlockPipeLayer
37: QwenBlockPipeLayer
38: QwenBlockPipeLayer
39: QwenBlockPipeLayer
40: QwenBlockPipeLayer
41: QwenBlockPipeLayer
42: QwenBlockPipeLayer
43: QwenBlockPipeLayer
44: QwenBlockPipeLayer
45: FLNPipeLayer
46: LMPipeLayer
47: LossPipeLayer
48: IndentityPipeLayerLast
GPU1 Trainable Params: 1000000
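
As a side note, the UserWarning quoted above is emitted by torch.utils.checkpoint when none of the tensors passed into a checkpointed block require gradients; in that case the checkpointed segment produces no gradients at all, which is consistent with buffer.grad ending up None. A minimal reproduction, unrelated to the project code:

import torch
import torch.utils.checkpoint as checkpoint

def block(x):
    return x * 2

x = torch.randn(4, 8)  # requires_grad is False by default
# The reentrant checkpoint path warns:
# "None of the inputs have requires_grad=True. Gradients will be None"
# (use_reentrant is accepted by recent PyTorch; older versions use this path by default)
y = checkpoint.checkpoint(block, x, use_reentrant=True)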

@Coobiw (Owner) commented Sep 10, 2024

That looks fine. Maybe check the versions of the key libraries (torch, transformers, accelerate, deepspeed, etc.).

@thunder95 (Author)

The conda environment was recreated from requirements. The deepspeed version is probably the difference, but neither the latest release nor the deepspeed==0.13.5 pinned in requirements.txt works.
The actual failure is that attention_mask has no gradient; deepspeed partially handles this case (assuming the attention mask is the last element):


# Drop the attention mask from the input buffer here. It does not have
# a grad that needs to be communicated. We free the buffer immediately
# after, so no need to restore it. The receiver also has a hack that skips
# the recv. This is because NCCL does not let us send torch.BoolTensor :-(.
if self.has_attention_mask or self.has_bool_tensors:
    inputs = list(inputs)
    inputs.pop()
    inputs = tuple(inputs)
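
In other words, the snippet above drops only the final element of the input buffer, on the assumption that the attention mask sits last. An illustrative layout under that assumption (placeholder names and shapes, not the repo's exact tuple):

import torch

# Hypothetical layout of the tensors flowing between two pipeline stages.
# DeepSpeed's pop() above removes only the last element before sending grads,
# so it relies on the attention mask being placed at the end of the buffer.
hidden_states = torch.randn(1, 16, 32, requires_grad=True)
rotary_pos_emb = torch.randn(16, 32)                 # no gradient flows into this
attention_mask = torch.ones(1, 16)

inputs = (hidden_states, rotary_pos_emb, attention_mask)
inputs = inputs[:-1]  # what the pop() above does: drop only the mask
# rotary_pos_emb stays in the buffer; if it never receives a gradient,
# `assert buffer.grad is not None` can still fail on it.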

Which deepspeed version did you use when you ran this?

@Coobiw (Owner) commented Sep 11, 2024

https://github.com/Coobiw/MPP-LLaVA/blob/master/lavis/models/minigpt4qwen_models/minigpt4qwen_pipe.py#L138

I ran into the attention_mask gradient problem before as well. The line linked above already sets attention_mask's requires_grad to True, and I have not hit any related problem since. You could debug it; the cause may be somewhere else.

As for the deepspeed version: deepspeed==0.13.5
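
For reference, the repo-side workaround linked above marks the auxiliary tensors as requiring gradients inside the first pipeline stage so they participate in autograd and DeepSpeed's gradient communication. A rough reconstruction; only the two requires_grad_ calls are from the repo, the surrounding forward is a hypothetical sketch:

import torch
import torch.nn as nn

class TokenizerPipeLayer(nn.Module):
    """Hypothetical sketch of the first pipeline stage's forward."""
    def forward(self, inputs):
        hidden_states, rotary_pos_emb_list, attention_mask = inputs
        # ... embedding / rotary-embedding / mask construction elided ...
        # Mark the auxiliary tensors as requiring grad so they are treated
        # like ordinary activations when gradients are sent between stages.
        rotary_pos_emb_list.requires_grad_(True)
        attention_mask.requires_grad_(True)
        return hidden_states, rotary_pos_emb_list, attention_mask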

@thunder95 (Author)

deepspeed has some issues handling this part. I commented out these two lines, and then, when deepspeed passes the gradients, filtered out tensors that do not have requires_grad, and it works @Coobiw
rotary_pos_emb_list.requires_grad_(True)
attention_mask.requires_grad_(True)
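
A sketch of the deepspeed-side filtering described here, i.e. skipping buffers that do not require a gradient instead of asserting on them (illustrative only, not the exact patch):

import torch

# Illustrative sketch of the workaround: with the two requires_grad_ calls
# commented out, attention_mask / rotary_pos_emb no longer require grad, so
# skip them when sending gradients back to the previous stage instead of
# tripping `assert buffer.grad is not None`.
def send_input_grads(inputs, send_to_prev_stage):
    for buffer in inputs:
        if not torch.is_tensor(buffer) or not buffer.requires_grad:
            continue  # tensors without requires_grad are not communicated
        assert buffer.grad is not None
        send_to_prev_stage(buffer.grad)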

Coobiw added the good first issue label on Sep 11, 2024
Coobiw pinned this issue on Sep 11, 2024