buffer.grad is not None: how can this error be fixed? #34

Closed
thunder95 opened this issue Sep 9, 2024 · 6 comments
@thunder95

Running the two-GPU training command: python -m torch.distributed.run --nproc_per_node=2 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2

File "/data/workspace/MPP-LLaVA/train_pipeline.py", line 228, in main
loss = engine.train_batch(data_iter=train_iter)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 388, in train_batch
self._exec_schedule(sched)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1422, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1102, in _exec_send_grads
assert buffer.grad is not None
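
For context, this assertion sits in the step of DeepSpeed's pipeline engine that sends activation gradients back to the previous stage: after backward on this stage, every tensor in the received input buffer is expected to carry a .grad. A simplified sketch of that code path (illustrative only, not DeepSpeed's exact implementation; names are placeholders):

# Simplified sketch of the grad-send step in pipeline parallelism (illustrative).
# After backward, the gradients of the inputs received from the previous stage
# are shipped back. Any input tensor that the autograd graph never reaches
# (e.g. an attention mask) has buffer.grad == None, and the assert below is
# the one shown in the traceback.
def send_input_grads(inputs, send_to_prev_stage):
    for buffer in inputs:
        assert buffer.grad is not None   # fails for tensors that received no gradient
        send_to_prev_stage(buffer.grad)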

@Coobiw (Owner) commented Sep 9, 2024

I haven't run into this problem. Could you share a screenshot of your pipelayer distribution?

@thunder95 (Author)

Thanks for looking into it. The code has not been modified. There was a warning (see the note after the partition listing below):
/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

The pipelayer has not been changed:

[2024-09-10 09:18:35,522] [INFO] [module.py:396:_partition_layers] Partitioning pipeline stages with method uniform
stage=0 layers=25
0: TokenizerPipeLayer
1: IndentityPipeLayer
2: IndentityPipeLayer
3: IndentityPipeLayer
4: IndentityPipeLayer
5: QwenBlockPipeLayer
6: QwenBlockPipeLayer
7: QwenBlockPipeLayer
8: QwenBlockPipeLayer
9: QwenBlockPipeLayer
10: QwenBlockPipeLayer
11: QwenBlockPipeLayer
12: QwenBlockPipeLayer
13: QwenBlockPipeLayer
14: QwenBlockPipeLayer
15: QwenBlockPipeLayer
16: QwenBlockPipeLayer
17: QwenBlockPipeLayer
18: QwenBlockPipeLayer
19: QwenBlockPipeLayer
20: QwenBlockPipeLayer
21: QwenBlockPipeLayer
22: QwenBlockPipeLayer
23: QwenBlockPipeLayer
24: QwenBlockPipeLayer
stage=1 layers=24
25: QwenBlockPipeLayer
26: QwenBlockPipeLayer
27: QwenBlockPipeLayer
28: QwenBlockPipeLayer
29: QwenBlockPipeLayer
30: QwenBlockPipeLayer
31: QwenBlockPipeLayer
32: QwenBlockPipeLayer
33: QwenBlockPipeLayer
34: QwenBlockPipeLayer
35: QwenBlockPipeLayer
36: QwenBlockPipeLayer
37: QwenBlockPipeLayer
38: QwenBlockPipeLayer
39: QwenBlockPipeLayer
40: QwenBlockPipeLayer
41: QwenBlockPipeLayer
42: QwenBlockPipeLayer
43: QwenBlockPipeLayer
44: QwenBlockPipeLayer
45: FLNPipeLayer
46: LMPipeLayer
47: LossPipeLayer
48: IndentityPipeLayerLast
GPU1 Trainable Params: 1000000
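
As a side note, the UserWarning quoted above is emitted by torch.utils.checkpoint when none of the tensors passed into a checkpointed block require gradients; in that case the checkpointed segment produces no gradients at all, which is consistent with buffer.grad ending up None. A minimal reproduction, unrelated to the project code:

import torch
import torch.utils.checkpoint as checkpoint

def block(x):
    return x * 2

x = torch.randn(4, 8)  # requires_grad is False by default
# The reentrant checkpoint path warns:
# "None of the inputs have requires_grad=True. Gradients will be None"
# (use_reentrant is accepted by recent PyTorch; older versions use this path by default)
y = checkpoint.checkpoint(block, x, use_reentrant=True)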

@Coobiw (Owner) commented Sep 10, 2024

That looks fine. Maybe check the versions of the key libraries (torch, transformers, accelerate, deepspeed, etc.).

@thunder95 (Author)

The conda environment was recreated from requirements. The deepspeed version is probably the difference, but neither the latest release nor the deepspeed==0.13.5 pinned in requirements.txt works.
The actual failure is that attention_mask has no gradient; deepspeed partially handles this case (assuming the attention mask is the last element):


# Drop the attention mask from the input buffer here. It does not have
# a grad that needs to be communicated. We free the buffer immediately
# after, so no need to restore it. The receiver also has a hack that skips
# the recv. This is because NCCL does not let us send torch.BoolTensor :-(.
if self.has_attention_mask or self.has_bool_tensors:
    inputs = list(inputs)
    inputs.pop()
    inputs = tuple(inputs)
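
In other words, the snippet above drops only the final element of the input buffer, on the assumption that the attention mask sits last. An illustrative layout under that assumption (placeholder names and shapes, not the repo's exact tuple):

import torch

# Hypothetical layout of the tensors flowing between two pipeline stages.
# DeepSpeed's pop() above removes only the last element before sending grads,
# so it relies on the attention mask being placed at the end of the buffer.
hidden_states = torch.randn(1, 16, 32, requires_grad=True)
rotary_pos_emb = torch.randn(16, 32)                 # no gradient flows into this
attention_mask = torch.ones(1, 16)

inputs = (hidden_states, rotary_pos_emb, attention_mask)
inputs = inputs[:-1]  # what the pop() above does: drop only the mask
# rotary_pos_emb stays in the buffer; if it never receives a gradient,
# `assert buffer.grad is not None` can still fail on it.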

Which deepspeed version did you use when you ran this?

@Coobiw (Owner) commented Sep 11, 2024

https://github.com/Coobiw/MPP-LLaVA/blob/master/lavis/models/minigpt4qwen_models/minigpt4qwen_pipe.py#L138

I ran into the attention_mask gradient problem before as well. The line linked above already sets attention_mask's requires_grad to True, and I have not hit any related problem since. You could debug it; the cause may be somewhere else.

As for the deepspeed version: deepspeed==0.13.5
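
For reference, the repo-side workaround linked above marks the auxiliary tensors as requiring gradients inside the first pipeline stage so they participate in autograd and DeepSpeed's gradient communication. A rough reconstruction; only the two requires_grad_ calls are from the repo, the surrounding forward is a hypothetical sketch:

import torch
import torch.nn as nn

class TokenizerPipeLayer(nn.Module):
    """Hypothetical sketch of the first pipeline stage's forward."""
    def forward(self, inputs):
        hidden_states, rotary_pos_emb_list, attention_mask = inputs
        # ... embedding / rotary-embedding / mask construction elided ...
        # Mark the auxiliary tensors as requiring grad so they are treated
        # like ordinary activations when gradients are sent between stages.
        rotary_pos_emb_list.requires_grad_(True)
        attention_mask.requires_grad_(True)
        return hidden_states, rotary_pos_emb_list, attention_mask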

@thunder95 (Author)

deepspeed has some issues handling this part. I commented out these two lines, and then, when deepspeed passes the gradients, filtered out tensors that do not have requires_grad, and it works @Coobiw
rotary_pos_emb_list.requires_grad_(True)
attention_mask.requires_grad_(True)
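
A sketch of the deepspeed-side filtering described here, i.e. skipping buffers that do not require a gradient instead of asserting on them (illustrative only, not the exact patch):

import torch

# Illustrative sketch of the workaround: with the two requires_grad_ calls
# commented out, attention_mask / rotary_pos_emb no longer require grad, so
# skip them when sending gradients back to the previous stage instead of
# tripping `assert buffer.grad is not None`.
def send_input_grads(inputs, send_to_prev_stage):
    for buffer in inputs:
        if not torch.is_tensor(buffer) or not buffer.requires_grad:
            continue  # tensors without requires_grad are not communicated
        assert buffer.grad is not None
        send_to_prev_stage(buffer.grad)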

Coobiw added the good first issue label on Sep 11, 2024
Coobiw pinned this issue on Sep 11, 2024