
CPM Bee fine-tuning: CUDA error when half is enabled, assert failure when half is disabled #79

Open
YingLaiLin opened this issue Jun 17, 2023 · 3 comments


@YingLaiLin

Training CPM with the fine-tuning script, without the --use-delta option enabled, produces the following error:
```
Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 364, in finetune
    optim_manager.step()
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/optim_manager.py", line 131, in step
    optimizer.step(scale=self.loss_scale)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/adam_offload.py", line 77, in step
    state["_grad_fp16"] = torch.empty(p.size(), dtype=torch.float16, pin_memory=True) # on host
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
```
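
The failing call allocates pinned (page-locked) host memory for the offloaded optimizer state. A minimal check, assuming nothing beyond PyTorch itself, to test whether pinned-memory allocation works in this environment at all (it can fail inside containers or VMs that restrict memory locking):

```python
import torch

# bmtrain's offload optimizer allocates gradient buffers on the host with
# pin_memory=True (see adam_offload.py in the traceback). If this snippet
# raises the same "OS call failed or operation not supported" error, the
# environment's pinned-memory support is the problem, not the script.
try:
    buf = torch.empty(1024, dtype=torch.float16, pin_memory=True)
    print("pinned host memory OK:", buf.is_pinned())
except RuntimeError as e:
    print("pinned host memory allocation failed:", e)
```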

@YingLaiLin (Author) commented Jun 17, 2023

Training with the CPM fine-tuning script, without --use-delta and with half set to false in the config file, produces the following error:
```
Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 352, in finetune
    loss = loss_func(logits.view(-1, logits.size(-1)), targets.view(-1))
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 192, in forward
    ret = OpFusedCrossEntropy.apply(input, target.int(), self.ignore_index) # return float tensor
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 18, in forward
    ignore_index,
RuntimeError: input.dtype() == torch::kHalf INTERNAL ASSERT FAILED at "/tmp/pip-install-clhfk_l1/bmtrain_fe0a61bb02844d4b85067c24e12d4e87/csrc/cross_entropy_loss.cpp":25, please report a bug to PyTorch. input must be a half tensor
```

Also, I cannot find that file under the /tmp directory.
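
Until the fused operator accepts float32, one possible workaround sketch (the -100 ignore index and the dummy shapes below are illustrative assumptions, not the script's actual values) is to fall back to PyTorch's own CrossEntropyLoss, which accepts float32 logits:

```python
import torch

# Hypothetical workaround: bmtrain's OpFusedCrossEntropy asserts
# half-precision input, so with half=false the loss can be computed
# with PyTorch's CrossEntropyLoss, which accepts float32 logits.
loss_func = torch.nn.CrossEntropyLoss(ignore_index=-100)  # ignore index is illustrative

# Dummy stand-ins for the script's logits/targets, kept in float32.
logits = torch.randn(2, 8, 100)           # (batch, seq_len, vocab_size)
targets = torch.randint(0, 100, (2, 8))   # (batch, seq_len)

loss = loss_func(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss.item())
```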

@YingLaiLin (Author)

Could this be caused by CPU offload being enabled? Does bmtrain have a switch to control it?
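
If the pinned-memory allocation is indeed the blocker, a sketch of one way around it, assuming bmtrain's non-offload Adam variant and enough GPU memory for the optimizer state:

```python
import bmtrain as bmt

# Sketch: the failing pin_memory allocation comes from the offload
# optimizer (adam_offload.py), which keeps optimizer state in host
# memory. The non-offload variant keeps state on the GPU and avoids
# pinned host allocations entirely, at the cost of more GPU memory.
# `model` stands for the CPM Bee model built by the fine-tuning script.
optimizer = bmt.optim.AdamOptimizer(model.parameters())
# instead of:
# optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters())
```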

@gongbaitao (Collaborator)

Hello, this was because the loss_func operator previously only supported half precision; it has now been fixed.
