
CPM Bee fine-tuning: CUDA error when half is enabled, assert failure when half is disabled #79

Open
YingLaiLin opened this issue Jun 17, 2023 · 3 comments


@YingLaiLin

Training CPM with the fine-tuning script, without the --use-delta option enabled, produces the following error:
```
Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 364, in finetune
    optim_manager.step()
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/optim_manager.py", line 131, in step
    optimizer.step(scale=self.loss_scale)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/adam_offload.py", line 77, in step
    state["_grad_fp16"] = torch.empty(p.size(), dtype=torch.float16, pin_memory=True) # on host
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
```
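
The failing call allocates pinned (page-locked) host memory for the offloaded optimizer state. A minimal check, assuming nothing beyond PyTorch itself, to test whether pinned-memory allocation works in this environment at all (it can fail inside containers or VMs that restrict memory locking):

```python
import torch

# bmtrain's offload optimizer allocates gradient buffers on the host with
# pin_memory=True (see adam_offload.py in the traceback). If this snippet
# raises the same "OS call failed or operation not supported" error, the
# environment's pinned-memory support is the problem, not the script.
try:
    buf = torch.empty(1024, dtype=torch.float16, pin_memory=True)
    print("pinned host memory OK:", buf.is_pinned())
except RuntimeError as e:
    print("pinned host memory allocation failed:", e)
```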

@YingLaiLin (Author) commented Jun 17, 2023

Training with the CPM fine-tuning script, without --use-delta and with half set to false in the config file, produces the following error:
```
Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 352, in finetune
    loss = loss_func(logits.view(-1, logits.size(-1)), targets.view(-1))
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 192, in forward
    ret = OpFusedCrossEntropy.apply(input, target.int(), self.ignore_index) # return float tensor
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 18, in forward
    ignore_index,
RuntimeError: input.dtype() == torch::kHalf INTERNAL ASSERT FAILED at "/tmp/pip-install-clhfk_l1/bmtrain_fe0a61bb02844d4b85067c24e12d4e87/csrc/cross_entropy_loss.cpp":25, please report a bug to PyTorch. input must be a half tensor
```

Also, I cannot find that file under the /tmp directory.
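
Until the fused operator accepts float32, one possible workaround sketch (the -100 ignore index and the dummy shapes below are illustrative assumptions, not the script's actual values) is to fall back to PyTorch's own CrossEntropyLoss, which accepts float32 logits:

```python
import torch

# Hypothetical workaround: bmtrain's OpFusedCrossEntropy asserts
# half-precision input, so with half=false the loss can be computed
# with PyTorch's CrossEntropyLoss, which accepts float32 logits.
loss_func = torch.nn.CrossEntropyLoss(ignore_index=-100)  # ignore index is illustrative

# Dummy stand-ins for the script's logits/targets, kept in float32.
logits = torch.randn(2, 8, 100)           # (batch, seq_len, vocab_size)
targets = torch.randint(0, 100, (2, 8))   # (batch, seq_len)

loss = loss_func(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss.item())
```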

@YingLaiLin (Author)

Could this be caused by CPU offload being enabled? Does bmtrain have a switch to control it?
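
If the pinned-memory allocation is indeed the blocker, a sketch of one way around it, assuming bmtrain's non-offload Adam variant and enough GPU memory for the optimizer state:

```python
import bmtrain as bmt

# Sketch: the failing pin_memory allocation comes from the offload
# optimizer (adam_offload.py), which keeps optimizer state in host
# memory. The non-offload variant keeps state on the GPU and avoids
# pinned host allocations entirely, at the cost of more GPU memory.
# `model` stands for the CPM Bee model built by the fine-tuning script.
optimizer = bmt.optim.AdamOptimizer(model.parameters())
# instead of:
# optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters())
```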

@gongbaitao (Collaborator)

Hello, this was because the loss_func operator previously only supported half precision; it has now been fixed.
