[Bug] Training with bf16, inference with fp16 produces NaN #278

Open
Cerberous opened this issue Jul 16, 2024 · 0 comments
Labels
bug Something isn't working

Cerberous commented Jul 16, 2024

Describe the bug

Let me restate my problem: I trained with internevo in bf16, converted the checkpoint to HF format, and then hit the following error when running inference in fp16:

Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
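
For context, here is a minimal sketch of the kind of HF inference script that hits this path. The checkpoint path, prompt, and generation settings are placeholders I've filled in, not taken from the original report:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/converted-hf-checkpoint"  # hypothetical path to the converted model
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Loading in fp16 is what reproduces the error; bfloat16 and float32 both work.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).cuda().eval()

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    gen_kwargs = {"max_new_tokens": 64, "do_sample": True}  # sampling calls torch.multinomial

    # Fails with: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
    output = model.generate(**inputs, **gen_kwargs)
    print(tokenizer.decode(output[0], skip_special_tokens=True))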

The error depends on the torch_dtype used when the model is loaded: if I set torch_dtype=torch.bfloat16 or torch.float32 there is no problem, but torch.float16 triggers this error. My understanding is that training in bf16 and inferring in fp16 inherently introduces a precision mismatch: bf16 has more exponent bits than fp16, so values that are finite in bf16 can overflow fp16, for example in the attention matrix multiply, which then leads to this error. However, the official internlm code also uses torch.float16, so I'd like to ask about this.
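
A quick standalone sketch (mine, not from the report) of the range mismatch: bf16 keeps fp32's 8 exponent bits and reaches roughly 3.4e38, while fp16 has only 5 exponent bits and saturates at 65504, so a value that is finite in bf16 becomes inf in fp16, and softmax then turns the inf into nan, which is exactly what torch.multinomial rejects:

    import torch

    # bf16: 8 exponent bits, range ~3.4e38; fp16: 5 exponent bits, max finite value 65504.
    x = torch.tensor(70000.0, dtype=torch.bfloat16)
    print(x.to(torch.float16))   # tensor(inf, dtype=torch.float16): overflows fp16
    print(x.to(torch.float32))   # tensor(70144.): coarse bf16 mantissa, but finite

    # An inf logit makes softmax produce nan (inf - inf in the max-subtraction step),
    # and torch.multinomial then raises the RuntimeError shown above.
    logits = torch.tensor([float("inf"), 1.0])
    probs = torch.softmax(logits, dim=-1)
    print(probs)                 # tensor([nan, nan])

Given that, the safe default for a bf16-trained checkpoint would presumably be to load it with torch_dtype=torch.bfloat16 (or torch.float32), which matches my observation above that those dtypes do not fail.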

Environment

The official image.

Other information

No response

Cerberous added the bug label on Jul 16, 2024