[Bug] Training with bf16, inference with fp16 produces NaN #278

Open
Cerberous opened this issue Jul 16, 2024 · 0 comments
Labels
bug Something isn't working

Cerberous commented Jul 16, 2024

Describe the bug

Let me restate my problem: I trained with internevo in bf16, converted the checkpoint to HF format, and then hit the following error when running inference in fp16:

Traceback (most recent call last):
  File "/InternLM/hf_test.py", line 15, in <module>
    output = model.generate(**inputs, **gen_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
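
For context, here is a minimal sketch of the kind of HF inference script that hits this path. The checkpoint path, prompt, and generation settings are placeholders I've filled in, not taken from the original report:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/converted-hf-checkpoint"  # hypothetical path to the converted model
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Loading in fp16 is what reproduces the error; bfloat16 and float32 both work.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        trust_remote_code=True,
    ).cuda().eval()

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    gen_kwargs = {"max_new_tokens": 64, "do_sample": True}  # sampling calls torch.multinomial

    # Fails with: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
    output = model.generate(**inputs, **gen_kwargs)
    print(tokenizer.decode(output[0], skip_special_tokens=True))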

The error depends on the torch_dtype used when the model is loaded: if I set torch_dtype=torch.bfloat16 or torch.float32 there is no problem, but torch.float16 triggers this error. My understanding is that training in bf16 and inferring in fp16 inherently introduces a precision mismatch: bf16 has more exponent bits than fp16, so values that are finite in bf16 can overflow fp16, for example in the attention matrix multiply, which then leads to this error. However, the official internlm code also uses torch.float16, so I'd like to ask about this.
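
A quick standalone sketch (mine, not from the report) of the range mismatch: bf16 keeps fp32's 8 exponent bits and reaches roughly 3.4e38, while fp16 has only 5 exponent bits and saturates at 65504, so a value that is finite in bf16 becomes inf in fp16, and softmax then turns the inf into nan, which is exactly what torch.multinomial rejects:

    import torch

    # bf16: 8 exponent bits, range ~3.4e38; fp16: 5 exponent bits, max finite value 65504.
    x = torch.tensor(70000.0, dtype=torch.bfloat16)
    print(x.to(torch.float16))   # tensor(inf, dtype=torch.float16): overflows fp16
    print(x.to(torch.float32))   # tensor(70144.): coarse bf16 mantissa, but finite

    # An inf logit makes softmax produce nan (inf - inf in the max-subtraction step),
    # and torch.multinomial then raises the RuntimeError shown above.
    logits = torch.tensor([float("inf"), 1.0])
    probs = torch.softmax(logits, dim=-1)
    print(probs)                 # tensor([nan, nan])

Given that, the safe default for a bf16-trained checkpoint would presumably be to load it with torch_dtype=torch.bfloat16 (or torch.float32), which matches my observation above that those dtypes do not fail.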

Environment

The official image.

Other information

No response

Cerberous added the bug label on Jul 16, 2024