loss=nan #5
Comments
We've recently been informed of the same problem. In most cases, it is caused by training with AMP in FP16. Could you please change the following line
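For reference, a minimal sketch of what turning off mixed precision could look like. The specific line referred to above is not shown in this thread, and this assumes the training script constructs the Hugging Face Accelerate `Accelerator` directly:

```python
# Hypothetical sketch: disable FP16/AMP by forcing full precision when the
# Accelerator is created. The actual line in the repository may differ.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="no")  # instead of mixed_precision="fp16"
```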
Thanks for your reply, but I didn't use Accelerate or the config.yaml file. Do I have to use it?
Yes, it is recommended to use Hugging Face Accelerate with our code. If you do not want to use it, please take a look at the current settings of Hugging Face Accelerate by following this tutorial: https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config. Please make sure that FP16 is not being used for your training.
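A minimal sketch of how one might confirm the precision that Accelerate will actually use (for example, a cached `accelerate config` can silently enable FP16); this is an illustration, not code from the repository:

```python
# Hypothetical sketch: print the effective mixed-precision setting.
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.mixed_precision)  # should print "no" for full-precision training
print(accelerator.state)            # summary of the current device/precision setup
```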
I used it, but there is an error, even though I have already installed torchvision.
I am not sure about this, but a clean installation of the Anaconda environment might help.
Is it related to this problem?
I meant the problem with
It looks fine to me. Could you please identify which loss function causes the error?
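A minimal sketch of one way to narrow this down, checking each loss term separately for NaN; `loss_terms` and the names inside it are placeholders, not identifiers from the repository:

```python
# Hypothetical sketch: report which named loss term first becomes NaN.
import torch

def report_nan_losses(loss_terms: dict, step: int) -> None:
    for name, value in loss_terms.items():
        if torch.isnan(value).any():
            print(f"step {step}: loss term '{name}' is NaN")

# usage inside the training loop (names are placeholders):
# report_nan_losses({"recon": recon_loss, "kl": kl_loss}, step)
```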
Sorry about this inconvenience. To be honest, we have no clue why this happens. Could you please follow this post to obtain an error traceback? Thank you.
Have you solved the problem? Based on recent comments we have received, the loss can become NaN when FP16 is used, so could you please double-check the datatype again by following this post? Thank you.
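A minimal sketch of such a datatype check, assuming `model` and `batch` stand in for your own module and input tensor:

```python
# Hypothetical sketch: print dtypes of inputs, parameters, and outputs to
# confirm nothing is silently cast to float16.
import torch

def check_dtypes(model: torch.nn.Module, batch: torch.Tensor) -> None:
    print("input dtype: ", batch.dtype)                     # expect torch.float32
    print("param dtype: ", next(model.parameters()).dtype)  # expect torch.float32
    with torch.no_grad():
        output = model(batch)
    print("output dtype:", output.dtype)                    # expect torch.float32
```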
I printed the types and all are float32. Unfortunately, I have not solved this problem.
Thanks for your excellent work, but it seems that I have run into a problem: I don't know why the loss is NaN. Although I added normalization, it did not help. Sincere thanks for your help!
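For context, a minimal sketch of the kind of input normalization mentioned above, using torchvision; the mean/std values here are placeholders, not the dataset's actual statistics:

```python
# Hypothetical sketch: normalize image inputs before they reach the model.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```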