CUDA Device does not support bfloat16 #15

Open
xlzhou01 opened this issue Oct 18, 2024 · 3 comments

Comments

@xlzhou01

File "/home/.conda/envs/spmba/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in init
raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

@JusperLee
Owner

It looks like you're encountering a compatibility issue with bfloat16 on your current CUDA device. As the error suggests, switching the data type to float16 should resolve this problem. If you're using Mamba, make sure your environment is set up to support the necessary CUDA features. If you have further questions or need assistance, feel free to reach out!
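For anyone hitting the same error, a quick way to check what the GPU supports and fall back automatically looks like this (a minimal sketch, not part of the SPMamba codebase; `pick_autocast_dtype` is a placeholder name):

```python
# Sketch only: pick an autocast dtype that the current GPU actually supports.
import torch

def pick_autocast_dtype() -> torch.dtype:
    # bfloat16 autocast needs an Ampere-class GPU (compute capability >= 8.0);
    # older cards (Volta/Turing) raise the RuntimeError shown above.
    if torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

if torch.cuda.is_available():
    dtype = pick_autocast_dtype()
    with torch.autocast(device_type="cuda", dtype=dtype):
        ...  # forward pass / loss computation goes here
```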

@xlzhou01
Author

xlzhou01 commented Oct 19, 2024

I set it to float16 (precision="16-mixed"), and then the following error occurred:

Monitored metric val_loss/dataloader_idx_0 = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
Epoch 0, global step 13900: 'val_loss/dataloader_idx_0' reached inf (best inf), saving model to '/data/SPMamba/Experiments/checkpoint/SPMamba-Libri2Mix/epoch=0.ckpt' as top 5

I am using the noisy Libri2Mix sub-dataset. I tried again and got the same result:

[screenshot of the training log]

I also tried running the clean sub-dataset of Libri2Mix later, and I encountered the same issue. Could it be related to the modified precision?
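For reference, the precision switch amounts to something like this in the training script (a minimal sketch; `model` and `datamodule` stand in for the repo's actual classes, and `detect_anomaly` is only added here to help localize where the NaN first appears):

```python
# Sketch of a Lightning Trainer running in float16 mixed precision.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="16-mixed",   # float16 autocast + GradScaler
    detect_anomaly=True,    # surfaces the first op that produces NaN/Inf
)
# trainer.fit(model, datamodule=datamodule)
```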

@JusperLee
Owner

You might need to adjust the value of 'eps' used in the paper to match the precision you are working with. When using float16 (precision='16-mixed'), the limited numerical range can sometimes lead to instability, such as NaNs or Infs in the loss. Consider increasing 'eps' slightly to maintain numerical stability during training.
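As an illustration of the suggestion (a minimal sketch; the function names are placeholders, not the actual loss code in this repo):

```python
# Pick eps from the dtype actually used in the loss computation
# instead of hard-coding a float32-sized constant.
import torch

def stable_eps(dtype: torch.dtype) -> float:
    # float16 has a much coarser machine epsilon (~9.8e-4) than
    # float32 (~1.2e-7), so a fixed 1e-8 underflows to 0 in half precision.
    return torch.finfo(dtype).eps

# Example: an SI-SNR-style denominator guarded with a dtype-aware eps.
def safe_l2_norm(x: torch.Tensor) -> torch.Tensor:
    eps = stable_eps(x.dtype)
    return torch.sqrt(torch.sum(x ** 2, dim=-1) + eps)
```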
