
Training loss NaN #38

Open
FelixMessi opened this issue Sep 11, 2024 · 1 comment

@FelixMessi

Hi, thanks for the amazing work!

I've encountered an issue with NaN losses when using AMP in MambaVision, particularly when I reduce the training epochs to 30. The problem seems to stem from the selective_scan_fn function. I've tried switching to float32 for training, which resolves the NaN issue, but this approach is more resource-intensive compared to using AMP. Could anyone suggest more flexible solutions?
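For reference, one direction I'm exploring is to keep AMP for the rest of the model and force only the scan into float32. A rough sketch below; the import path and argument list follow the mamba-ssm package as I understand it, so they may need adjusting for the exact selective_scan_fn version used here:

```python
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn  # path may differ per version

def selective_scan_fp32(u, delta, A, B, C, D=None, z=None,
                        delta_bias=None, delta_softplus=False):
    """Run the selective scan in float32 even when the surrounding
    forward pass runs under autocast (fp16/bf16), then cast back."""
    with torch.autocast(device_type="cuda", enabled=False):
        out = selective_scan_fn(
            u.float(), delta.float(), A.float(), B.float(), C.float(),
            D=D.float() if D is not None else None,
            z=z.float() if z is not None else None,
            delta_bias=delta_bias.float() if delta_bias is not None else None,
            delta_softplus=delta_softplus,
        )
    return out.to(u.dtype)  # keep the rest of the model in the AMP dtype
```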

@ahatamiz
Collaborator

Hi @FelixMessi

It's hard to pinpoint the exact issue knowing only that the total number of epochs has been reduced. However, my best bet would be to decrease the learning rate while still keeping AMP.
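For what it's worth, a minimal sketch of that direction: keep autocast and GradScaler, lower the learning rate, and clip gradients before the optimizer step. The helper name and hyperparameter values below are placeholders rather than the repo's actual training code:

```python
import torch

def train_one_epoch_amp(model, loader, lr=2.5e-4, device="cuda"):
    """Hypothetical helper: AMP training with a reduced learning rate
    and gradient clipping, one way to keep AMP while avoiding NaNs."""
    model.to(device).train()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    scaler = torch.cuda.amp.GradScaler()

    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)                      # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)                          # skipped if grads are inf/NaN
        scaler.update()
```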
