ValueError: "non finite loss" while running make_kaggle_solution.sh #9
If you reduce the learning rate a bit, say to:
Yes, still the same error. Tried with
Also, while running the previous
That is expected, as some images are almost totally black; it just falls back to cropping the center square in those cases. During the competition I noticed that the computations sometimes slowed down, especially when switching between different versions of theano, and I would sometimes have to clear the theano cache and/or reboot my computer. So I would try cleaning the theano cache first (a rough sketch of the command is below). If the problem persists, could you also post the output of
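A minimal sketch of clearing the cache, assuming Theano's bundled theano-cache script is on the PATH:

```bash
# Clear Theano's compilation cache; `purge` removes everything, including
# modules that may still be in use, so stop running Theano processes first.
theano-cache clear
# or, more aggressively:
theano-cache purge
```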
Could this warning be of consequence? Perhaps we are getting infinite loss due to improper parameter initialization.
That should be fine. I get these warnings too, but we are using orthogonal initialization.
Ok. Output for
Output for
Any other changes I can try?
I don't have any good ideas right now. What version of cuDNN are you using? You could try this theano commit instead.
I think it's the one I was using when I was working on the project. Probably best to delete the theano cache again before retrying with another theano version.
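As a rough sketch (not the exact steps from the thread) of checking the cuDNN version and retrying with a pinned Theano commit; <commit-sha> is a placeholder for the commit linked above:

```bash
# Show any cuDNN version defines the CUDA toolkit headers expose
# (the header path may differ on your system).
grep -i "define cudnn" /usr/local/cuda/include/cudnn.h | head

# Reinstall Theano pinned to a specific commit; replace <commit-sha> with the
# commit referenced above, then clear the compilation cache before retraining.
pip uninstall -y theano
pip install git+https://github.com/Theano/Theano.git@<commit-sha>
theano-cache purge
```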
An AWS G2 instance, which has a GRID K520 GPU, with CUDA 7.0 and cuDNN v3.0. Nope, the problem still persists.
You could insert print(batch_train_loss[0]) right before this line https://github.com/sveitser/kaggle_diabetic/blob/master/nn.py#L248 to check whether the non-finite loss already occurs on the initial batch or whether it is first finite and then diverges.
Yep, I had tried that in the beginning. It's "nan" for the first batch itself.
Have you tried using any other configurations?
Yes, I did a fresh install, preprocessed the images again and then ran train_nn.py for all the given config files - I get "non finite loss" in the very first epoch. I also tried using a batch size of 1; even then the loss is "nan". Something seems to be fundamentally wrong. Are there any unit tests for checking lasagne or nolearn? I'm more of a caffe person.
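For concreteness, a rough sketch of the reproduction loop described above; the preprocessing invocation is intentionally left as a placeholder, so check the repository README for the exact convert.py arguments:

```bash
# Preprocessing step first (flags omitted here; see the repository README for
# the exact convert.py invocation).
# python convert.py <args>

# Train with every provided config file; each run fails with "non finite loss"
# in the very first epoch, even with a batch size of 1.
for cnf in configs/*.py; do
    python train_nn.py --cnf "$cnf"
done
```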
Yes, both have tests, and theano does as well (a rough sketch of typical test invocations follows after the links below). For theano (assuming you installed theano with pip previously),
For lasagne,
For nolearn,
For more info, https://lasagne.readthedocs.org/en/latest/user/development.html#how-to-contribute and
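The exact test commands were lost in extraction; as a rough sketch, assuming nose and pytest are installed and the lasagne and nolearn source trees are checked out locally, the suites can typically be run like this:

```bash
# Theano ships its test suite with the pip package; this runs it via nose
# (the full suite is very slow).
python -c "import theano; theano.test()"

# Lasagne and nolearn use pytest; run it from a checkout of each source tree.
cd /path/to/Lasagne && py.test
cd /path/to/nolearn && py.test
```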
@chintak Just out of curiosity: did you manage to get things to work or find out what is going wrong?
Nope. In a few days I'll try to test it on another system.
While running python train_nn.py --cnf configs/c_128_5x5_32.py, I got the ValueError. The full error log is attached below. Even after installing lasagne and nolearn at the given commit ids, I'm still getting the deprecation warnings. Could this error be related to it?

Error log