
ValueError: "non finite loss" while running make_kaggle_solution.sh #9

chintak opened this issue Jan 13, 2016 · 17 comments

@chintak

chintak commented Jan 13, 2016

While running python train_nn.py --cnf configs/c_128_5x5_32.py, I got the ValueError above. The full error log is attached below. Even after installing lasagne and nolearn at the given commit IDs, I'm still getting deprecation warnings. Could the error be related to them?

Error log

@sveitser
Owner

If you reduce the learning rate a bit, say to 'schedule': {0: 0.002, 150: 0.0002, 201: 'stop'}, do you still get a non-finite loss?
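
For reference, the lowered schedule would sit in the config's cnf dict roughly like this (only the schedule entry is shown; the surrounding keys are assumptions and omitted here):

cnf = {
    # ... other entries from configs/c_128_5x5_32.py ...
    'schedule': {
        0: 0.002,      # lowered initial learning rate
        150: 0.0002,   # decay at epoch 150
        201: 'stop',   # stop training at epoch 201
    },
}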

@chintak
Author

chintak commented Jan 13, 2016

Yes, still the same error. I also tried 'schedule': {0: 0.0005, 150: 0.00005, 201: 'stop'}.

@chintak
Author

chintak commented Jan 13, 2016

Also, while running the earlier convert.py commands, quite a few files produced "box could not be found" or "box too small" messages. What output image is written in those cases? I'm wondering whether a corrupt input image is causing this. Alternatively, are there any other intermediate values I can print out to debug?
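
For example, I could scan the converted images for obviously broken files with something like this (the output directory and file extension are guesses on my part, not taken from convert.py):

import glob
import numpy as np
from PIL import Image

# flag images with non-finite or all-black pixel data
for path in sorted(glob.glob('data/train_res/*.tiff')):  # hypothetical output dir/extension
    arr = np.asarray(Image.open(path), dtype=np.float32)
    if not np.isfinite(arr).all() or arr.max() == 0:
        print(path, 'looks suspicious')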

@sveitser
Owner

That is expected as some images are almost totally black. It just falls back to cropping the center square in those cases.
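
Conceptually the fallback is just this (a minimal sketch of the idea, not the actual convert.py code):

import numpy as np

def center_square_crop(img):
    # crop the largest centered square from an H x W x C array
    h, w = img.shape[:2]
    s = min(h, w)
    y0 = (h - s) // 2
    x0 = (w - s) // 2
    return img[y0:y0 + s, x0:x0 + s]

crop = center_square_crop(np.zeros((480, 640, 3)))  # -> 480 x 480 crop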

During the competition I noticed that computations sometimes slowed down, especially when switching between different theano versions, and I would occasionally have to clear the theano cache and/or reboot my machine. So I would try cleaning the theano cache by running theano-cache clear, or rm -r ~/.theano, and then try again.

If the problem persists, could you also post the output of pip list and pip freeze here?

@chintak
Author

chintak commented Jan 14, 2016

/home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master/lasagne/init.py:86: 
UserWarning: The uniform initializer no longer uses Glorot et al.'s approach
to determine the bounds, but defaults to the range (-0.01, 0.01) instead. 
Please use the new GlorotUniform initializer to get the old behavior. 
GlorotUniform is now the default for all layers.

Could this warning be of consequence? Perhaps the non-finite loss is due to improper parameter initialization.

@sveitser
Owner

That should be fine. I get these warnings too, but we are using orthogonal initialization anyway:
https://github.com/sveitser/kaggle_diabetic/blob/master/layers.py#L36-L37
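
Roughly, the relevant layers are built like this (a simplified sketch with illustrative sizes, not the exact code from layers.py):

from lasagne.init import Orthogonal
from lasagne.layers import InputLayer, Conv2DLayer

# the orthogonal weight init is passed explicitly, so the uniform/Glorot default
# (and hence the warning above) does not apply to these layers
l_in = InputLayer((None, 3, 112, 112))  # batch, channels, height, width (sizes illustrative)
l_conv = Conv2DLayer(l_in, num_filters=32, filter_size=(5, 5),
                     W=Orthogonal(gain='relu'))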

@chintak
Author

chintak commented Jan 14, 2016

Ok.

Output for pip list:

click (3.3)
decorator (4.0.6)
funcsigs (0.4)
ghalton (0.6)
joblib (0.9.3)
Lasagne (0.1.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master)
matplotlib (1.4.3)
mock (1.3.0)
networkx (1.10)
nolearn (0.6a0.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/nolearn-master)
nose (1.3.7)
numpy (1.9.2)
pandas (0.16.0)
pbr (1.8.1)
Pillow (2.7.0)
pip (7.1.2)
pyparsing (2.0.7)
python-dateutil (2.4.2)
pytz (2015.7)
PyYAML (3.11)
scikit-image (0.11.3)
scikit-learn (0.16.1)
scipy (0.15.1)
setuptools (18.2)
SharedArray (0.3)
six (1.10.0)
tabulate (0.7.5)
Theano (0.7.0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/theano)
wheel (0.24.0)

Output for pip freeze:

click==3.3
decorator==4.0.6
funcsigs==0.4
ghalton==0.6
joblib==0.9.3
-e git+https://github.com/benanne/Lasagne.git@9f591a5f3a192028df9947ba1e4903b3b46e8fe0#egg=Lasagne-dev
matplotlib==1.4.3
mock==1.3.0
networkx==1.10
-e git+https://github.com/dnouri/nolearn.git@0a225bc5ad60c76cdc6cccbe866f9b2e39502d10#egg=nolearn-dev
nose==1.3.7
numpy==1.9.2
pandas==0.16.0
pbr==1.8.1
Pillow==2.7.0
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
scikit-image==0.11.3
scikit-learn==0.16.1
scipy==0.15.1
SharedArray==0.3
six==1.10.0
tabulate==0.7.5
-e git+https://github.com/Theano/Theano.git@71a3700fcefd8589728b2b91931debad14c38a3f#egg=Theano-master
wheel==0.24.0

@chintak
Author

chintak commented Jan 14, 2016

Any other changes I can try?

@sveitser
Owner

I don't have any good ideas right now. What version of cuDNN are you using?

You could try this theano commit instead.

pip install --upgrade -e git+https://github.com/Theano/Theano.git@dfb2730348d05f6aadd116ce492e836a4c0ba6d6#egg=Theano-master

I think it's the one I was using when I was working on the project. Probably best to delete the theano cache again before retrying with another theano version.
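
To double-check that cuDNN is actually being picked up, something along these lines should work (assuming the old theano.sandbox.cuda backend that Theano 0.7 uses; run with device=gpu in your THEANO_FLAGS):

from theano.sandbox.cuda import dnn

# True only if theano can locate and load cuDNN for the active GPU
print(dnn.dnn_available())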

@chintak
Author

chintak commented Jan 14, 2016

I'm on an AWS G2 instance with a GRID K520 GPU, CUDA 7.0 and cuDNN v3.0. Nope, the problem still persists with the other theano commit.

@sveitser
Owner

You could insert

print(batch_train_loss[0])

right before this line https://github.com/sveitser/kaggle_diabetic/blob/master/nn.py#L248 to check whether the loss is already non-finite on the very first batch, or whether it starts out finite and then diverges.
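
That is, something equivalent to this check (a standalone sketch of the idea, not the actual code in nn.py):

import numpy as np

def check_batch_loss(batch_train_loss):
    # print the per-batch loss and fail loudly if it is not finite
    print(batch_train_loss[0])
    if not np.isfinite(batch_train_loss[0]):
        raise ValueError("non finite loss")

check_batch_loss(np.array([0.7]))      # fine: prints 0.7
check_batch_loss(np.array([np.nan]))   # raises ValueError right away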

@chintak
Author

chintak commented Jan 14, 2016

Yep, I already tried that early on. It's "nan" from the very first batch.

@sveitser
Owner

Have you tried any of the other configurations?

@chintak
Author

chintak commented Jan 16, 2016

Yes. I did a fresh install, preprocessed the images again, and ran train_nn.py with all of the given config files; I get "non finite loss" in the very first epoch every time. I also tried a batch size of 1, and even then the loss is "nan". Something seems fundamentally wrong. Are there any unit tests for checking lasagne or nolearn? I'm more of a caffe person.

@sveitser
Owner

Yes, both have tests, and theano does as well.

For theano (assuming you installed theano with pip previously)

git clone https://github.com/Theano/Theano
cd Theano
theano-nose

For lasagne,

git clone https://github.com/Lasagne/Lasagne
cd Lasagne
pip install -r requirements-dev.txt # comment out the first line to avoid installing another theano commit
py.test

For nolearn,

git clone https://github.com/dnouri/nolearn
cd nolearn
py.test

For more info, see https://lasagne.readthedocs.org/en/latest/user/development.html#how-to-contribute and
http://deeplearning.net/software/theano/extending/unittest.html .

@sveitser
Owner

@chintak Just out of curiosity, did you manage to get things working or find out what was going wrong?

@chintak
Author

chintak commented Jan 21, 2016

Nope. I'll try testing it on another system in a few days.
