
ValueError: "non finite loss" while running make_kaggle_solution.sh #9

chintak opened this issue Jan 13, 2016 · 17 comments

@chintak

chintak commented Jan 13, 2016

While running python train_nn.py --cnf configs/c_128_5x5_32.py, I got the ValueError above. The full error log is attached below. Even after installing lasagne and nolearn at the given commit IDs, I'm still getting deprecation warnings. Could the error be related to them?

Error log

@sveitser
Owner

If you reduce the learning rate a bit, say to 'schedule': {0: 0.002, 150: 0.0002, 201: 'stop'}, do you still get a non-finite loss?
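
For reference, the lowered schedule would sit in the config's cnf dict roughly like this (only the schedule entry is shown; the surrounding keys are assumptions and omitted here):

cnf = {
    # ... other entries from configs/c_128_5x5_32.py ...
    'schedule': {
        0: 0.002,      # lowered initial learning rate
        150: 0.0002,   # decay at epoch 150
        201: 'stop',   # stop training at epoch 201
    },
}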

@chintak
Author

chintak commented Jan 13, 2016

Yes, still the same error. I also tried 'schedule': {0: 0.0005, 150: 0.00005, 201: 'stop'}.

@chintak
Author

chintak commented Jan 13, 2016

Also, while running the earlier convert.py commands, quite a few files produced "box could not be found" or "box too small" messages. What output image is written in those cases? I'm wondering whether a corrupt input image is causing this. Alternatively, are there any other intermediate values I can print out to debug?
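
For example, I could scan the converted images for obviously broken files with something like this (the output directory and file extension are guesses on my part, not taken from convert.py):

import glob
import numpy as np
from PIL import Image

# flag images with non-finite or all-black pixel data
for path in sorted(glob.glob('data/train_res/*.tiff')):  # hypothetical output dir/extension
    arr = np.asarray(Image.open(path), dtype=np.float32)
    if not np.isfinite(arr).all() or arr.max() == 0:
        print(path, 'looks suspicious')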

@sveitser
Owner

That is expected as some images are almost totally black. It just falls back to cropping the center square in those cases.
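
Conceptually the fallback is just this (a minimal sketch of the idea, not the actual convert.py code):

import numpy as np

def center_square_crop(img):
    # crop the largest centered square from an H x W x C array
    h, w = img.shape[:2]
    s = min(h, w)
    y0 = (h - s) // 2
    x0 = (w - s) // 2
    return img[y0:y0 + s, x0:x0 + s]

crop = center_square_crop(np.zeros((480, 640, 3)))  # -> 480 x 480 crop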

During the competition I noticed that computations sometimes slowed down, especially when switching between different theano versions, and I would occasionally have to clear the theano cache and/or reboot my machine. So I would try cleaning the theano cache by running theano-cache clear, or rm -r ~/.theano, and then try again.

If the problem persists, could you also post the output of pip list and pip freeze here?

@chintak
Author

chintak commented Jan 14, 2016

/home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master/lasagne/init.py:86: 
UserWarning: The uniform initializer no longer uses Glorot et al.'s approach
to determine the bounds, but defaults to the range (-0.01, 0.01) instead. 
Please use the new GlorotUniform initializer to get the old behavior. 
GlorotUniform is now the default for all layers.

Could this warning be of consequence? Perhaps the non-finite loss is due to improper parameter initialization.

@sveitser
Owner

That should be fine. I get these warnings too, but we are using orthogonal initialization anyway:
https://github.com/sveitser/kaggle_diabetic/blob/master/layers.py#L36-L37
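
Roughly, the relevant layers are built like this (a simplified sketch with illustrative sizes, not the exact code from layers.py):

from lasagne.init import Orthogonal
from lasagne.layers import InputLayer, Conv2DLayer

# the orthogonal weight init is passed explicitly, so the uniform/Glorot default
# (and hence the warning above) does not apply to these layers
l_in = InputLayer((None, 3, 112, 112))  # batch, channels, height, width (sizes illustrative)
l_conv = Conv2DLayer(l_in, num_filters=32, filter_size=(5, 5),
                     W=Orthogonal(gain='relu'))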

@chintak
Author

chintak commented Jan 14, 2016

Ok.

Output for pip list:

click (3.3)
decorator (4.0.6)
funcsigs (0.4)
ghalton (0.6)
joblib (0.9.3)
Lasagne (0.1.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master)
matplotlib (1.4.3)
mock (1.3.0)
networkx (1.10)
nolearn (0.6a0.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/nolearn-master)
nose (1.3.7)
numpy (1.9.2)
pandas (0.16.0)
pbr (1.8.1)
Pillow (2.7.0)
pip (7.1.2)
pyparsing (2.0.7)
python-dateutil (2.4.2)
pytz (2015.7)
PyYAML (3.11)
scikit-image (0.11.3)
scikit-learn (0.16.1)
scipy (0.15.1)
setuptools (18.2)
SharedArray (0.3)
six (1.10.0)
tabulate (0.7.5)
Theano (0.7.0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/theano)
wheel (0.24.0)

Output for pip freeze:

click==3.3
decorator==4.0.6
funcsigs==0.4
ghalton==0.6
joblib==0.9.3
-e git+https://github.com/benanne/Lasagne.git@9f591a5f3a192028df9947ba1e4903b3b46e8fe0#egg=Lasagne-dev
matplotlib==1.4.3
mock==1.3.0
networkx==1.10
-e git+https://github.com/dnouri/nolearn.git@0a225bc5ad60c76cdc6cccbe866f9b2e39502d10#egg=nolearn-dev
nose==1.3.7
numpy==1.9.2
pandas==0.16.0
pbr==1.8.1
Pillow==2.7.0
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
scikit-image==0.11.3
scikit-learn==0.16.1
scipy==0.15.1
SharedArray==0.3
six==1.10.0
tabulate==0.7.5
-e git+https://github.com/Theano/Theano.git@71a3700fcefd8589728b2b91931debad14c38a3f#egg=Theano-master
wheel==0.24.0

@chintak
Author

chintak commented Jan 14, 2016

Any other changes I can try?

@sveitser
Owner

I don't have any good ideas right now. What version of cuDNN are you using?

You could try this theano commit instead.

pip install --upgrade -e git+https://github.com/Theano/Theano.git@dfb2730348d05f6aadd116ce492e836a4c0ba6d6#egg=Theano-master

I think it's the one I was using when I was working on the project. Probably best to delete the theano cache again before retrying with another theano version.
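
To double-check that cuDNN is actually being picked up, something along these lines should work (assuming the old theano.sandbox.cuda backend that Theano 0.7 uses; run with device=gpu in your THEANO_FLAGS):

from theano.sandbox.cuda import dnn

# True only if theano can locate and load cuDNN for the active GPU
print(dnn.dnn_available())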

@chintak
Author

chintak commented Jan 14, 2016

I'm on an AWS G2 instance with a GRID K520 GPU, CUDA 7.0 and cuDNN v3.0. Nope, the problem still persists with the other theano commit.

@sveitser
Owner

You could insert

print(batch_train_loss[0])

right before this line https://github.com/sveitser/kaggle_diabetic/blob/master/nn.py#L248 to check whether the loss is already non-finite on the very first batch, or whether it starts out finite and then diverges.
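
That is, something equivalent to this check (a standalone sketch of the idea, not the actual code in nn.py):

import numpy as np

def check_batch_loss(batch_train_loss):
    # print the per-batch loss and fail loudly if it is not finite
    print(batch_train_loss[0])
    if not np.isfinite(batch_train_loss[0]):
        raise ValueError("non finite loss")

check_batch_loss(np.array([0.7]))      # fine: prints 0.7
check_batch_loss(np.array([np.nan]))   # raises ValueError right away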

@chintak
Author

chintak commented Jan 14, 2016

Yep, I already tried that early on. It's "nan" from the very first batch.

@sveitser
Owner

Have you tried any of the other configurations?

@chintak
Author

chintak commented Jan 16, 2016

Yes. I did a fresh install, preprocessed the images again, and ran train_nn.py with all of the given config files; I get "non finite loss" in the very first epoch every time. I also tried a batch size of 1, and even then the loss is "nan". Something seems fundamentally wrong. Are there any unit tests for checking lasagne or nolearn? I'm more of a caffe person.

@sveitser
Owner

Yes, both have tests, and theano does as well.

For theano (assuming you installed theano with pip previously)

git clone https://github.com/Theano/Theano
cd Theano
theano-nose

For lasagne,

git clone https://github.com/Lasagne/Lasagne
cd Lasagne
pip install -r requirements-dev.txt # comment out the first line to avoid installing another theano commit
py.test

For nolearn,

git clone https://github.com/dnouri/nolearn
cd nolearn
py.test

For more info, see https://lasagne.readthedocs.org/en/latest/user/development.html#how-to-contribute and
http://deeplearning.net/software/theano/extending/unittest.html .

@sveitser
Owner

@chintak Just out of curiosity, did you manage to get things working or find out what was going wrong?

@chintak
Author

chintak commented Jan 21, 2016

Nope. I'll try testing it on another system in a few days.
