Part 2 Chapter 12 issue #84

Open
jsgrover opened this issue Dec 29, 2021 · 1 comment

Comments

@jsgrover

When I run p2ch12.training --epochs 10 on CentOS 8 with an NVIDIA RTX card, I get the following error from TensorFlow; see the last two lines of the output below.

[***********] dlwpt-code]$ python3.9 -m p2ch12.training --epochs 10
2021-12-28 17:20:18,410 INFO pid:970515 main:127:initModel Using CUDA; 1 devices.
2021-12-28 17:20:21,214 INFO pid:970515 main:188:main Starting LunaTrainingApp, Namespace(batch_size=32, num_workers=8, epochs=10, balanced=False, augmented=False, augment_flip=False, augment_offset=False, augment_scale=False, augment_rotate=False, augment_noise=False, tb_prefix='p2ch12', comment='dlwpt')
2021-12-28 17:20:23,860 INFO pid:970515 p2ch12.dsets:266:init <p2ch12.dsets.LunaDataset object at 0x7f3f44b4eee0>: 51244 training samples, 51135 neg, 109 pos, unbalanced ratio
2021-12-28 17:20:23,864 INFO pid:970515 p2ch12.dsets:266:init <p2ch12.dsets.LunaDataset object at 0x7f3f44b4ef40>: 5694 validation samples, 5681 neg, 13 pos, unbalanced ratio
2021-12-28 17:20:23,865 INFO pid:970515 main:195:main Epoch 1 of 10, 1602/178 batches of size 32*1
2021-12-28 17:20:23,865 WARNING pid:970515 util.util:219:enumerateWithEstimate E1 Training ----/1602, starting
2021-12-28 17:22:10,310 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 64/1602, done at 2021-12-28 18:03:39, 0:43:01
2021-12-28 17:26:46,141 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 256/1602, done at 2021-12-28 17:59:54, 0:39:16
2021-12-28 17:45:02,250 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 1024/1602, done at 2021-12-28 17:58:53, 0:38:15

2021-12-28 17:58:23,593 WARNING pid:970515 util.util:249:enumerateWithEstimate E1 Training ----/1602, done at 2021-12-28 17:58:23
2021-12-28 17:58:24.056498: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-28 17:58:24.056539: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

@jsgrover changed the title from "Part 2 Chapter 11,12 issue" to "Part 2 Chapter 12 issue" on Dec 29, 2021
@sfleisch

I'm not exactly sure how you get a TensorFlow error with PyTorch, but I'm guessing it comes from TensorBoard.
As for libcudart.so.11.0 not being found, check your CUDA version. If you are running in a container, CUDA would be under /usr/local: you should see /usr/local/cuda-11.0 and a symbolic link to it from /usr/local/cuda. On bare metal, the location depends on how you installed CUDA. Often, if you install a framework (well, PyTorch isn't a framework, it's a library...) from a pre-built binary, you can get conflicts when your system doesn't have the CUDA version the binary was built with.
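A quick sanity check along those lines (a minimal diagnostic sketch, not part of the book's code) is to ask PyTorch which CUDA runtime its binary was built against and whether the loader can find a libcudart at all:

```python
# Diagnostic sketch: report the CUDA runtime the installed PyTorch wheel was
# built against, and whether the dynamic loader can locate any libcudart.
import torch
from ctypes.util import find_library

print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)        # e.g. '11.3' for a cu113 wheel
print("CUDA available:     ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:             ", torch.cuda.get_device_name(0))

# find_library consults ldconfig, so None means no libcudart is on the loader path,
# which is what TensorFlow (pulled in via TensorBoard) is complaining about.
print("libcudart found:    ", find_library("cudart"))
```

If torch.cuda.is_available() returns True, PyTorch itself can use the GPU regardless of that message, since the pre-built PyTorch binaries typically bundle their own CUDA runtime; the warning only reflects TensorFlow's attempt to load libcudart.so.11.0 for TensorBoard.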
