CUDNN_STATUS_INTERNAL_ERROR during backward #10

Closed
baibaidj opened this issue Jun 26, 2021 · 6 comments
Labels
Needs Additional Info (additional information is needed to reproduce/investigate the issue), question (default label)

Comments

@baibaidj

baibaidj commented Jun 26, 2021

Hi. This is an awesome repo.
I tried to run training on RibFrac using nnDet and ran into the following issue:
The training ran smoothly for 6 epochs, then suddenly broke and a cuDNN error was thrown.
I rebooted the machine and tried to rerun, but got the same error:

'
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/whose/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
result = lambda_closure()
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 836, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 869, in backward
result.closure_loss = self.trainer.accelerator.backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 308, in backward
output = self.precision_plugin.backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 62, in backward
closure_loss = super().backward(model, closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 79, in backward
model.backward(closure_loss, optimizer, opt_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1275, in backward
loss.backward(*args, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/whos/miniconda3/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 112, 192, 160], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7fc3d00a8de0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
output: TensorDescriptor 0x7fc3d00aa920
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
weight: FilterDescriptor 0x7fc3d00ab420
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 32, 32, 3, 3, 3,
Pointer addresses:
input: 0x7fc5de000000
output: 0x7fc614000000
weight: 0x7fc4597c9c00
Additional pointer addresses:
grad_output: 0x7fc614000000
grad_input: 0x7fc5de000000
Backward data algorithm: 3

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7fc3d00a8de0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
output: TensorDescriptor 0x7fc3d00aa920
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
weight: FilterDescriptor 0x7fc3d00ab420
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 32, 32, 3, 3, 3,
Pointer addresses:
input: 0x7fc5de000000
output: 0x7fc614000000
weight: 0x7fc4597c9c00
Additional pointer addresses:
grad_output: 0x7fc614000000
grad_input: 0x7fc5de000000
Backward data algorithm: 3

Exception ignored in: <function tqdm.__del__ at 0x7fc77f90a3a0>
Traceback (most recent call last):
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1128, in del
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1131, in repr
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict
TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc78556f8b2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc7857c1952 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc78555ab7d in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x312 (0x7fc821161912 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x342 (0x7fc8211601f2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fc8211340a2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fc820b2a9f6 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x8c1e4f (0x7fc821135e4f in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2c2c60 (0x7fc820b36c60 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2c3dce (0x7fc820b37dce in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1289f5 (0x55f6270d59f5 in /home/whos/miniconda3/bin/python)
frame #11: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #12: + 0x128926 (0x55f6270d5926 in /home/whos/miniconda3/bin/python)
frame #13: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #14: + 0x128746 (0x55f6270d5746 in /home/whos/miniconda3/bin/python)
frame #15: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #16: + 0x1289f5 (0x55f6270d59f5 in /home/whos/miniconda3/bin/python)
frame #17: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #18: + 0x128a2a (0x55f6270d5a2a in /home/whos/miniconda3/bin/python)
frame #19: + 0x11d332 (0x55f6270ca332 in /home/whos/miniconda3/bin/python)
frame #20: + 0x13c255 (0x55f6270e9255 in /home/whos/miniconda3/bin/python)
frame #21: _PyGC_CollectNoFail + 0x2a (0x55f6271e285a in /home/whos/miniconda3/bin/python)
frame #22: PyImport_Cleanup + 0x295 (0x55f6271f94d5 in /home/whos/miniconda3/bin/python)
frame #23: Py_FinalizeEx + 0x7d (0x55f6271f968d in /home/whos/miniconda3/bin/python)
frame #24: Py_RunMain + 0x110 (0x55f6271fbb90 in /home/whos/miniconda3/bin/python)
frame #25: Py_BytesMain + 0x39 (0x55f6271fbd19 in /home/whos/miniconda3/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fc8239eb840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1dee93 (0x55f62718be93 in /home/whos/miniconda3/bin/python)
'

When I run the suggested code snippet on its own, it works fine.
I used a server with 250 GB of memory and several V100 GPUs.
Could you please help me locate the problem?
Thank you so much.
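
As a general debugging aid for asynchronous cuDNN/CUDA failures like this one, kernel launches can be forced to run synchronously so that the Python traceback points at the op that actually failed. A minimal sketch, assuming the environment variable is set before any CUDA work happens (the convolution shape is taken from the error dump above):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before torch initializes CUDA

import torch

# Same 3D convolution as in the auto-generated repro, run with blocking launches
data = torch.randn([4, 32, 112, 192, 160], dtype=torch.half, device="cuda", requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=3, padding=1).cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()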

mibaumgartner added the question label on Jun 28, 2021
@mibaumgartner
Collaborator

Hi @baibaidj ,

I ran nnDet on many of our nodes (including V100 nodes) and it worked without any problems. Can you run nndet_env to collect your environment and also report your current PyTorch Lightning version?
I'll try your code snippet later today, but if it already triggers an error, there might be some other problem that is not related to nnDet.

Best,
Michael
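
A minimal sketch of one way to collect the requested versions (the printed fields are only illustrative and not the nndet_env output format):

import torch
import pytorch_lightning as pl

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("PyTorch Lightning:", pl.__version__)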

@mibaumgartner
Collaborator

I tried the provided code snippet on multiple workstations without encountering any error.

mibaumgartner added the Needs Additional Info label on Jul 5, 2021
@baibaidj
Author

baibaidj commented Jul 6, 2021

Hi @baibaidj ,

I ran nnDet on many of our nodes (including V100 nodes) and it worked without any problems. Can you run nndet_env to collect your environment and also report your current PyTorch Lightning version?
I'll try your code snippet later today, but if it already triggers an error, there might be some other problem that is not related to nnDet.

Best,
Michael

Thank you for the reply. My environment, as reported by nndet_env, is as follows:

'''
----- PyTorch Information -----
PyTorch Version: 1.7.1+cu110
PyTorch Debug: False
PyTorch CUDA: 11.0
PyTorch Backend cudnn: 8005
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80']
PyTorch Current Device Capability: (7, 0)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

System Arch List: None
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 80
Python Version: 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set True
det_models is set True
'''

And the PyTorch Lightning version on my machine is 1.3.7.

I can also run the code snippet without a problem.

I will try running the training experiment a few more times.

Thank you again.

@hushunda

Same problem here.

@baibaidj
Author

The problem did not occur when I ran the code on a machine with a single RTX 3090 GPU. I do not think it is a common issue and suggest closing this thread. Thank you.
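
Consistent with this, the backtrace above contains c10d::Reducer frames, so the failing run was using multi-GPU DDP. A minimal sketch of pinning a run to a single GPU for comparison (the device index is only an example, and the variable must be set before CUDA is initialized):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example index; set before torch initializes CUDA

import torch
print("visible GPUs:", torch.cuda.device_count())  # expected to print 1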

@mibaumgartner
Collaborator

Thanks for the update @baibaidj !

It is extremely hard to debug/reproduce this problem since the error message does not provide much information. I never ran into that issue, even though we tested nnDetection on various configurations.
