CUDNN_STATUS_INTERNAL_ERROR during backward #10
Comments
Hi @baibaidj, I ran nnDet on many of our nodes (including V100 nodes) and it worked without any problems. Can you run the code snippet from the error message and check whether it reproduces the error? Best,
I tried the provided code snippet on multiple workstations without encountering any error.
Thank you for the reply. My environment, as reported by nndet_env, is:
----- System Information -----
System Arch List: None
----- nnDetection Information -----
The PyTorch Lightning version on my machine is 1.3.7. I can also run the code snippet without a problem. I will try to run the training experiment a few more times. Thank you again.
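For reference, the relevant versions can also be printed directly from Python. This is just a minimal sketch using standard PyTorch / PyTorch Lightning attributes, not part of nnDetection:

```python
# Minimal version check (assumption: plain PyTorch / Lightning attributes, not nnDetection code).
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("pytorch_lightning:", pl.__version__)
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```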
Same problem.
The problem did not occur when I ran the code on a machine with a single RTX 3090 GPU. I do not think this is a common issue and suggest closing this thread. Thank you.
Thanks for the update @baibaidj! It is extremely hard to debug/reproduce this problem since the error message does not provide much information. I never ran into that issue even though we tested nnDetection on various configurations.
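One way to narrow it down, sketched below under the assumption that the multi-GPU setup is involved: restrict the run to a single GPU, make CUDA calls synchronous so the failing kernel is reported at the right place, and disable the cuDNN autotuner. These are standard PyTorch/CUDA switches, not an nnDetection API:

```python
# Hedged isolation sketch: standard CUDA/PyTorch switches, set before CUDA is initialised.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # single GPU, mirroring the working 3090 run
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # report the failing kernel synchronously

import torch
torch.backends.cudnn.benchmark = False     # skip cuDNN algorithm autotuning
torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
```

If the error disappears with these settings, that would point towards the multi-GPU / cuDNN algorithm-selection path rather than the data or the model itself.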
Hi. This is an awesome repo.
I tried to run training on RibFrac using nnDet and ran into the following issue:
the training ran smoothly for 6 epochs, then suddenly broke and the cuDNN error below was thrown.
I rebooted the machine and tried to rerun, but got the same error:
```
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/whose/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
result = lambda_closure()
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 836, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 869, in backward
result.closure_loss = self.trainer.accelerator.backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 308, in backward
output = self.precision_plugin.backward(
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 62, in backward
closure_loss = super().backward(model, closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 79, in backward
model.backward(closure_loss, optimizer, opt_idx)
File "/home/whos/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1275, in backward
loss.backward(*args, **kwargs)
File "/home/whos/miniconda3/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/whos/miniconda3/lib/python3.8/site-packages/torch/autograd/init.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 112, 192, 160], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7fc3d00a8de0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
output: TensorDescriptor 0x7fc3d00aa920
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 112, 192, 160,
strideA = 110100480, 3440640, 30720, 160, 1,
weight: FilterDescriptor 0x7fc3d00ab420
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 32, 32, 3, 3, 3,
Pointer addresses:
input: 0x7fc5de000000
output: 0x7fc614000000
weight: 0x7fc4597c9c00
Additional pointer addresses:
grad_output: 0x7fc614000000
grad_input: 0x7fc5de000000
Backward data algorithm: 3
Exception ignored in: <function tqdm.__del__ at 0x7fc77f90a3a0>
Traceback (most recent call last):
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1128, in del
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1131, in repr
File "/home/whos/miniconda3/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict
TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc78556f8b2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc7857c1952 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc78555ab7d in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x312 (0x7fc821161912 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x342 (0x7fc8211601f2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fc8211340a2 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fc820b2a9f6 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x8c1e4f (0x7fc821135e4f in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2c2c60 (0x7fc820b36c60 in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2c3dce (0x7fc820b37dce in /home/whos/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1289f5 (0x55f6270d59f5 in /home/whos/miniconda3/bin/python)
frame #11: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #12: + 0x128926 (0x55f6270d5926 in /home/whos/miniconda3/bin/python)
frame #13: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #14: + 0x128746 (0x55f6270d5746 in /home/whos/miniconda3/bin/python)
frame #15: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #16: + 0x1289f5 (0x55f6270d59f5 in /home/whos/miniconda3/bin/python)
frame #17: + 0x1c0004 (0x55f62716d004 in /home/whos/miniconda3/bin/python)
frame #18: + 0x128a2a (0x55f6270d5a2a in /home/whos/miniconda3/bin/python)
frame #19: + 0x11d332 (0x55f6270ca332 in /home/whos/miniconda3/bin/python)
frame #20: + 0x13c255 (0x55f6270e9255 in /home/whos/miniconda3/bin/python)
frame #21: _PyGC_CollectNoFail + 0x2a (0x55f6271e285a in /home/whos/miniconda3/bin/python)
frame #22: PyImport_Cleanup + 0x295 (0x55f6271f94d5 in /home/whos/miniconda3/bin/python)
frame #23: Py_FinalizeEx + 0x7d (0x55f6271f968d in /home/whos/miniconda3/bin/python)
frame #24: Py_RunMain + 0x110 (0x55f6271fbb90 in /home/whos/miniconda3/bin/python)
frame #25: Py_BytesMain + 0x39 (0x55f6271fbd19 in /home/whos/miniconda3/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fc8239eb840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1dee93 (0x55f62718be93 in /home/whos/miniconda3/bin/python)
```
When I run the code snippet as suggested, it works fine.
I used a server with 250 GB of memory and several V100 GPUs.
Could you please help locate the problem?
Thank you so much.
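In case it helps, here is a minimal sketch (my own assumption, reusing only the tensor shapes from the cuDNN dump above) that reruns the failing 3D convolution under torch.cuda.amp autocast instead of a plain .half() cast, to check whether the half-precision path itself is implicated:

```python
# Hedged sketch: rerun the convolution from the cuDNN dump under autocast + GradScaler.
# Shapes come from the error log above; everything else is an assumption, not nnDetection code.
import torch

torch.backends.cudnn.benchmark = False

data = torch.randn([4, 32, 112, 192, 160], device="cuda", requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=3, padding=1).cuda()
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    out = net(data)            # conv runs in fp16 under autocast
    loss = out.float().mean()  # reduce to a scalar for backward

scaler.scale(loss).backward()
torch.cuda.synchronize()
print("backward finished without a cuDNN error")
```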