CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED #181

Open
Nyquist0 opened this issue Aug 14, 2024 · 3 comments

@Nyquist0

Dear Sir or Madam,

I hit the following error, which keeps interrupting my training process.
It happened after 1000 steps and is an ONNXRuntimeError.
Could you help check whether anything is wrong?

Environment:

  • 1x RTX 5880 Ada 48G (Ada architecture)
  • batch size: 1
  • python environment: conda environment you provided

commands:
CUDA_VISIBLE_DEVICES=1 accelerate launch -m --config_file accelerate_config.yaml --machine_rank 0 --main_process_ip 0.0.0.0 --main_process_port 20055 --num_machines 1 --num_processes 1 scripts.train_stage1 --config ./configs/train/stage1.yaml

error:

...
[2024-08-13 21:32:16,393] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!                                                                     
INFO:accelerate.accelerator:DeepSpeed Model and Optimizer saved to output dir ./exp_output/stage1/checkpoints/checkpoint-4000/pytorch_model                                                
INFO:accelerate.checkpointing:Scheduler state saved in exp_output/stage1/checkpoints/checkpoint-4000/scheduler.bin                                                                         
INFO:accelerate.checkpointing:Sampler state for dataloader 0 saved in exp_output/stage1/checkpoints/checkpoint-4000/sampler.bin                                                            
INFO:accelerate.checkpointing:Random states saved in exp_output/stage1/checkpoints/checkpoint-4000/random_states_0.pkl                                                                     
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: reference_unet-2500.pth                                                                                                                                              
Checkpoint saved at ./exp_output/stage1/modules/reference_unet-4000.pth                                                                                                                    
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: imageproj-2500.pth                                                                                                                                                   
Checkpoint saved at ./exp_output/stage1/modules/imageproj-4000.pth                                                                                                                         
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: denoising_unet-2500.pth                                                                                                                                              
Checkpoint saved at ./exp_output/stage1/modules/denoising_unet-4000.pth                                                                                                                    
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: face_locator-2500.pth                                                                                                                                                
Checkpoint saved at ./exp_output/stage1/modules/face_locator-4000.pth                                                                                                                      
INFO:__main__:Running validation...                                                                                                                                                        
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0                                                                              
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0                                                                           
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/glintr100.onnx recognition ['None', 3, 112, 112] 127.5 127.5

2024-08-13 21:33:53.948346489 [E:onnxruntime:, inference_session.cc:2045 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=lancel-server ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=181 ; expr=cudnnCreate(&cudnn_handle_);


ERROR:root:Failed to execute the training process: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=lancel-server ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=181 ; expr=cudnnCreate(&cudnn_handle_);

Looking forward to your reply. Thanks.

@xumingw
Contributor

xumingw commented Aug 14, 2024

Please check your onnx version; the inference step needs onnxruntime.
Does the inference script work?
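
For anyone following along, a quick way to run that version check is sketched below. It is only a minimal diagnostic, not part of this repository: the helper names `installed_version` and `runtime_report` are made up for illustration, and the list of package names is an assumption about what a typical environment for this project contains.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def runtime_report():
    """Collect versions of the packages involved, plus the ONNX Runtime providers."""
    # Package names are assumptions; compare the result against requirements.txt.
    names = ("onnx", "onnxruntime", "onnxruntime-gpu", "torch")
    report = {name: installed_version(name) for name in names}
    try:
        ort = importlib.import_module("onnxruntime")
        # Lists the execution providers this build supports,
        # e.g. whether CUDAExecutionProvider is present.
        report["providers"] = ort.get_available_providers()
    except ImportError:
        report["providers"] = None
    return report


if __name__ == "__main__":
    for key, value in runtime_report().items():
        print(f"{key}: {value}")
```

Running this inside the same conda environment used for training makes it easy to compare the output against requirements.txt.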

@Nyquist0
Author

Nyquist0 commented Aug 14, 2024

I am training directly. Let me check the inference script.
The onnx version exactly matches the one in your requirements.txt.

@Nyquist0
Author

Hi @xumingw
Inference works well. Is it possible that the onnx version you provided is compatible with the Ampere architecture but not with the Ada architecture?
Any suggestions?
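
One possibility that may be worth ruling out, although nothing in this thread confirms it: the failure is raised at `cudnnCreate` when validation opens the ONNX sessions inside a process that is already holding GPU memory for training, and the log shows those sessions created with an unbounded `gpu_mem_limit` and `EXHAUSTIVE` convolution search. The sketch below builds a more conservative provider list with a CPU fallback. The function name `cuda_then_cpu_providers` and the 2 GiB cap are hypothetical choices for illustration; the option keys are the ones printed in the log above.

```python
def cuda_then_cpu_providers(device_id=0, mem_limit_gib=2):
    """Build an ONNX Runtime provider list: capped CUDA first, CPU as fallback.

    The 2 GiB default is an arbitrary illustrative value, not a recommendation.
    """
    cuda_options = {
        "device_id": device_id,
        # Cap the CUDA memory arena instead of leaving it unbounded.
        "gpu_mem_limit": mem_limit_gib * 1024 ** 3,
        # Grow the arena only by what is requested.
        "arena_extend_strategy": "kSameAsRequested",
        # Avoid the workspace-hungry exhaustive algorithm search.
        "cudnn_conv_algo_search": "HEURISTIC",
    }
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


if __name__ == "__main__":
    providers = cuda_then_cpu_providers()
    print(providers)
    # Assumed usage, if onnxruntime is installed and a model path is at hand:
    #   import onnxruntime as ort
    #   session = ort.InferenceSession(model_path, providers=providers)
```

Passing only `["CPUExecutionProvider"]` for the face-analysis models during validation would be a simpler test: if the error disappears, the problem is more likely GPU resources inside the training process than the GPU architecture.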
