I encountered an issue with ONNX Runtime when running CUDA sessions in Unity. In Python, I am able to create three (multiple) CUDA sessions for my models on a single graphics card and run them sequentially for inference without any issues. The GPU is utilized correctly, and each model returns the expected predictions.
However, when attempting to replicate this setup in Unity:
If I create only one CUDA session, the inference runs correctly, and the output is as expected.
If I create two CUDA sessions and run the models sequentially, the inference runs without errors, but the output values are empty.
The same models work perfectly in Python with multiple CUDA sessions, but in Unity, only the first CUDA session seems to work as intended.
Additional context:
GPU Model: NVIDIA A6000
GPU Memory: 48GB
Unity Version: 2023.2.20f1
To reproduce
Create CUDA sessions for two models in Unity using ONNX Runtime.
Load the models into the CUDA sessions.
Run the models sequentially for inference.
Observe that while the output values are produced, they are empty.
if (!OrtEnv.IsCreated)
{
    var envOptions = new EnvironmentCreationOptions
    {
        logId = "FaceDetect",
        logLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE,
        loggingFunction = MyCustomLoggingFunction,
        threadOptions = null,
    };
    OrtEnv.CreateInstanceWithOptions(ref envOptions);
}
OrtCUDAProviderOptions cudaOptionFaceDetect = new OrtCUDAProviderOptions();
var providerOptionsDict = new Dictionary<string, string>
{
    ["device_id"] = "0",
    ["gpu_mem_limit"] = "2147483648", // 2 GiB
};
cudaOptionFaceDetect.UpdateOptions(providerOptionsDict);
//************************ Face Detection Model **********************
faceDetectSessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptionFaceDetect);
faceDetectSessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
faceDetectSessionOptions.LogVerbosityLevel = 3;
faceDetectSessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;
faceDetect = new FaceDetect(faceDetectModel.bytes, faceDetectOptions, faceDetectSessionOptions);
// ****************** Face Mesh Model ********************
OrtCUDAProviderOptions cudaOptionFaceMesh = new OrtCUDAProviderOptions();
cudaOptionFaceMesh.UpdateOptions(providerOptionsDict);
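// Note: this passes cudaOptionFaceDetect rather than cudaOptionFaceMesh, so both sessions are built from the same provider-options object.
faceMeshSessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(cudaOptionFaceDetect);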
faceMeshSessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
faceMeshSessionOptions.LogVerbosityLevel = 3;
faceMeshSessionOptions.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;
faceMesh = new FaceMesh(faceMeshModel.bytes, faceMeshOptions, faceMeshSessionOptions);
Explanation of Behavior Change
When faceMesh is commented out: The code only initializes and runs the face detection model (faceDetect), so the application performs face detection without the more detailed face mesh analysis. Because only one model is loaded, ONNX Runtime manages a single CUDA session, and inference works as expected.
When faceMesh is not commented out: Both the face detection model (faceDetect) and the face mesh model (faceMesh) are initialized. This creates two CUDA sessions using the same OrtCUDAProviderOptions. Initializing multiple sessions with the same CUDA provider settings may lead to conflicts in the sessions' internal graph state, resulting in empty outputs. This could explain why, when both models are used sequentially in Unity, the output values are empty.
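The word "empty" above could mean a zero-length buffer, a buffer of the right length filled with zeros, or NaN values, and each points at a different kind of failure. The following is only a sketch of how the readback could be classified, written in Python (the working baseline in this report) and easy to port to C#. The function name `classify_output` and its return labels are hypothetical; the shape in the usage lines is that of the `Identity_g2` output reported in the logs.

```python
import math


def classify_output(values, shape):
    """Classify a flat sequence of floats read back from a model output.

    `values` is the flattened output buffer; `shape` is the tensor shape
    the session reports for that output.
    """
    expected = math.prod(shape)  # number of elements the shape implies
    if len(values) == 0:
        return "zero-length"
    if len(values) != expected:
        return "wrong-length"
    if any(math.isnan(v) for v in values):
        return "contains-nan"
    if all(v == 0.0 for v in values):
        return "all-zero"
    return "populated"


if __name__ == "__main__":
    # Shape of Identity_g2 as printed in the logs: 1,1,1,1434
    shape = (1, 1, 1, 1434)
    print(classify_output([0.0] * 1434, shape))
    print(classify_output([], shape))
```

Reporting which of these cases occurs for each session would help narrow down whether the second model never ran, ran but wrote elsewhere, or produced invalid numbers.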
Important logs:
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:583 onnxruntime::InferenceSession::TraceSessionOptions): Session Options { execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath: enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:0 session_log_verbosity_level:10 max_num_graph_transformation_steps:10 graph_optimization_level:0 intra_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str: set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str: set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 config_options: { } }
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:491 onnxruntime::InferenceSession::ConstructorCommon): Creating and using per session threadpools since use_per_session_threads_ is true
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:509 onnxruntime::InferenceSession::ConstructorCommon): Dynamic block base set to 0
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1669 onnxruntime::InferenceSession::Initialize): Initializing session.
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1706 onnxruntime::InferenceSession::Initialize): Adding default CPU execution provider.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 2147483648 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_partitioner.cc:898 onnxruntime::GraphPartitioner::InlineFunctionsAOT): This model does not have any local functions defined. AOT Inlining is not performed
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer CastFloat16Transformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer MemcpyTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1148 onnxruntime::VerifyEachNodeIsAssignedToAnEp): Node placements
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1151 onnxruntime::VerifyEachNodeIsAssignedToAnEp): All nodes placed on [CUDAExecutionProvider]. Number of nodes: 94
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:128 onnxruntime::SessionState::CreateGraphInfo): SaveMLValueNameIndexMapping
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:174 onnxruntime::SessionState::CreateGraphInfo): Done saving OrtValue mappings.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (cuda_execution_provider.cc:184 onnxruntime::CUDAExecutionProvider::PerThreadContext::PerThreadContext): cuDNN version: 90400
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (allocation_planner.cc:2567 onnxruntime::IGraphPartitioner::CreateGraphPartitioner): Use DeviceBasedPartition as default
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:276 onnxruntime::session_state_utils::SaveInitializedTensors): Saving initialized tensors.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:0 (requested) num_bytes: 144 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001410024400 to 0000001410124400
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:0 (requested) num_bytes: 144 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000026205E21080 to 0000026205F21080
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:427 onnxruntime::session_state_utils::SaveInitializedTensors): Done saving initialized tensors
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:2106 onnxruntime::InferenceSession::Initialize): Session successfully initialized.
Version: 1.20.0
Input:
[input_g1] shape: 1,3,128,128, type: System.Single isTensor: True
Output:
[classificators_g1] shape: 1,896,1, type: System.Single isTensor: True
[regressors_g1] shape: 1,896,16, type: System.Single isTensor: True
[ImageInference.AllocateTensors] Input: input_g1: shape: 1,3,128,128, type: System.Single isTensor: True
[ImageInference.AllocateTensors] Input: classificators_g1: shape: 1,896,1, type: System.Single isTensor: True
[ImageInference.AllocateTensors] Input: regressors_g1: shape: 1,896,16, type: System.Single isTensor: True
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:583 onnxruntime::InferenceSession::TraceSessionOptions): Session Options { execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath: enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:0 session_log_verbosity_level:10 max_num_graph_transformation_steps:10 graph_optimization_level:0 intra_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str: set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str: set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 config_options: { } }
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:491 onnxruntime::InferenceSession::ConstructorCommon): Creating and using per session threadpools since use_per_session_threads_ is true
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:509 onnxruntime::InferenceSession::ConstructorCommon): Dynamic block base set to 0
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1669 onnxruntime::InferenceSession::Initialize): Initializing session.
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:1706 onnxruntime::InferenceSession::Initialize): Adding default CPU execution provider.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 2147483648 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena): Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
[FaceDetect-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena): Creating 21 bins of max chunk size 256 to 268435456
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_partitioner.cc:898 onnxruntime::GraphPartitioner::InlineFunctionsAOT): This model does not have any local functions defined. AOT Inlining is not performed
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer EnsureUniqueDQForNodeUnit modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer RemoveDuplicateCastTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer CastFloat16Transformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply): GraphTransformer MemcpyTransformer modified: 0 with status: OK
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1148 onnxruntime::VerifyEachNodeIsAssignedToAnEp): Node placements
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:1151 onnxruntime::VerifyEachNodeIsAssignedToAnEp): All nodes placed on [CUDAExecutionProvider]. Number of nodes: 498
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:128 onnxruntime::SessionState::CreateGraphInfo): SaveMLValueNameIndexMapping
[-ORT_LOGGING_LEVEL_VERBOSE] onnxruntime (session_state.cc:174 onnxruntime::SessionState::CreateGraphInfo): Done saving OrtValue mappings.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (cuda_execution_provider.cc:184 onnxruntime::CUDAExecutionProvider::PerThreadContext::PerThreadContext): cuDNN version: 90400
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (allocation_planner.cc:2567 onnxruntime::IGraphPartitioner::CreateGraphPartitioner): Use DeviceBasedPartition as default
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:276 onnxruntime::session_state_utils::SaveInitializedTensors): Saving initialized tensors.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:1 (requested) num_bytes: 512 (actual) rounded_bytes:512
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411400000 to 0000001411500000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:1 (requested) num_bytes: 512 (actual) rounded_bytes:512
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 000002620621E080 to 000002620631E080
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:13 (requested) num_bytes: 2936832 (actual) rounded_bytes:2936832
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 5242880
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411600000 to 0000001411A00000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cpu. bin_num:13 (requested) num_bytes: 2936832 (actual) rounded_bytes:2936832
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 5242880
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000026206329080 to 0000026206729080
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:9 (requested) num_bytes: 131072 (actual) rounded_bytes:131072
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 4194304 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 9437184
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001411A00000 to 0000001411E00000
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (session_state_utils.cc:427 onnxruntime::session_state_utils::SaveInitializedTensors): Done saving initialized tensors
[-ORT_LOGGING_LEVEL_INFO] onnxruntime (inference_session.cc:2106 onnxruntime::InferenceSession::Initialize): Session successfully initialized.
Version: 1.20.0
Input:
[input_12_g2] shape: 1,3,256,256, type: System.Single isTensor: True
Output:
[Identity_g2] shape: 1,1,1,1434, type: System.Single isTensor: True
[Identity_1_g2] shape: 1,1,1,1, type: System.Single isTensor: True
[Identity_2_g2] shape: 1,1, type: System.Single isTensor: True
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for Cuda. bin_num:11 (requested) num_bytes: 597040 (actual) rounded_bytes:597248
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 2097152 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 3145728
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000001412600000 to 0000001412800000
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:347 onnxruntime::BFCArena::AllocateRawInternal): Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:206 onnxruntime::BFCArena::Extend): Extended allocation by 1048576 bytes.
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:209 onnxruntime::BFCArena::Extend): Total allocated bytes: 1048576
[FaceDetect-ORT_LOGGING_LEVEL_INFO] onnxruntime (bfc_arena.cc:212 onnxruntime::BFCArena::Extend): Allocated memory at 0000000304C00600 to 0000000304D00600
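One thing the arena lines above can rule in or out is memory pressure against the 2147483648-byte gpu_mem_limit. The sketch below tallies the "Extended allocation by N bytes" lines per allocator. It assumes the log format is exactly the one quoted in this report, and the function name `summarize_arena_growth` is hypothetical. Because both sessions log under the same "FaceDetect" id, the tally is per allocator name across all sessions, not per session.

```python
import re
from collections import defaultdict

# Patterns match the BFCArena lines quoted in this report.
_EXTENDING = re.compile(r"Extending BFCArena for (\w+)\.")
_EXTENDED = re.compile(r"Extended allocation by (\d+) bytes")


def summarize_arena_growth(log_text):
    """Return {allocator_name: total_bytes_extended} for a verbose ORT log."""
    totals = defaultdict(int)
    current = None  # allocator named by the most recent "Extending" line
    for line in log_text.splitlines():
        match = _EXTENDING.search(line)
        if match:
            current = match.group(1)
            continue
        match = _EXTENDED.search(line)
        if match and current is not None:
            totals[current] += int(match.group(1))
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_arena.py < unity_ort.log
    for name, total in sorted(summarize_arena_growth(sys.stdin.read()).items()):
        print(f"{name}: {total} bytes ({total / 2**20:.1f} MiB)")
```

For the lines quoted above, the Cuda extensions add up to 12582912 bytes (12 MiB) across both sessions, far below the configured limit, so the memory cap does not look like the cause of the empty outputs.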
Urgency
This issue is blocking a critical use case in our project. We need to run multiple models sequentially using CUDA sessions in Unity. Any delay in resolving this issue would impact our project timeline significantly.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
I have stopped using ONNX Runtime in my current project. Instead, I now use NVIDIA TensorRT directly, which creates other problems, but they are manageable.
Platform: Windows
OS Version: Windows 11 Pro
ONNX Runtime Installation: Built from Source
ONNX Runtime Version or Commit ID: 291a535
ONNX Runtime API: C#
Architecture: X64
Execution Provider: CUDA
Execution Provider Library Version: CUDA 12.5.1, CUDNN 9.4, TensorRT 10.4.0.26