
Error when using ONNX with TensorRT (ORT-TRT) Optimization on Multi-GPU #7885

Open
efajardo-nv opened this issue Dec 16, 2024 · 1 comment

@efajardo-nv

Description
We recently updated examples in the Morpheus project from Triton Server 23.06 to 24.09. These examples use automatic ORT-TRT optimization, but we now get errors when running on multiple GPUs. Everything works as expected on a single GPU. We can also get it to work on multi-GPU if we remove the ORT-TRT optimization from the config.pbtxt. This is tracked in the corresponding Morpheus issue.

Errors can also be reproduced using the Triton densenet_onnx example model by updating its config.pbtxt to use ORT-TRT optimization and running on multi-GPU.

This appears to be an issue only with the automatic ORT-TRT optimization within Triton. Errors are not seen after deploying a TRT engine (model.plan) that I manually converted from ONNX.
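For reference, the manual ONNX-to-TRT conversion was done with trtexec along these lines (paths are illustrative, not the exact command from our setup):

$ trtexec --onnx=model_repository/densenet_onnx/1/model.onnx \
          --saveEngine=model_repository/densenet_onnx/1/model.plan \
          --fp16

Serving the resulting model.plan as a TensorRT model runs cleanly on both GPUs.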

Triton Information
What version of Triton are you using?
24.09, but the error also occurs with 24.11.

Are you using the Triton container or did you build it yourself?
Triton container

To Reproduce
Follow the steps in the Quickstart. Before running Triton, update the config.pbtxt for densenet_onnx to use ORT-TRT optimization by adding the following to the end of the file:

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
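(precision_mode FP16 enables TensorRT FP16 kernels; max_workspace_size_bytes sets the TensorRT builder workspace, 1073741824 bytes = 1 GiB.)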

Also, use --gpus=all to run Triton:

$ docker run --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models --model-control-mode=explicit --load-model densenet_onnx
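Then send a few inference requests, e.g. with the image_client example from the Quickstart SDK container (illustrative invocation; any client works):

$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk \
    /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg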

On my machine with two GPUs (Quadro RTX 8000), I see errors in the Triton logs with every other inference request. Logs after four inference requests:

I1216 19:31:43.982954 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1216 19:31:43.983283 1 http_server.cc:4729] "Started HTTPService at 0.0.0.0:8000"
I1216 19:31:44.025228 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
2024-12-16 19:32:44.509169653 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-12-16 19:32:44   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cask (Cask convolution execution)
2024-12-16 19:32:44.509209195 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_densenet121_13954451369262798226_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_densenet121_13954451369262798226_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-12-16 19:32:51.214520104 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-12-16 19:32:51   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-12-16 19:32:51.214547226 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_densenet121_13954451369262798226_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_densenet121_13954451369262798226_0_0' Status Message: TensorRT EP execution context enqueue failed.

Expected behavior
No errors with ORT-TRT optimization on multi-GPU.
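A possible interim workaround (untested sketch): since Triton defaults to one model instance per visible GPU, pinning the instance group to a single GPU in config.pbtxt should match the single-GPU case that works, at the cost of multi-GPU scaling:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]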

@hoangphuc1998

I experienced the same issue. Any suggestions on how to fix this?
