
Error when using ONNX with TensorRT (ORT-TRT) Optimization on Multi-GPU #7885

Open
efajardo-nv opened this issue Dec 16, 2024 · 1 comment

@efajardo-nv

Description
We recently updated examples in the Morpheus project from Triton Server 23.06 to 24.09. These examples use automatic ORT-TRT optimization, but we now get errors when running on multiple GPUs. Everything works as expected on a single GPU. We can also get it to work on multi-GPU if we remove the ORT-TRT optimization from the config.pbtxt. This is tracked in the corresponding Morpheus issue.

Errors can also be reproduced using the Triton densenet_onnx example model by updating its config.pbtxt to use ORT-TRT optimization and running on multi-GPU.

This appears to be an issue only with the automatic ORT-TRT optimization within Triton. Errors are not seen after deploying a TRT engine (model.plan) that I manually converted from ONNX.
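For reference, the manual ONNX-to-TRT conversion was done with trtexec along these lines (paths are illustrative, not the exact command from our setup):

$ trtexec --onnx=model_repository/densenet_onnx/1/model.onnx \
          --saveEngine=model_repository/densenet_onnx/1/model.plan \
          --fp16

Serving the resulting model.plan as a TensorRT model runs cleanly on both GPUs.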

Triton Information
What version of Triton are you using?
24.09, but the error also occurs with 24.11.

Are you using the Triton container or did you build it yourself?
Triton container

To Reproduce
Follow the steps in the Quickstart. Before running Triton, update the config.pbtxt for densenet_onnx to use ORT-TRT optimization by adding the following to the end of the file:

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
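(precision_mode FP16 enables TensorRT FP16 kernels; max_workspace_size_bytes sets the TensorRT builder workspace, 1073741824 bytes = 1 GiB.)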

Also, use --gpus=all to run Triton:

$ docker run --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models --model-control-mode=explicit --load-model densenet_onnx
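Then send a few inference requests, e.g. with the image_client example from the Quickstart SDK container (illustrative invocation; any client works):

$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk \
    /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg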

On my machine with two GPUs (Quadro RTX 8000), I see errors in the Triton logs with every other inference request. Logs after four inference requests:

I1216 19:31:43.982954 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1216 19:31:43.983283 1 http_server.cc:4729] "Started HTTPService at 0.0.0.0:8000"
I1216 19:31:44.025228 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
2024-12-16 19:32:44.509169653 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-12-16 19:32:44   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cask (Cask convolution execution)
2024-12-16 19:32:44.509209195 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_densenet121_13954451369262798226_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_densenet121_13954451369262798226_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-12-16 19:32:51.214520104 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-12-16 19:32:51   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-12-16 19:32:51.214547226 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_densenet121_13954451369262798226_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_densenet121_13954451369262798226_0_0' Status Message: TensorRT EP execution context enqueue failed.

Expected behavior
No errors with ORT-TRT optimization on multi-GPU.
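A possible interim workaround (untested sketch): since Triton defaults to one model instance per visible GPU, pinning the instance group to a single GPU in config.pbtxt should match the single-GPU case that works, at the cost of multi-GPU scaling:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]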

@hoangphuc1998

I experienced the same issue. Any suggestions on how to fix this?
