
TensorRT EP could not deserialize engine from binary data #22139

Closed
adaber opened this issue Sep 18, 2024 · 29 comments
Labels: api:CSharp (issues related to the C# API), ep:TensorRT (issues related to TensorRT execution provider), performance (issues related to performance regressions)

Comments

@adaber

adaber commented Sep 18, 2024

Describe the issue

Hi,

I've wrapped a TensorRT engine in an _ctx.onnx file using the official Python script (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/tensorrt/gen_trt_engine_wrapper_onnx_model.py#L156-L187).

The problem is that I get the "TensorRT EP could not deserialize engine from binary data" error. The TensorRT model works well using the TensorRT API. I am kind of stuck since there is no other information to help me figure out why this happens.

I've tried using different ortTrtOptions but to no avail.

This error occurs when creating an inference session. I tried both the FP16 and INT8 version and I got the same error.
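
For reference, the session creation on my side looks roughly like this (a minimal sketch; the file name is a placeholder and the TRT options are left at their defaults):

using Microsoft.ML.OnnxRuntime;

using var sessionOptions = new SessionOptions();
// TensorRT EP with default options on GPU 0 (I also tried non-default ortTrtOptions).
sessionOptions.AppendExecutionProvider_Tensorrt(0);
// The "could not deserialize engine from binary data" error is raised here,
// while the InferenceSession is constructed from the _ctx.onnx file.
using var session = new InferenceSession("EmbededTrtEngine_FP16_ctx.onnx", sessionOptions);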

I've uploaded the FP16 version and it'd be great if you have time to look at it.

Thanks!

Edit:

Graphics Card: 3090

The trt engine was built using the following profile shapes:
min: 1x1024x128x3
opt: 1x4096x640x3
max: 1x8000x1400x3

To reproduce

EmbededTrtEngine_FP16_ctx.zip

Urgency

Either a workaround or a fix would help.

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8, CuDNN 8.9.7.29, TRT 10.4.0.26 and 10.1.0.27

Model File

EmbededTrtEngine_FP16_ctx.zip

Is this a quantized model?

No

adaber added the performance label Sep 18, 2024
github-actions bot added the api:CSharp and ep:TensorRT labels Sep 18, 2024
@jywu-msft
Member

thanks for reporting the issue and attaching the context model file. it's difficult for me to debug it directly: trt engines are platform/version specific and i'm unable to load your engine with my ampere based gpu.

perhaps we can start from the beginning using the same example to confirm the basic mechanism is working for you.

first thing I discovered is that the gen_trt_engine_wrapper_onnx_model script needs a small modification: one of the engine attributes it uses is deprecated and needs to be updated to num_io_bindings (which I assume you did since you were able to generate the onnx file)

I downloaded https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-12.onnx
and generated a trt engine using trtexec (using TensorRT 10.4)
like this:

trtexec.exe --onnx=mobilenetv2-12.onnx --saveEngine=trt.engine

then I used the updated script to generate the epcontext onnx file

C:\tensorrt\TensorRT-10.4.0.26\bin>python gen_trt_engine_wrapper_onnx_model.py -p trt.engine -m test.onnx -e 1
['input']
[<DataType.FLOAT: 0>]
[(1, 3, 224, 224)]
['output']
[<DataType.FLOAT: 0>]
[(1, 1000)]
test.onnx is created.

next I used onnxruntime-gpu python bindings to create session from the test.onnx file
I know you were using C# bindings, but python is simpler for this exercise.
pip install onnxruntime-gpu

>>> import onnxruntime as ort
>>> ort.get_available_providers()
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
>>> sess = ort.InferenceSession('test.onnx', providers=ort.get_available_providers())
2024-09-19 22:10:47.9964085 [W:onnxruntime:Default, onnx_ctx_model_helper.cc:412 onnxruntime::TensorRTCacheModelHandler::ValidateEPCtxNode] It's suggested to set the ORT graph optimization level to 0 
and make "embed_mode" to 0 ("ep_cache_context" is the cache path) for the best model loading time
>>>

it did succeed so the basic mechanism is working.

The error message you encountered comes from this line in the TensorRT EP source:

"TensorRT EP could not deserialize engine from binary data");

and is raised when TensorRT fails to deserialize the engine bytes.
it's unclear to me why you are encountering that error.
Can you try to repro the steps above to confirm that the basic example is working for you?

@adaber
Author

adaber commented Sep 20, 2024

Hi jywu-msft,

Thank you for your willingness to help. It is very appreciated.

> first thing I discovered is the gen_trt_engine_wrapper_onnx_model script needs a small modification

Yup. That's what I did, too.

Do you mind trying to create a trt engine using the original onnx file, wrap it, and create an inference session on your computer, since you got it to work? I've attached the original onnx file (below).
TestModel.zip

I will try to do the same on my PC.

Thank you again for your help and let me know how it goes.

@adaber
Author

adaber commented Sep 20, 2024

I just tried loading my model (not mobilenetv2-12) in Python and I got a different error. It said "[ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for EPContext(1) node with name 'EPContext'".

Not sure why I got that error. It seems the TRT execution provider might not have been installed properly. I used pip install onnxruntime-gpu.
ort.get_available_providers() shows ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'] as the providers that are available.

RTX 3090
Trt: 10.4.0
ONNX : 1.16.2
ONNXRuntime: 1.19.2
Win 10

@jywu-msft
Member

let me experiment a little more on my side.
in my case I actually built TensorRT EP from source. but onnxruntime-gpu 1.19.2 should have all the changes needed to support EPContext, since those changes were added around the 1.18 timeframe.
+@chilo-ms fyi

@adaber
Author

adaber commented Sep 20, 2024

@jywu-msft That sounds great. Look forward to hearing back from you. Thanks!

@jywu-msft
Member

I tested with the python onnxruntime-gpu 1.19.2 package and it also worked fine.
i'm not sure why you are encountering "Could not find an implementation for EPContext(1) node with name 'EPContext'"
can you enable verbose logging with

import onnxruntime as ort
ort.set_default_logger_severity(0)

and share the full output?

@adaber
Author

adaber commented Sep 24, 2024

There is a problem with the trt execution provider. I've found a few posts complaining about the same thing.

It says "onnxruntime_pybind_state.cc:490 onnxruntime::python::RegisterTensorRTPluginsAsCustomOps. Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and your GPU is supported". Then it falls back to the CUDA and CPU execution providers.

Have you gotten a chance to try to create a trt engine using my ONNX model (TestModel.zip, uploaded 4 days ago in one of my messages), wrap it, and load it as an embedded engine?

Thanks for the help!

@jywu-msft
Member

> There is a problem with the trt execution provider. I've found a few posts complaining about the same thing.
>
> It says "onnxruntime_pybind_state.cc:490 onnxruntime::python::RegisterTensorRTPluginsAsCustomOps. Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and your GPU is supported". Then it falls back to the CUDA and CPU execution providers.
>
> Have you gotten a chance to try to create a trt engine using my ONNX model (TestModel.zip, uploaded 4 days ago in one of my messages), wrap it, and load it as an embedded engine?
>
> Thanks for the help!

are you adding the required TensorRT, CUDA and CuDNN libraries in your PATH?

@adaber
Author

adaber commented Sep 24, 2024

I have. I can double check if I did it correctly. What's strange, though, is that I don't have to do that when I install TRT and CUDA to run scripts that create TRT engines using ONNX models (FP16, INT8 quantization...). It just works after pip installing the libraries.

Edit: I am using Anaconda (Win 10)

@jywu-msft
Member

re: your other question. yes it seems to work

steps i followed
create TRT engine using trtexec from TRT 10.4

trtexec --onnx=TestModel.onnx --minShapes=input:1x1024x128x3 --optShapes=input:1x4096x640x3 --maxShapes=input:1x8000x1400x3 --saveEngine=TestModel.engine

generate ctx Onnx model

C:\tensorrt\TensorRT-10.4.0.26\bin>python gen_trt_engine_wrapper_onnx_model.py -p TestModel.engine -m TestModelEmbed.onnx -e 1
['input']
[<DataType.FLOAT: 0>]
[(1, -1, -1, 3)]
['output']
[<DataType.FLOAT: 0>]
[(1, -1, -1, 1)]
TestModelEmbed.onnx is created.

create session from Embed onnx model using python bindings

C:\tensorrt\TensorRT-10.4.0.26\bin>python
Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime as ort
>>> sess = ort.InferenceSession('TestModelEmbed.onnx', providers=ort.get_available_providers())
2024-09-23 18:31:20.2606211 [W:onnxruntime:Default, onnx_ctx_model_helper.cc:412 onnxruntime::TensorRTCacheModelHandler::ValidateEPCtxNode] It's suggested to set the ORT graph optimization level to 0 and make "embed_mode" to 0 ("ep_cache_context" is the cache path) for the best model loading time
>>>

it all succeeded, so the basic workflow seems to be working.
Note that during trt engine creation, it skipped a bunch of tactics because the gpu on my test laptop didn't have enough memory, but it was able to create the engine in the end (skipping some of the optimizations).

@jywu-msft
Member

> I have. I can double check if I did it correctly. What's strange, though, is that I don't have to do that when I install TRT and CUDA to run scripts that create TRT engines using ONNX models (FP16, INT8 quantization...). It just works after pip installing the libraries.
>
> Edit: I am using Anaconda (Win 10)

I just noticed that you are using CUDA 11.8 and CuDNN 8.9 with TensorRT 10.4.
the onnxruntime-gpu package on pypi requires CUDA 12.x and CuDNN 9.x.
if you want to use CUDA 11.8 and CuDNN 8.9, uninstall the onnxruntime-gpu you installed from pypi and install the CUDA 11 build from the alternate index:

pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-11/pypi/simple/

@adaber
Author

adaber commented Sep 24, 2024

Thanks for the quick responses.

Great to hear you got everything to work. I will try to do the same tests in Python.

I do use CUDA 12, CuDNN 9.x and TRT 10.4 in my C# app. Do you think that someone could try loading that wrapped trt model using ONNX Runtime C#?

@adaber
Author

adaber commented Sep 24, 2024

Also, can you try creating a trt model using FP16?

Thanks!

@adaber
Author

adaber commented Sep 24, 2024

Just tried to use trtexec instead of my code for creating trt engines and I got the same error using the C# ONNX Runtime wrapper (CUDA 12.6, CuDNN 9.4.0, TRT 10.4.0, ONNX Runtime 1.19.2). Will do a test in Python but I still need to get it working in C#.

Thanks!

@jywu-msft
Member

jywu-msft commented Sep 24, 2024

> Just tried to use trtexec instead of my code for creating trt engines and I got the same error using the C# ONNX Runtime wrapper (CUDA 12.6, CuDNN 9.4.0, TRT 10.4.0, ONNX Runtime 1.19.2). Will do a test in Python but I still need to get it working in C#.
>
> Thanks!

right. let's confirm the system setup first. it doesn't matter if you're using python, C#, C APIs, etc. they all end up in the same TRT EP C++ code to deserialize the engine.
i'm trying to test creating an fp16 engine now. will see if someone can help test C# later.

@adaber
Author

adaber commented Sep 24, 2024

Sounds good!

@adaber
Author

adaber commented Sep 24, 2024

Quick update. I got ONNX Runtime to work in Python and I managed to load my _ctx file (TRT 10.4, CuDNN 9.4.0, CUDA 12.6).

So, I guess, the only thing that's left is to see if you can get the C# wrapper to do the same (TRT 10.4, CuDNN 9.4.0, CUDA 12.6, ONNX Runtime 1.19.2 )

Thank you again for your help and time. It is greatly appreciated.

@jywu-msft
Member

> Quick update. I got ONNX Runtime to work in Python and I managed to load my _ctx file (TRT 10.4, CuDNN 9.4.0, CUDA 12.6).
>
> So, I guess, the only thing that's left is to see if you can get the C# wrapper to do the same (TRT 10.4, CuDNN 9.4.0, CUDA 12.6, ONNX Runtime 1.19.2)
>
> Thank you again for your help and time. It is greatly appreciated.

what was your issue with python? (it would be good feedback for updating anything on our side to make it easier)

@adaber
Author

adaber commented Sep 24, 2024

The problem was that I didn't have all the necessary cuDNN dll files in the folder that's in my PATH. Copying them to that designated folder fixed the issue.

@adaber
Author

adaber commented Sep 24, 2024

Another quick update.

I got it working in C++, too.

The C# wrapper, however, refuses to work, which doesn't make much sense since it uses the same underlying C++ code. It'd be great if you could look into it when you get a chance. I am eager to find out whether there is a problem with the C# wrapper or it's something on my end.

Thanks!

@jywu-msft
Member

I tested a simple c# program adapted from https://github.com/microsoft/onnxruntime/tree/main/csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample
modified the .csproj to take a dependency on Microsoft.ML.OnnxRuntime.Gpu 1.19.2,
added a few lines to call AppendExecutionProvider_Tensorrt(), and loaded the embedded context model I created previously.
It all worked fine.
enabling verbose logging also showed the deserialization of the engine was successful.

2024-09-24 20:27:13.0867834 [I:onnxruntime:CSharpOnnxRuntime, tensorrt_execution_provider_utils.h:543 onnxruntime::TRTGenerateId] [TensorRT EP] Model name is TestModelEmbedfp16.onnx
2024-09-24 20:27:13.0986277 [I:onnxruntime:CSharpOnnxRuntime, tensorrt_execution_provider.cc:2058 onnxruntime::TensorrtExecutionProvider::GetSubGraph] [TensorRT EP] TensorRT subgraph MetaDef name TRTKernel_graph_trt_engine_wrapper_18103822421215258528_0
2024-09-24 20:27:13.1167376 [W:onnxruntime:CSharpOnnxRuntime, onnx_ctx_model_helper.cc:412 onnxruntime::TensorRTCacheModelHandler::ValidateEPCtxNode] It's suggested to set the ORT graph optimization level to 0 and make "embed_mode" to 0 ("ep_cache_context" is the cache path) for the best model loading time
2024-09-24 20:27:13.2147327 [V:onnxruntime:CSharpOnnxRuntime, onnx_ctx_model_helper.cc:285 onnxruntime::TensorRTCacheModelHandler::GetEpContextFromGraph] [TensorRT EP] Read engine as binary data from "ep_cache_context" attribute of ep context node and deserialized it
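
For what it's worth, the suggestion in that warning can be applied on the C# side roughly like this (a minimal sketch, not required for this repro; the model path is from my local test):

using Microsoft.ML.OnnxRuntime;

using var sessionOptions = new SessionOptions();
// Per the load-time warning: disable ORT graph optimizations when loading an EPContext model.
sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_DISABLE_ALL;
// TensorRT EP with default options on GPU 0.
sessionOptions.AppendExecutionProvider_Tensorrt(0);
using var session = new InferenceSession("TestModelEmbedfp16.onnx", sessionOptions);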

As you mentioned this was expected to work since it all ends up in the same C++ code.
it's great that you have C++ working. so it seems like the remaining issue is something with your C# environment.
The strange thing is you are getting a deserialization error, which would be coming from TensorRT itself.

@adaber
Author

adaber commented Sep 25, 2024

Thanks for letting me know. Can you please tell me what TRT session options you used?

@jywu-msft
Member

jywu-msft commented Sep 25, 2024

> Thanks for letting me know. Can you please tell me what TRT session options you used?

I tested with the basic api, which basically leaves all options at their defaults (deviceId = 0)

public void AppendExecutionProvider_Tensorrt(int deviceId = 0)

public void AppendExecutionProvider_Tensorrt(OrtTensorRTProviderOptions trtProviderOptions)

should work too
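
A rough sketch of the options-based overload, in case it's useful (the provider option keys here follow the TensorRT EP documentation and are just examples; adjust for your setup):

using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

using var trtOptions = new OrtTensorRTProviderOptions();
// Example TRT EP provider options; keys and values are illustrative only.
trtOptions.UpdateOptions(new Dictionary<string, string>
{
    { "device_id", "0" },
    { "trt_fp16_enable", "1" }
});
using var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_Tensorrt(trtOptions);
using var session = new InferenceSession("TestModelEmbed.onnx", sessionOptions);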

@adaber
Author

adaber commented Sep 25, 2024

Good to know. I am also using the default options. I will investigate things on my end and see what's causing this issue.

Thanks!

@adaber
Author

adaber commented Sep 25, 2024

And just to confirm: you are using TRT 10.4, CuDNN 9.4.0, CUDA 12.6, ONNX Runtime 1.19.2, right?

@jywu-msft
Member

Since you have things working for python and c++, it shouldn't be a dependency issue.
In any case, I'm using:
Microsoft.ML.OnnxRuntime.Gpu 1.19.2
TensorRT 10.4
CUDA 12.5 (this shouldn't matter, as you downloaded the TensorRT 10.4 for Windows version that's compatible with CUDA 12.0-12.6, correct?)
CuDNN 8.9.7 (but I was told by Nvidia CuDNN 9.x is compatible as well)

CuDNN is now optional for TensorRT.
See the note at https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#:~:text=cuDNN%20is%20now%20an%20optional%20dependency%20for%20TensorRT,require%20cuDNN%2C%20verify%20that%20you%20have%20it%20installed.
"cuDNN is now an optional dependency for TensorRT and is only used to speed up a few layers. If you require cuDNN, verify that you have it installed. Review the NVIDIA cuDNN Installation Guide for more information. TensorRT 10.4.0 supports cuDNN 8.9.7. cuDNN is not used by the lean or dispatch runtimes."

@adaber
Author

adaber commented Sep 25, 2024

I finally got it to work! And it wasn't ONNX Runtime whatsoever lol

So, it seems that there is some kind of TRT version incompatibility going on if I am not mistaken.

The TRT error happens when I try to load an embedded engine created by TRT 10.0.3 using ONNX Runtime C# (TRT 10.4), which I didn't expect to be a problem since TRT 10.4 is obviously newer than 10.0.3. However, ONNX Runtime C# (TRT 10.4) can load an embedded engine created by TRT 10.4.

I had used TRT 10.0.3 to create INT8 quantized models some time ago and had been trying to load those models, embedded in ONNX files, using ONNX Runtime C# (TRT 10.4) the whole time lol. I did create a TRT 10.4 engine when you asked me to try it in Python, but it never occurred to me that TRT 10.4 might not be able to deserialize a TRT 10.0.3 engine, so I didn't change anything in my ONNX Runtime C# test approach.

I ended up trying a freshly built TRT 10.4 engine out of desperation...and it worked lol

We can close this discussion unless you want to try what I did to see if there is indeed an issue when deserializing an embedded TRT 10.0.3 engine with ONNX Runtime (TRT 10.4). It'd be great if you could update this discussion with your findings.

I've mixed versions like this so many times with other DL frameworks/libraries that I suspected something else was going on here. Thank you for your help and patience, jywu-msft.

@jywu-msft
Member

jywu-msft commented Sep 25, 2024

oh yes. the TRT version matters. this is why when ORT generates engine files, we encode the TRT version in the filename, to check for compatibility.
since you are using a workflow where you import an engine built with native TRT, we aren't able to know which TRT version was used to generate it (thus the generic "cannot deserialize" error).
please read this section in the TensorRT documentation:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#version-compat
"By default, TensorRT engines are compatible only with the version of TensorRT with which they are built. With appropriate build-time configuration, engines that are compatible with later TensorRT versions can be built. TensorRT engines built with TensorRT 8 will also be compatible with TensorRT 9 and TensorRT 10 runtimes, but not vice versa."

you can read more about the recent support they added to relax the version compatibility and the ramifications.
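
(side note: if you ever need an engine that a newer TRT runtime can load, TensorRT's version-compatible build mode is the mechanism for that; with trtexec it's the --versionCompatible flag, but see the linked docs for the runtime-side requirements and limitations.)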

glad to hear you've resolved your issue! happy to help.

@adaber
Author

adaber commented Sep 25, 2024

Good to know they've been working on relaxing the version compatibility. I'd never had this issue before, but now I know lol

Thanks again for helping me solve this issue, jywu!

adaber closed this as completed Sep 25, 2024