DJL 0.23.0 + torch 2.0.1 GPU multi-threading inference issue #2778
Comments
This is a bug in the 0.23.0 container; please set the following environment variable:
See this PR: deepjavalibrary/djl-serving#1073
@frankfliu This is of no use to me. I did not use DJL Serving or the containers; I used pytorch-engine directly to call the TorchScript model. I tried setting the environment variable. I want to know what the root cause of this problem is, and the solution. I am worried that I will not be able to keep upgrading my DJL version because of this problem.
@jestiny0 Did you set the thread configuration for PyTorch: https://docs.djl.ai/master/docs/development/inference_performance_optimization.html#thread-configuration? You might also want to take a look at this: https://docs.djl.ai/master/docs/development/inference_performance_optimization.html#graph-executor-optimization
Are you running on GPU? If you use CUDA, you must set:
Did you try djl-bench? You can run a stress test with different PyTorch versions.
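For reference, these settings are typically applied as JVM system properties (or equivalent environment variables) before the PyTorch engine is initialized. The following is a minimal sketch only, assuming the property names documented in the linked guide; the values are illustrative and must be tuned per workload:

    // Hedged sketch: thread-configuration and graph-executor settings from the
    // DJL inference performance guide. Verify the property names against your
    // DJL version; call this before the PyTorch engine loads.
    public final class PyTorchEngineTuning {
        public static void configure() {
            // Intra-op and inter-op thread pool sizes used by PyTorch (illustrative values).
            System.setProperty("ai.djl.pytorch.num_threads", "2");
            System.setProperty("ai.djl.pytorch.num_interop_threads", "1");
            // Disable the JIT graph executor optimizer, whose per-thread warm-up
            // can cause latency spikes under multi-threaded inference.
            System.setProperty("ai.djl.pytorch.graph_optimizer", "false");
        }
    }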
@frankfliu However, we are now preparing to upgrade our offline models to PyTorch 2.0.1. Therefore, I tried to upgrade the online service to the latest DJL 0.23.0 + torch 2.0.1. It performs well on CPU, but encounters severe performance and memory issues on GPU. I have tried adding
@frankfliu
Can you run djl-bench with your model and see if you can reproduce the issue? We didn't observe the performance issue with PyTorch 2.0.1. Can you share your model?
@frankfliu I did not use djl-bench because my model has complex inputs, including multiple inputs and dictionaries. I am using the DJL PyTorch engine for deployment and serving, but I have not provided the Java code for that part because it is complex. DJL configs:
The configuration of the version running well online is as follows:
The stress test latency is shown below: The configuration of the version experiencing performance issues is as follows:
The stress test latency is shown below: It can be observed that latency increases, and when the stress test reaches 150 QPS the old version still runs normally, but the new version sees a significant increase in end-to-end latency; the Java threads are heavily queued, so the service cannot run normally.
@frankfliu Looking forward to any new findings and suggestions, thanks!
@jestiny0
@frankfliu
Here is the code for constructing the request. The provided model only has the code that runs the inference:
It is worth mentioning that, apart from the model-loading code, which only runs once, the code for constructing requests and running inference is executed in separate threads (a thread pool), so different requests are processed concurrently. Also, the above Java code remains unchanged before and after upgrading the DJL version.
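The actual Java code was not shared in the thread; a minimal sketch of the pattern described (load the model once, give each worker its own Predictor, since Predictor is not thread-safe) might look like this, with a hypothetical model path and input shape:

    // Hedged sketch, not the author's code: one shared ZooModel, per-task Predictor.
    import ai.djl.inference.Predictor;
    import ai.djl.ndarray.NDArray;
    import ai.djl.ndarray.NDList;
    import ai.djl.ndarray.NDManager;
    import ai.djl.ndarray.types.Shape;
    import ai.djl.repository.zoo.Criteria;
    import ai.djl.repository.zoo.ZooModel;

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ConcurrentInferenceSketch {

        public static void main(String[] args) throws Exception {
            Criteria<NDList, NDList> criteria =
                    Criteria.builder()
                            .setTypes(NDList.class, NDList.class)
                            .optModelPath(Paths.get("/models/ranking_model")) // hypothetical path
                            .optEngine("PyTorch")
                            .build();

            // The TorchScript model is loaded once and shared by all workers.
            try (ZooModel<NDList, NDList> model = criteria.loadModel()) {
                ExecutorService pool = Executors.newFixedThreadPool(8);
                for (int i = 0; i < 64; i++) {
                    pool.submit(() -> {
                        // One Predictor and NDManager per task; both are closed when done.
                        try (Predictor<NDList, NDList> predictor = model.newPredictor();
                                NDManager manager = NDManager.newBaseManager()) {
                            NDArray features = manager.zeros(new Shape(1, 128)); // hypothetical shape
                            NDList output = predictor.predict(new NDList(features));
                            // Post-process the output here, then release it.
                            output.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.MINUTES);
            }
        }
    }

The native weights live in the shared ZooModel, while each task gets a short-lived Predictor and NDManager so native memory is released deterministically when the task finishes.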
It's hard for me to figure out your input from the code. Can you write code to create empty tensors for the input (e.g., denseFeaturesIValue: (?, ?))?
@frankfliu Here is the code for generating the input in Python. You can refer to it and modify it to use DJL. All features are present in tensorFeatureIValue. The other three inputs (denseFeaturesIValue, sparseFeaturesIValue, and embeddingFeaturesIValue) are all empty.
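Since the Python snippet itself did not survive in this thread, here is a hedged DJL/Java sketch of what the described input construction might look like; the feature shape, the empty tensors for the three unused inputs, and the mapping of NDArray names onto the TorchScript signature are assumptions:

    import ai.djl.ndarray.NDArray;
    import ai.djl.ndarray.NDList;
    import ai.djl.ndarray.NDManager;
    import ai.djl.ndarray.types.Shape;

    public class InputSketch {
        // Mirrors the description above: tensorFeatureIValue carries the real
        // features, while the other three inputs are empty tensors.
        static NDList buildInput(NDManager manager) {
            NDArray tensorFeatures = manager.zeros(new Shape(1, 128)); // hypothetical shape
            tensorFeatures.setName("tensorFeatureIValue");

            NDArray dense = manager.zeros(new Shape(0));
            dense.setName("denseFeaturesIValue");
            NDArray sparse = manager.zeros(new Shape(0));
            sparse.setName("sparseFeaturesIValue");
            NDArray embedding = manager.zeros(new Shape(0));
            embedding.setName("embeddingFeaturesIValue");

            return new NDList(tensorFeatures, dense, sparse, embedding);
        }
    }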
@frankfliu
As we can see, under the 100 QPS scenario the latency has increased for the new version. Under the 250 QPS scenario the latency increases significantly and the GPU is unable to handle the load. This leads to a continuous build-up of requests in the Java thread pool queue, resulting in an unbounded increase in end-to-end latency. By the way, the data I provided above is from load testing, not real production traffic. I'm planning to divert part of the real traffic to observe the actual performance. Nonetheless, I'm still looking forward to any further findings or insights you may have to share.
@frankfliu Now I plan to narrow the investigation by focusing mainly on the upgrade from DJL 0.22.0 to 0.23.0 and the corresponding pytorch-jni upgrade from 2.0.0-0.22.0 to 2.0.1-0.23.0. I want to check what changes DJL made between these versions (it is difficult to troubleshoot PyTorch changes). If you have any further discoveries, I would appreciate it if you could share them.
Description
I have updated DJL to version 0.23.0 and PyTorch to version 2.0.1. However, I encountered an unbounded increase in end-to-end latency when performing multi-threaded inference on GPU. This seems to be a memory leak.
Currently, I am using DJL 0.22.0 with PyTorch 2.0.0, and I did not encounter any issues in the same stress testing environment.
I have looked into several similar issues, and some of them mentioned issues with both PyTorch 2.0.0 and PyTorch 2.0.1 regarding multi-threaded GPU inference. However, I personally did not encounter any issues with PyTorch 2.0.0.
Furthermore, I tried setting
export TORCH_CUDNN_V8_API_DISABLED=1
based on the referenced issue, but it did not resolve the problem for me.
My point
I noticed that the master code of DJL has already set PyTorch 2.0.1 as the default version. I'm curious to know if you have made any modifications to address this issue, or if there are other plans in place?