Optimize GPU tensor support for Python backend #293
Conversation
Discussed with @krishung5 offline. It looks like one additional data copy introduced as part of this change is affecting single-model latency and throughput. @krishung5 is working on removing that extra copy and gathering the profiling numbers again.
Updated the functionality. Summarizing the logic for the GPU tensor transfer below in case it is helpful for reviewing. Different cases for the GPU tensor transfer:
- Inference
- BLS
Great work, Kris!
Currently, CUDA IPC calls dominate the time spent transferring GPU tensors between processes; in particular, cudaIpcOpenMemHandle and cudaIpcCloseMemHandle are called heavily. These calls are necessary because the allocated buffers do not come from a shared pool, so each transfer must open an interprocess memory handle exported from another process in order to obtain a device pointer usable in the local process.
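For reference, here is a rough sketch of the per-tensor IPC handle round trip described above; the function names and the message struct are illustrative assumptions, not the backend's actual code.

```cpp
// Sketch of the per-tensor CUDA IPC flow being optimized away (error
// checking omitted; names are illustrative, not the backend's actual code).
#include <cuda_runtime.h>
#include <cstddef>

struct TensorHandleMessage {
  cudaIpcMemHandle_t handle;  // exported handle, passed via CPU shared memory
  size_t byte_size;
};

// Exporting process: the buffer is not part of any shared pool, so a
// handle has to be exported for it individually.
void ExportTensor(void* gpu_buffer, size_t byte_size, TensorHandleMessage* msg)
{
  cudaIpcGetMemHandle(&msg->handle, gpu_buffer);
  msg->byte_size = byte_size;
}

// Importing process: every tensor pays for an open/close pair, and these
// two calls are what dominate the transfer time in profiling.
void ImportTensor(const TensorHandleMessage& msg, void* dst_gpu_buffer)
{
  void* mapped = nullptr;
  cudaIpcOpenMemHandle(&mapped, msg.handle, cudaIpcMemLazyEnablePeerAccess);
  cudaMemcpy(dst_gpu_buffer, mapped, msg.byte_size, cudaMemcpyDeviceToDevice);
  cudaIpcCloseMemHandle(mapped);
}
```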
This PR makes use of Triton's CUDA shared memory pool for GPU tensor transfers. The parent process will get the base address of the CUDA pool and share it with the stub. Here's how the data transfer process works:
When the parent process wants to send a tensor to the stub using the pool, it stores the data, calculates the offset, and shares this offset with the stub. The stub process then uses this offset to retrieve the data.
Because only the parent process can allocate from the CUDA pool, when the stub needs to send a tensor it first notifies the parent of the data's byte size via an IPC message. The parent then pre-allocates a buffer from the pool and sends the calculated offset back to the stub. Finally, the stub fills the buffer with the tensor data and notifies the parent once it is done.
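To illustrate the scheme (the names below are hypothetical, not the backend's real types): once the stub has mapped the pool's base address, only plain integer offsets need to cross the process boundary, and no per-tensor cudaIpcOpenMemHandle/cudaIpcCloseMemHandle calls are required.

```cpp
// Sketch of offset-based transfer over a shared CUDA memory pool.
// All names are hypothetical; the real backend uses its own shared-memory
// manager and IPC message types.
#include <cstddef>
#include <cstdint>

// Parent process: owns the pool and hands out offsets. The pool's device
// memory is exported to the stub exactly once at startup.
struct CudaPool {
  char* base = nullptr;   // device pointer to the pool's base address
  size_t used = 0;
  size_t capacity = 0;
};

// Parent side: reserve room for a tensor and return the offset. Only this
// plain integer needs to be sent to the stub for each tensor.
std::uint64_t ParentAllocate(CudaPool& pool, size_t byte_size)
{
  std::uint64_t offset = pool.used;  // simplistic bump allocation
  pool.used += byte_size;
  return offset;
}

// Stub side: the pool base was mapped once (e.g. a single
// cudaIpcOpenMemHandle at startup), so resolving a tensor is just pointer
// arithmetic with no per-tensor IPC handle calls.
void* StubResolve(char* stub_mapped_base, std::uint64_t offset)
{
  return stub_mapped_base + offset;
}
```

The stub-to-parent direction uses the same arithmetic; the additional IPC messages described above only carry the tensor's byte size and the offset the parent allocates in response.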
Testing: triton-inference-server/server#6276