Optimize GPU tensor support for Python backend #293

Merged: 16 commits into main on Oct 25, 2023

Conversation

@krishung5 (Contributor) commented Aug 31, 2023

Currently, the CUDA IPC calls dominate the time for transferring GPU tensors between processes. Specifically, the functions cudaIpcOpenMemHandle and cudaIpcCloseMemHandle are heavily used. These functions are necessary because the allocated buffers are not in the same pool. So, we need to call these functions to open an interprocess memory handle exported from another process and get a device pointer that can be used in the local process.
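
For reference, a minimal sketch of the per-tensor IPC pattern described above; the function names here are illustrative and not the backend's actual code:

```cpp
// Illustrative sketch only: the per-tensor CUDA IPC pattern this PR moves away
// from. Every GPU tensor transfer pays for opening and closing an interprocess
// memory handle exported by the other process.
#include <cuda_runtime_api.h>

void* OpenRemoteTensor(const cudaIpcMemHandle_t& handle)
{
  void* device_ptr = nullptr;
  // Open the handle exported from another process to get a device pointer
  // usable in this process; this call is expensive when issued per tensor.
  if (cudaIpcOpenMemHandle(
          &device_ptr, handle, cudaIpcMemLazyEnablePeerAccess) != cudaSuccess) {
    return nullptr;
  }
  return device_ptr;
}

void CloseRemoteTensor(void* device_ptr)
{
  // Every successful cudaIpcOpenMemHandle must be paired with a close.
  cudaIpcCloseMemHandle(device_ptr);
}
```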

This PR makes use of Triton's CUDA shared memory pool for GPU tensor transfers. The parent process will get the base address of the CUDA pool and share it with the stub. Here's how the data transfer process works:

  • Data transfer from parent to stub
    When the parent process wants to send a tensor to the stub using the pool, it stores the data in the pool, calculates the offset from the pool's base address, and shares this offset with the stub. The stub process then uses this offset to retrieve the data (see the offset sketch after this list).
  • Data transfer from stub to parent
    Because only the parent process can interact with the CUDA pool memory allocation, the stub first notifies the parent about the byte size of the data using an IPC message. Then, the parent pre-allocates a buffer from the memory pool and communicates the calculated offset details back to the stub. Afterward, the stub fills the buffer with the tensor data and notifies the parent once the task is done.
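
A minimal sketch of the offset arithmetic described in the two bullets above, assuming only that both processes see the same pool and differ in the base address; the function names are illustrative:

```cpp
// Illustrative offset arithmetic for the CUDA shared memory pool.
// The parent shares only a byte offset; each process applies that offset to
// its own mapping of the pool's base address.
#include <cstdint>

// Parent side: convert a pointer allocated from the pool into an offset.
std::uint64_t PointerToOffset(void* pool_base, void* tensor_ptr)
{
  return static_cast<std::uint64_t>(
      reinterpret_cast<std::uintptr_t>(tensor_ptr) -
      reinterpret_cast<std::uintptr_t>(pool_base));
}

// Stub side: convert the received offset back into a usable device pointer.
void* OffsetToPointer(void* pool_base, std::uint64_t offset)
{
  return reinterpret_cast<char*>(pool_base) + offset;
}
```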

Testing: triton-inference-server/server#6276

@krishung5 marked this pull request as ready for review on September 1, 2023 17:56
@Tabrizian (Member) commented:

Discussed with @krishung5 offline. It looks like there is one additional data copy introduced as part of this change, which is affecting single-model latency and throughput. @krishung5 is working on removing that extra copy and gathering profiling numbers again.

@krishung5 requested a review from @Tabrizian on September 12, 2023 08:55
@krishung5 (Contributor, Author) commented Oct 5, 2023

Updated the functionality of PbMemory: when creating a PbMemory object, it only sets the cuda_pool_offset if the data is allocated from the CUDA pool; when loading a PbMemory object from shared memory, it returns a pointer based on the offset. No extra data copy or logic happens inside PbMemory.
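
A rough sketch of that behavior, using a hypothetical class and member names rather than the actual PbMemory implementation:

```cpp
// Hypothetical sketch of the behavior described above, not the real PbMemory.
// The offset is recorded only when the data lives in the CUDA pool, and
// loading resolves the pointer from base + offset without any extra copy.
#include <cstddef>
#include <cstdint>
#include <optional>

class PbMemorySketch {
 public:
  // On creation: set the offset only if the buffer was allocated from the pool.
  PbMemorySketch(void* data, void* pool_base, std::size_t pool_size)
      : data_(data)
  {
    const auto addr = reinterpret_cast<std::uintptr_t>(data);
    const auto base = reinterpret_cast<std::uintptr_t>(pool_base);
    if (addr >= base && addr < base + pool_size) {
      cuda_pool_offset_ = addr - base;
    }
  }

  // On load: if an offset was recorded, compute the pointer from the local
  // view of the pool base plus the offset; otherwise return the original one.
  void* DataPtr(void* local_pool_base) const
  {
    if (cuda_pool_offset_) {
      return reinterpret_cast<char*>(local_pool_base) + *cuda_pool_offset_;
    }
    return data_;
  }

 private:
  void* data_;
  std::optional<std::uint64_t> cuda_pool_offset_;
};
```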

Summarizing the logic for GPU tensor transfer below in case it is helpful for review.

Different cases for the GPU tensor transfer:

Inference

  • input
    • non-decoupled - Get the input buffer from input_collector. The input_collector allocates the buffer from the GPU pool first using BackendMemory, so no extra handling is needed here; if the buffer does not come from the CUDA pool, create a new BackendMemory from the pool and copy the input data from the original buffer into it.
    • decoupled - Allocate the GPU memory using BackendMemory, and call backend::ReadInputTensor to read the input tensor into the buffer.
  • output
    • non-decoupled - In the InferResponse::Send function, get the Triton-provided output buffer. If that buffer is not from the CUDA pool, try to allocate a new buffer from the pool and add it to gpu_buffer_helper. Once the stub fills in the pool buffer, the output tensor needs to be copied back to the Triton-provided buffer (see the copy-back sketch after this list).
    • decoupled - Same as the non-decoupled case. The final copy back happens here.
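
A minimal sketch of that copy-back step, assuming a plain device-to-device cudaMemcpy; the real code path goes through BackendMemory and gpu_buffer_helper and may use streams:

```cpp
// Illustrative copy-back: once the stub has written the output tensor into a
// buffer from the CUDA pool, the parent copies it into the buffer Triton
// provided for the response.
#include <cstddef>
#include <cuda_runtime_api.h>

cudaError_t CopyPoolOutputToTritonBuffer(
    void* triton_output_buffer, const void* pool_buffer, std::size_t byte_size)
{
  return cudaMemcpy(
      triton_output_buffer, pool_buffer, byte_size, cudaMemcpyDeviceToDevice);
}
```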

BLS

  • input - No differences between decoupled and non-decoupled cases. Allocate memory using BackendMemory and add the buffer to gpu_buffer_helper.
  • output - The request_executor now uses BackendMemory to allocate the buffer for the BLS output. Both the non-decoupled and decoupled cases call ModelInstanceState::PrepareResponseHandle to prepare the response. It's possible that the CUDA memory pool hasn't been shared with the stub process at the time the BLS output is allocated during the callback, and there is no way to share the CUDA pool with the stub at that point since we are not passing the StubLauncher object to the ResponseAlloc callback (I thought it would be more complicated if we did, but I'm open to any ideas!). Hence, the CUDA pool offset is updated here after the associated PbMemory is created.

@krishung5 requested a review from @Tabrizian on October 5, 2023 10:09
@Tabrizian (Member) left a comment:


Great work, Kris!

@krishung5 merged commit 4c0a977 into main on Oct 25, 2023
3 checks passed
@krishung5 deleted the krish-python-gpu branch on October 25, 2023 22:15