Adding DtoD support to CUDA shared memory feature #58

Open · wants to merge 1 commit into main
Conversation

rarzumanyan

Hello world,

This patch:

  • Adds DtoD CUDA memcpy support, which makes it possible to pass video frames decoded in CUDA memory to Triton.
  • Changes the API of the set_shared_memory_region function to accept CUDA memory in addition to numpy arrays.

I'm not certain about the changes to the Python module API, so please feel free to give me guidance on how to implement it better.

@tanmayv25
Contributor

@rarzumanyan Can you describe how your changes work? For device=gpu, what is input_values for set_shared_memory_region?

I would suggest adding an example demonstrating the DtoD copies, similar to simple_http_cudasm_client. You could have some transitive flow of data on the GPU. It would be useful for testing and for understanding the API as well.

@rarzumanyan
Author

Hi @tanmayv25

Can you describe how your changes work?

Sure.
The main change in the patch is the CudaSharedMemoryRegionSetDptr function, which acts very similarly to CudaSharedMemoryRegionSet.

The only difference between them is that CudaSharedMemoryRegionSet performs an HtoD CUDA memcpy, while CudaSharedMemoryRegionSetDptr performs a DtoD CUDA memcpy.
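
To illustrate, here's a minimal Python sketch using ctypes against the CUDA runtime (illustration only, not the actual C++ implementation; the kind values are the standard cudaMemcpyKind constants):

import ctypes

# Load the CUDA runtime (Linux library name assumed).
cudart = ctypes.CDLL("libcudart.so")

# Standard cudaMemcpyKind values.
CUDA_MEMCPY_HOST_TO_DEVICE = 1    # what CudaSharedMemoryRegionSet uses
CUDA_MEMCPY_DEVICE_TO_DEVICE = 3  # what CudaSharedMemoryRegionSetDptr uses

def copy_into_region(dst_dptr, src_ptr, nbytes, kind):
    # cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)
    err = cudart.cudaMemcpy(ctypes.c_void_p(dst_dptr),
                            ctypes.c_void_p(src_ptr),
                            ctypes.c_size_t(nbytes),
                            ctypes.c_int(kind))
    if err != 0:
        raise RuntimeError(f"cudaMemcpy failed with error code {err}")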

For device=gpu, what is the input_values for set_shared_memory_region?

When device == gpu, the set_shared_memory_region function expects the input pointer to be a memory region in vRAM, allocated with the CUDA driver or runtime API.
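
For example, usage under the proposed API could look like this (a sketch: torch is just one way to obtain a CUDA device pointer, and the device='gpu' argument is what this PR proposes, not an existing tritonclient API):

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

# Region sized for one float32 CHW frame (shape assumed for illustration).
byte_size = 4 * 3 * 224 * 224
shm_handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)

# A GPU-resident frame; data_ptr() returns a raw CUDA device pointer into vRAM.
frame = torch.zeros((3, 224, 224), dtype=torch.float32, device="cuda:0")

# Proposed call: a single device pointer instead of a list of numpy arrays,
# triggering a DtoD copy into the shared memory region.
cudashm.set_shared_memory_region(shm_handle, frame.data_ptr(), device='gpu')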

I would suggest adding an example demonstrating the DtoD copies

Makes total sense; I just want to make sure you're happy with the proposed API changes first.
As soon as that is sorted out, I'll add one more patch with a sample to this PR.

@rarzumanyan
Author

Hi @CoderHam

Why not use a device flag and condition on that to use cudaMemcpyDeviceToDevice or cudaMemcpyHostToDevice?

I'm perfectly fine with adding an extra flag. However, I didn't add one for the following reasons:

  • I didn't want to contaminate the namespace with yet another enum for device identification and export it to Python.
  • cudaMemcpy has an enum to identify the memcpy direction, but not all of its values are relevant here.
  • An integer or boolean flag isn't easy to follow.

That's why I decided to introduce another function instead; a sketch of the flag-based alternative is below for comparison.
If you want the CudaSharedMemoryRegionSet signature to be extended with a device flag anyway, please let me know and I'll do that.
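
The flag-based alternative might look roughly like this (hypothetical binding names _crshm_set and _crshm_set_dptr; this is a sketch of the design I decided against, not what the patch implements):

def set_shared_memory_region(cuda_shm_handle, input_values, device='cpu'):
    if device == 'cpu':
        # input_values is a list/tuple of numpy arrays, copied HtoD one by one.
        offset = 0
        for input_value in input_values:
            nbytes = input_value.size * input_value.itemsize
            _crshm_set(cuda_shm_handle, offset, nbytes, input_value)  # hypothetical HtoD binding
            offset += nbytes
    elif device == 'gpu':
        # input_values is a single CUDA device pointer; a raw pointer carries no
        # length, so the whole region would be copied DtoD from that address.
        _crshm_set_dptr(cuda_shm_handle, 0, input_values)  # hypothetical DtoD binding
    else:
        _raise_error("unsupported device type: cpu and gpu are supported only")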

@CoderHam CoderHam changed the title Adding DtoD cupport to CUDA shared memory feature Adding DtoD support to CUDA shared memory feature Feb 16, 2022
@CoderHam
Contributor

@rarzumanyan on second thought, your arguments for not contaminating the namespace with another enum are valid. You can just make the new changes I suggested for handling the input_values field specially for the case where it is a (CUDA) device pointer.

_raise_error(
    "unsupported device type: cpu and gpu are supported only")

offset_current = 0
for input_value in input_values:
Contributor

Is input_values a list of cuda dptrs or a single cuda dptr? I think a single cuda device ptr is sufficient.

Author

@rarzumanyan Feb 18, 2022

It's a single cuda device ptr.

@@ -173,20 +181,27 @@ def set_shared_memory_region(cuda_shm_handle, input_values):
"input_values must be specified as a list/tuple of numpy arrays"
Contributor

Need to expand the checks here to allow for a single cuda device ptr instead of a list or tuple of numpy arrays.
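
Something along these lines could work (a sketch of the expanded check, not the final code):

if device == 'cpu':
    if not isinstance(input_values, (list, tuple)):
        _raise_error(
            "input_values must be specified as a list/tuple of numpy arrays")
elif device == 'gpu':
    if not isinstance(input_values, int):
        _raise_error(
            "input_values must be a single CUDA device pointer (int) when device='gpu'")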

-def set_shared_memory_region(cuda_shm_handle, input_values):
-    """Copy the contents of the numpy array into the cuda shared memory region.
+def set_shared_memory_region(cuda_shm_handle, input_values, device='cpu'):
+    """Copy the contents of the numpy array or cuda dptr into the cuda shared memory region.
Contributor

Correction: "Copy the contents of the list/tuple of numpy arrays or the content pointed to by a cuda device pointer into the cuda shared memory region."

@matthewkotila
Contributor

matthewkotila commented Nov 3, 2022

@tanmayv25 who would be a good person to task with reviewing the remainder of this code/review? I'm not familiar with this kind of code.

@matthewkotila
Contributor

@jbkyang-nvi, @tanmayv25 has suggested you might be a good person to request a review from for this code. Can you help with that?

@matthewkotila
Contributor

@rarzumanyan are you able to add testing for this feature?

@muqishan

Hello, I ran into the same problem. In one of my face recognition projects the batch size is large (batch > 64), and the recognition model depends on the results of the detection model, so I do the detection model's post-processing (NMS, etc.) on the GPU with cupy or torch. But because Triton's official cudashm.set_shared_memory_region(shm_input_handle, [img]) only accepts numpy arrays for the img object, I have to move the post-processed data back to the CPU. That's what I had been doing until I came across your work. Since I'm a novice in C++, I don't know how to modify the code so that Python tritonclient's cudashm supports uploading tensor data that is already on the GPU. Could you tell me how to build a Python module from your branch so that my Python program can send tensors on the GPU to Triton? Thanks a lot. @rarzumanyan
