Adding DtoD support to CUDA shared memory feature #58

Open · wants to merge 1 commit into main
Conversation

rarzumanyan

Hello world,

This patch:

  • Adds DtoD CUDA memcpy support, which makes it possible to pass video frames decoded in CUDA memory to Triton.
  • Changes the API of the set_shared_memory_region function to accept CUDA memory in addition to numpy arrays.

I'm not certain about the changes to the Python module API, so please feel free to give me guidance on how to implement it better.

@tanmayv25
Contributor

@rarzumanyan Can you describe how your changes work? For device=gpu, what is input_values for set_shared_memory_region?

I would suggest adding an example demonstrating the DtoD copies, similar to simple_http_cudasm_client. You could have some transitive flow of data on the GPU. It would be useful for testing and for understanding the API as well.

@rarzumanyan
Author

Hi @tanmayv25

Can you describe how your changes work?

Sure.
The main change in the patch is the CudaSharedMemoryRegionSetDptr function, which acts very similarly to CudaSharedMemoryRegionSet.

The only difference between them is that CudaSharedMemoryRegionSet performs an HtoD CUDA memcpy, while CudaSharedMemoryRegionSetDptr performs a DtoD CUDA memcpy.
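
To illustrate, here's a minimal Python sketch using ctypes against the CUDA runtime (illustration only, not the actual C++ implementation; the kind values are the standard cudaMemcpyKind constants):

import ctypes

# Load the CUDA runtime (Linux library name assumed).
cudart = ctypes.CDLL("libcudart.so")

# Standard cudaMemcpyKind values.
CUDA_MEMCPY_HOST_TO_DEVICE = 1    # what CudaSharedMemoryRegionSet uses
CUDA_MEMCPY_DEVICE_TO_DEVICE = 3  # what CudaSharedMemoryRegionSetDptr uses

def copy_into_region(dst_dptr, src_ptr, nbytes, kind):
    # cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)
    err = cudart.cudaMemcpy(ctypes.c_void_p(dst_dptr),
                            ctypes.c_void_p(src_ptr),
                            ctypes.c_size_t(nbytes),
                            ctypes.c_int(kind))
    if err != 0:
        raise RuntimeError(f"cudaMemcpy failed with error code {err}")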

For device=gpu, what is the input_values for set_shared_memory_region?

When device == gpu, the set_shared_memory_region function expects the input pointer to be a memory region in vRAM, allocated with the CUDA driver or runtime API.
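
For example, usage under the proposed API could look like this (a sketch: torch is just one way to obtain a CUDA device pointer, and the device='gpu' argument is what this PR proposes, not an existing tritonclient API):

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

# Region sized for one float32 CHW frame (shape assumed for illustration).
byte_size = 4 * 3 * 224 * 224
shm_handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)

# A GPU-resident frame; data_ptr() returns a raw CUDA device pointer into vRAM.
frame = torch.zeros((3, 224, 224), dtype=torch.float32, device="cuda:0")

# Proposed call: a single device pointer instead of a list of numpy arrays,
# triggering a DtoD copy into the shared memory region.
cudashm.set_shared_memory_region(shm_handle, frame.data_ptr(), device='gpu')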

I would suggest adding an example demonstrating the DtoD copies

Makes total sense; I just want to make sure you're happy with the proposed API changes first.
As soon as that is sorted out, I'll add one more patch with a sample to this PR.

@rarzumanyan
Author

Hi @CoderHam

Why not use a device flag and condition on that to use cudaMemcpyDeviceToDevice or cudaMemcpyHostToDevice?

I'm perfectly fine with adding an extra flag. However, I didn't add one for the following reasons:

  • I didn't want to contaminate the namespace with yet another enum for device identification and export it to Python.
  • cudaMemcpy has an enum to identify the memcpy direction, but not all of its values are relevant here.
  • An integer or boolean flag isn't easy to follow.

That's why I decided to introduce another function instead; a sketch of the flag-based alternative is below for comparison.
If you want the CudaSharedMemoryRegionSet signature to be extended with a device flag anyway, please let me know and I'll do that.
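
The flag-based alternative might look roughly like this (hypothetical binding names _crshm_set and _crshm_set_dptr; this is a sketch of the design I decided against, not what the patch implements):

def set_shared_memory_region(cuda_shm_handle, input_values, device='cpu'):
    if device == 'cpu':
        # input_values is a list/tuple of numpy arrays, copied HtoD one by one.
        offset = 0
        for input_value in input_values:
            nbytes = input_value.size * input_value.itemsize
            _crshm_set(cuda_shm_handle, offset, nbytes, input_value)  # hypothetical HtoD binding
            offset += nbytes
    elif device == 'gpu':
        # input_values is a single CUDA device pointer; a raw pointer carries no
        # length, so the whole region would be copied DtoD from that address.
        _crshm_set_dptr(cuda_shm_handle, 0, input_values)  # hypothetical DtoD binding
    else:
        _raise_error("unsupported device type: cpu and gpu are supported only")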

@CoderHam CoderHam changed the title Adding DtoD cupport to CUDA shared memory feature Adding DtoD support to CUDA shared memory feature Feb 16, 2022
@CoderHam
Contributor

@rarzumanyan on second thought, your arguments for not contaminating the namespace with another enum are valid. You can just make the new changes I suggested for handling the input_values field specially for the case where it is a (CUDA) device pointer.

_raise_error(
    "unsupported device type: cpu and gpu are supported only")

offset_current = 0
for input_value in input_values:
Contributor

Is input_values a list of cuda dptrs or a single cuda dptr? I think a single cuda device ptr is sufficient.

Author

@rarzumanyan Feb 18, 2022

It's a single cuda device ptr.

@@ -173,20 +181,27 @@ def set_shared_memory_region(cuda_shm_handle, input_values):
"input_values must be specified as a list/tuple of numpy arrays"
Contributor

Need to expand the checks here to allow for a single cuda device ptr instead of a list or tuple of numpy arrays.
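
Something along these lines could work (a sketch of the expanded check, not the final code):

if device == 'cpu':
    if not isinstance(input_values, (list, tuple)):
        _raise_error(
            "input_values must be specified as a list/tuple of numpy arrays")
elif device == 'gpu':
    if not isinstance(input_values, int):
        _raise_error(
            "input_values must be a single CUDA device pointer (int) when device='gpu'")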

-def set_shared_memory_region(cuda_shm_handle, input_values):
-    """Copy the contents of the numpy array into the cuda shared memory region.
+def set_shared_memory_region(cuda_shm_handle, input_values, device='cpu'):
+    """Copy the contents of the numpy array or cuda dptr into the cuda shared memory region.
Contributor

Correction: "Copy the contents of the list/tuple of numpy arrays or the content pointed to by a cuda device pointer into the cuda shared memory region."

@matthewkotila
Contributor

matthewkotila commented Nov 3, 2022

@tanmayv25 who would be a good person to task with reviewing the remainder of this code/review? I'm not familiar with this kind of code.

@matthewkotila
Contributor

@jbkyang-nvi, @tanmayv25 has suggested you might be a good person to request a review from for this code. Can you help with that?

@matthewkotila
Contributor

@rarzumanyan are you able to add testing for this feature?

@muqishan

Hello, I ran into the same problem. In one of my face recognition projects the batch size is large (batch > 64), and the recognition model depends on the results of the detection model, so I do the detection model's post-processing (NMS, etc.) on the GPU with cupy or torch. But because Triton's official cudashm.set_shared_memory_region(shm_input_handle, [img]) only accepts numpy arrays for the img object, I have to move the post-processed data back to the CPU. That's what I had been doing until I came across your work. Since I'm a novice in C++, I don't know how to modify the code so that Python tritonclient's cudashm supports uploading tensor data that is already on the GPU. Could you tell me how to build a Python module from your branch so that my Python program can send tensors on the GPU to Triton? Thanks a lot. @rarzumanyan
