Hi everyone! Thanks for the amazing Bento project. I'm currently in the process of switching all my deployed models to bentos, and I ran into a problem with one of my models (the SadTalker video generator).
For a single request, I need to generate a video from an image and an audio clip. Unfortunately, all the data for one request doesn't fit on my GPU at once, so I process it in chunks.
My batch loop looks like this:
# Run the generator once per driving-keypoint chunk so a single request
# doesn't exceed GPU memory; the calls are dispatched concurrently.
predictions = await asyncio.gather(
    *(
        self.generator_runner.async_run(
            source_image,
            kp_source_tensor,
            kp_driving_tensor,
        )
        for kp_driving_tensor in kp_driving_tensors
    )
)
Here, source_image, kp_source_tensor, and kp_driving_tensor are batches of tensors.
The generator is a custom PyTorch model. The export to the Bento model looks like this:
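Roughly like this (simplified: build_generator() and the checkpoint path are placeholders for my actual SadTalker loading code, and the model name is just an example):

import bentoml
import torch

# build_generator() and the checkpoint path are placeholders for the real
# SadTalker generator construction and weight loading.
generator = build_generator()
generator.load_state_dict(
    torch.load("checkpoints/generator.pth", map_location="cpu")
)
generator.eval()

# Save the plain torch.nn.Module as a Bento model; each call passes a
# full tensor chunk, so adaptive batching is left disabled.
bentoml.pytorch.save_model(
    "sadtalker_generator",
    generator,
    signatures={"__call__": {"batchable": False}},
)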
When I initialise my service with the generator runner, it runs roughly 1.5x slower than the same code with the generator initialised locally, i.e. without wrapping it in a BentoML runner.
Initially, all tensors are on the same GPU as the generator model. From what I understand, a lot of time is wasted on data transfer (collecting the tensors from the GPU and then copying them back to the same GPU), or something is wrong with the asynchronous calls to the model runner. What would be the optimal solution for my case?
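For reference, the runner-based service is wired up roughly like this (everything is simplified: the IO descriptors, names, and preprocessing are placeholders):

import bentoml

# Load the saved model and expose it as a runner.
generator_runner = bentoml.pytorch.get("sadtalker_generator:latest").to_runner()

svc = bentoml.Service("sadtalker", runners=[generator_runner])

@svc.api(
    input=bentoml.io.Multipart(image=bentoml.io.Image(), audio=bentoml.io.File()),
    output=bentoml.io.File(),
)
async def generate(image, audio):
    # Preprocessing (omitted) produces source_image, kp_source_tensor and the
    # list of kp_driving_tensors, already on the same GPU as the generator.
    # The batch loop shown above then calls generator_runner.async_run(...)
    # once per kp_driving_tensor chunk and assembles the video from the
    # predictions.
    ...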