Hi everyone! Thanks for the amazing Bento project. I'm currently in the process of switching all my deployed models to bentos, and I ran into a problem with one of my models (the SadTalker video generator).
For a single request, I need to generate a video from an image and an audio clip. Unfortunately, all the data for one request doesn't fit on my GPU at once, so I process it in chunks.
My batch loop looks like this:
# Run the generator once per driving-keypoint chunk so a single request
# doesn't exceed GPU memory; the calls are dispatched concurrently.
predictions = await asyncio.gather(
    *(
        self.generator_runner.async_run(
            source_image,
            kp_source_tensor,
            kp_driving_tensor,
        )
        for kp_driving_tensor in kp_driving_tensors
    )
)
Here, source_image, kp_source_tensor, and kp_driving_tensor are batches of tensors.
The generator is a custom PyTorch model. The export to the Bento model looks like this:
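Roughly like this (simplified: build_generator() and the checkpoint path are placeholders for my actual SadTalker loading code, and the model name is just an example):

import bentoml
import torch

# build_generator() and the checkpoint path are placeholders for the real
# SadTalker generator construction and weight loading.
generator = build_generator()
generator.load_state_dict(
    torch.load("checkpoints/generator.pth", map_location="cpu")
)
generator.eval()

# Save the plain torch.nn.Module as a Bento model; each call passes a
# full tensor chunk, so adaptive batching is left disabled.
bentoml.pytorch.save_model(
    "sadtalker_generator",
    generator,
    signatures={"__call__": {"batchable": False}},
)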
When I initialise my service with the generator runner, it runs roughly 1.5x slower than the same code with the generator initialised locally, i.e. without wrapping it in a BentoML runner.
Initially, all tensors are on the same GPU as the generator model. From what I understand, a lot of time is wasted on data transfer (collecting the tensors from the GPU and then copying them back to the same GPU), or something is wrong with the asynchronous calls to the model runner. What would be the optimal solution for my case?
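For reference, the runner-based service is wired up roughly like this (everything is simplified: the IO descriptors, names, and preprocessing are placeholders):

import bentoml

# Load the saved model and expose it as a runner.
generator_runner = bentoml.pytorch.get("sadtalker_generator:latest").to_runner()

svc = bentoml.Service("sadtalker", runners=[generator_runner])

@svc.api(
    input=bentoml.io.Multipart(image=bentoml.io.Image(), audio=bentoml.io.File()),
    output=bentoml.io.File(),
)
async def generate(image, audio):
    # Preprocessing (omitted) produces source_image, kp_source_tensor and the
    # list of kp_driving_tensors, already on the same GPU as the generator.
    # The batch loop shown above then calls generator_runner.async_run(...)
    # once per kp_driving_tensor chunk and assembles the video from the
    # predictions.
    ...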