How do we scale runner independently of the api server without Yatai? #3303

cadmusthefounder · 2022-12-03T07:36:03Z

In the docs, it is mentioned that runners are

a unit of computation that can be executed on a remote Python worker and scales independently

and that

while the standalone BentoServer schedules Runner workers on their own Python processes, the BentoDeployment created by Yatai, scales Runner workers in their own group of pods and made it possible to set a different resource requirement for each Runner, and auto-scaling each Runner separately based on their workloads.

Based on the above and the architecture diagram, it seems possible to scale the api servers and runners separately in different pods. I am confused about how they communicate and how data is being transferred between them.

I understand that the recommended way is to use Yatai, but since it requires additional resources and we already have a horizontal scaling solution in place, I was wondering whether there is a more direct way of defining the service and runner resources.

KimSoungRyoul · 2022-12-06T03:39:27Z

As far as I know, there is no option to increase the number of runner processes, but here is one way to do what you want.

BentoML provides the following CLI commands:

bentoml start-runner-server
bentoml start-http-server
bentoml start-grpc-server

With these, you can deploy in the same way as the architecture diagram, with the components talking over TCP instead of a Unix socket.

Using the commands below, the runner containers and the HTTP server can be placed in separate pods, and the runner containers can be scaled up on their own.


docker run -d --name iris-model1-runner --network="bento-test-network" -p 3001:3000 iris-bento:latest start-runner-server --runner-name iris-model1

docker run -d --name iris-model2-runner --network="bento-test-network" -p 3002:3000 iris-bento:latest start-runner-server --runner-name iris-model2


docker run -d --name iris-http-server --network="bento-test-network" -p 3003:3000 iris-bento:latest start-http-server --bind tcp://0.0.0.0:3000 --runner-map '{"iris-model1":"tcp://iris-model1-runner:3000","iris-model2":"tcp://iris-model2-runner:3000"}' --working-dir .
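
The JSON passed to `--runner-map` gets awkward to write by hand as runners are added. Here is a minimal sketch of generating it, assuming each runner server is reachable at `tcp://<hostname>:<port>` on the shared network. The helper name `build_runner_map` is hypothetical and not part of BentoML.

```python
import json
import shlex


def build_runner_map(runners: dict[str, str], port: int = 3000) -> str:
    """Return a JSON string mapping runner name -> tcp:// address.

    `runners` maps each runner name to the hostname (container or
    Kubernetes service name) where its runner server listens.
    """
    return json.dumps(
        {name: f"tcp://{host}:{port}" for name, host in runners.items()}
    )


runner_map = build_runner_map(
    {"iris-model1": "iris-model1-runner", "iris-model2": "iris-model2-runner"}
)
# shlex.quote makes the JSON safe to paste into a shell command line.
print(f"--runner-map {shlex.quote(runner_map)}")
```

If each runner sits behind its own load-balanced hostname (for example a Kubernetes Service), the map can stay the same while the runner replicas behind that hostname are scaled by your existing autoscaler.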

5 replies

hmbui-noze Feb 23, 2023

Thank you for this answer. May I ask another question? As far as I know, we can spawn several api workers, but we have no option to choose the number of runner instances for a model, so is it safe to assume there is always only one instance? Is it a bottleneck here? Especially in case we have a powerful gpu with a large memory?

KimSoungRyoul Mar 26, 2023

Hi, I saw your comment too late, but I'm posting this in the hope that someone else will find it helpful.

> we have no option to choose the number of runner instances for a model,

The performance of a BentoML Runner is determined by the ML framework used under the hood. BentoML's Runnable class declares which hardware resources each ML framework (e.g. PyTorch, TensorFlow, scikit-learn) supports.

# bentoml/_internal/frameworks/common/pytorch.py
class PytorchModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    # When this is True, the Runner process is not forked, so a PyTorch
    # model Runner is always spawned as a single process.
    SUPPORTS_CPU_MULTI_THREADING = True

For example, PyTorch's TorchScript (CPU) inference supports multi-threading.
This means that BentoML does not need to create more than one process when running a PyTorch model, because PyTorch handles the multi-threading internally.

See the PyTorch documentation on CPU threading and TorchScript inference.

If you want to control the number of threads yourself, you can write something like this:

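import bentoml
import torch
import torch.nn as nn

class IrisModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Pin the number of intra-op CPU threads PyTorch uses.
        torch.set_num_threads(4)

    def forward(self, x):
        ...  # model computation goes here

iris_model = IrisModule()

# save_model takes the model name first; "iris_model" is a placeholder name.
bentoml.pytorch.save_model("iris_model", iris_model)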

However, fixing the number of threads this way is not recommended, because PyTorch calculates the optimal number of threads from the physical resources available to it.

AFAIK all the major ML frameworks support multi-threading for models. (Someone please correct me if I'm wrong...)

Because of this, increasing the number of Runners inside the same physical instance (such as an EC2 instance or a container) does not improve inference performance, and having only one Runner instance is not a bottleneck, since the framework already multi-threads the computation internally.

The example above is for CPU inference, but the same is true for GPUs.

The performance of the BentoML runner is determined by the ML framework used, so it depends on what you write in the forward() method of your PyTorch Module, as in the example below.

import bentoml
import torch
import torch.nn as nn

class GPUInferenceExampleModule(nn.Module):
    def __init__(self, model_path):
        super().__init__()
        # Use the GPU when one is available, otherwise fall back to the CPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = torch.load(model_path).to(self.device)
        # Any thread/process count settings would go here ...

    def forward(self, input_data):
        # Runner performance depends on what you do in forward()!
        input_data = input_data.to(self.device)

        with torch.no_grad():
            output = self.model(input_data)

        # Move the result back to the CPU before returning it.
        output = output.to("cpu")

        return output

# "path/to/model.pt" and "gpu_inference_example" are placeholders.
model = GPUInferenceExampleModule("path/to/model.pt")

bentoml.pytorch.save_model("gpu_inference_example", model)

> Especially in case we have a powerful gpu with a large memory?

If you have a single instance with a powerful GPU and a large amount of memory, you can use all of its computing resources simply by running with the --production option (e.g. bentoml serve service:svc --production).

BentoML then automatically sets the number of API workers and runner threads based on the instance's CPU count or GPUs (in the GPU case, it depends on what you do in GPUInferenceExampleModule).

hmbui-noze Mar 31, 2023

I really appreciate your detailed answer, and I understand what you are saying. But I feel that multi-threading is sometimes not enough and that we also need multi-processing to increase throughput further, that is, several instances of the same model residing on the GPU and receiving inputs from a load balancer.
I can see that Triton allows this: "The Triton architecture allows multiple models and/or multiple instances of the same model to execute in parallel on the same system." I wonder whether we can do the same here.

From what I understand, to do that in BentoML I would have to create several services with different URIs that point to the same model and set up the load balancing myself.

AncientRemember Apr 18, 2023

If you want, just mount BentoML on FastAPI!
You can also deploy it distributed behind a load balancer, as multiple processes or across multiple nodes.

If GPU resources are limited, the GPU is the bottleneck in most cases.

KimSoungRyoul Apr 22, 2023

@hmbui-noze
Hi, long time no see. I think I understand what you want (maybe?).

Old version: when BentoML did not support the Resource Scheduling Strategy

I hope this example is useful to you:

import asyncio
import logging
import typing as t

import bentoml
import numpy as np
from bentoml.io import Image, NumpyNdarray
from numpy.typing import NDArray
from PIL.Image import Image as PILImage

bentoml_logger = logging.getLogger("bentoml")

# Two runners built from the same model; each is spawned as its own process.
mnist_runner1 = bentoml.pytorch.get("pytorch_mnist:latest").to_runner(name="pytorch_mnist1")  # spawned process 1
mnist_runner2 = bentoml.pytorch.get("pytorch_mnist:latest").to_runner(name="pytorch_mnist2")  # spawned process 2

svc = bentoml.Service(name="pytorch_mnist_demo", runners=[mnist_runner1, mnist_runner2])


def to_numpy(tensor):
    return tensor.detach().cpu().numpy()

@svc.api(input=Image(), output=NumpyNdarray(dtype="int64"))
async def predict_image(f: PILImage) -> NDArray[t.Any]:
    arr = np.array(f)  # PIL image -> array; apply the preprocessing your model expects
    arr = np.expand_dims(arr, (0, 1)).astype("float32")

    # Write code to distribute the input here.

    # Run both runners concurrently. More precisely, this is non-blocking
    # execution that looks like parallelism; the model determines whether
    # the work can actually be parallelized.
    results = await asyncio.gather(
        mnist_runner1.async_run(arr),  # set to occupy GPU[0]
        mnist_runner2.async_run(arr),  # set to occupy GPU[1]
    )

    runner1_result, runner2_result = results  # [tensor([5]), tensor([5])]

    bentoml_logger.info(f"runner1 result: {to_numpy(runner1_result)[0]}")
    bentoml_logger.info(f"runner2 result: {to_numpy(runner2_result)[0]}")

    return to_numpy(runner1_result)
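
For the "distribute input" step in the handler above, one simple option is round-robin dispatch, where each request goes to the next runner in turn instead of to both. This is only a sketch of the idea: `FakeRunner` and `RoundRobin` are hypothetical names, and `FakeRunner` merely imitates the `async_run` method so the example runs without BentoML or a GPU.

```python
import asyncio
import itertools


class FakeRunner:
    """Hypothetical stand-in for a BentoML runner; only mimics async_run."""

    def __init__(self, name: str):
        self.name = name
        self.calls = 0

    async def async_run(self, batch):
        self.calls += 1
        await asyncio.sleep(0)  # yield control, as a real remote call would
        return [f"{self.name}:{item}" for item in batch]


class RoundRobin:
    """Send each request to the next runner in turn."""

    def __init__(self, runners):
        self._cycle = itertools.cycle(runners)

    async def run(self, batch):
        runner = next(self._cycle)
        return await runner.async_run(batch)


async def main():
    runners = [FakeRunner("runner1"), FakeRunner("runner2")]
    dispatcher = RoundRobin(runners)
    # Four concurrent requests alternate between the two runners.
    results = await asyncio.gather(*(dispatcher.run([i]) for i in range(4)))
    return runners, results


runners, results = asyncio.run(main())
print(results)
```

In a real service you would presumably build the dispatcher once at module level from `mnist_runner1` and `mnist_runner2` and call `await dispatcher.run(arr)` inside `predict_image`.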

Current version: BentoML supports the Resource Scheduling Strategy

You can manage the runner process count in bentoml_configuration.yaml.

See https://docs.bentoml.com/en/latest/guides/scheduling.html for more detail.
