-
I can imagine "packing" two models into my Service. The questions I have are:
-
It's pretty great to see you guys doing more advanced ML serving use cases. Right now, I think it is challenging to do shadow inference well with BentoService.
```python
from bentoml import BentoService, api
from bentoml.adapters import DataframeInput


class MyService(BentoService):

    def prediction_current(self, df):
        return self.models.current_model.predict(df)

    def prediction_shadow(self, df):
        return self.models.shadow_model.predict(df)

    @api(input=DataframeInput(), batch=True)
    def predict(self, df):
        current_prediction = self.prediction_current(df)
        shadow_prediction = self.prediction_shadow(df)
        return [{"result": current_prediction, "shadow_result": shadow_prediction}]
```

One of the challenges of doing shadowing with separate BentoServices is how to join the request logs for comparing prediction results to evaluate the models. Right now each BentoService generates a … Once we implement the feature that API services use …
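If both models are packed into one service like this, one way around the log-joining problem is to emit a single record per request that carries both predictions. Below is a minimal sketch of that idea, not a built-in BentoML feature: the `shadow_prediction_log` logger name and the `request_id` field are my own, the `self.models.<name>` access just mirrors the snippet above (adjust to how your BentoML version exposes packed models), and it assumes numeric model outputs.

```python
import json
import logging
import uuid

from bentoml import BentoService, api
from bentoml.adapters import DataframeInput

# Hypothetical logger name; any sink (a file, stdout, a log shipper) works here.
prediction_logger = logging.getLogger("shadow_prediction_log")


class MyShadowService(BentoService):

    @api(input=DataframeInput(), batch=True)
    def predict(self, df):
        request_id = str(uuid.uuid4())  # one id ties both predictions together

        # Attribute access follows the snippet above; adjust for how your
        # BentoML version exposes packed models.
        current_prediction = self.models.current_model.predict(df)
        shadow_prediction = self.models.shadow_model.predict(df)

        # Both results land in one record, so no cross-service log join is needed.
        prediction_logger.info(json.dumps({
            "request_id": request_id,
            "result": [float(p) for p in current_prediction],        # assumes numeric outputs
            "shadow_result": [float(p) for p in shadow_prediction],
        }))

        # Only the current model's predictions are returned to the caller.
        return [{"result": float(p)} for p in current_prediction]
```

With records like these, comparing current vs. shadow predictions can happen offline without touching the serving path.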
-
That's a great discussion point! Our team operates an API server in front of one serving model and multiple shadow models when we need to compare the performance of the models.
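For context, here is a stripped-down sketch of that setup, not our actual gateway code: the endpoint URLs are hypothetical placeholders, and plain `requests` plus background threads stand in for whatever proxy you already run. The primary model answers the request, while copies are mirrored to the shadow deployments on a best-effort basis.

```python
import threading

import requests

# Hypothetical endpoints; each model would be its own deployment behind the API server.
PRIMARY_URL = "http://primary-model:5000/predict"
SHADOW_URLS = [
    "http://shadow-model-a:5000/predict",
    "http://shadow-model-b:5000/predict",
]


def _mirror(url, payload):
    """Fire-and-forget call to one shadow model; failures must never reach the caller."""
    try:
        requests.post(url, json=payload, timeout=2)
    except requests.RequestException:
        pass  # shadow traffic is best-effort only


def predict(payload):
    """Answer from the primary model and mirror the same payload to every shadow model."""
    for url in SHADOW_URLS:
        threading.Thread(target=_mirror, args=(url, payload), daemon=True).start()
    response = requests.post(PRIMARY_URL, json=payload, timeout=2)
    return response.json()
```

The trade-off is that each shadow service then logs its predictions separately, which is where the log-joining concern above comes from.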
I don't think BentoML has out-of-the-box capability to mark models to run in a low-priority thread. @parano might have a good idea for this; I am going to defer to him.
You can leverage BentoML's logging when you deploy them in the same service.