Adding support for a custom inference service #905
-
Hi @andrewsi-z, currently there isn't a document on how to add a new custom inference backend, and we would really love to add a more generic API for users to integrate BentoML with their custom inference backends. For inference backends that provide a Python API, this is relatively easy to do via the BentoArtifact abstraction. But for inference backends that run as a separate service, the biggest blocker right now is that BentoML needs a service startup callback hook that lets the user launch their model service backend processes.

However, if your custom inference backend is deployed separately (running in a separate container or on another machine), you can always invoke an RPC/HTTP call from your BentoService API callback function to access it, and your workload will still benefit from the micro-batching framework that the BentoML API server provides (see the sketch below).

If you are unsure about your approach, I'd be happy to take a look at your design and discuss it here or over a call. This is a super interesting use case that I think many others in the community would be interested in, and any contributions around it are definitely welcome!
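For example, a minimal sketch of that HTTP-proxy pattern with the pre-1.0 BentoService API might look like the following (the backend URL, payload shape, and response format are assumptions for illustration, and decorator/adapter names vary slightly across 0.x releases):

```python
# Minimal sketch (not an official BentoML example): a BentoService API that
# forwards micro-batched requests to a separately deployed inference backend
# over HTTP. The backend URL, payload shape, and response format are
# assumptions specific to this sketch.
import bentoml
import requests
from bentoml.adapters import JsonInput

# Assumed address of the custom inference backend running in another
# container or on another machine.
INFERENCE_BACKEND_URL = "http://custom-inference-backend:8080/predict"


@bentoml.env(pip_packages=["requests"])
class RemoteBackendService(bentoml.BentoService):

    @bentoml.api(input=JsonInput(), batch=True)
    def predict(self, json_inputs):
        # With batch=True, BentoML's API server collects concurrent requests
        # into json_inputs (a list of parsed JSON payloads), so one HTTP
        # round trip to the backend can serve many client requests.
        response = requests.post(INFERENCE_BACKEND_URL, json=json_inputs)
        response.raise_for_status()
        # Assumes the backend returns one prediction per input, in order.
        return response.json()
```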
-
@parano I am revisiting this as we are interested in getting BentoML working with an ONNX-MLIR based backend. ONNX-MLIR is a compiler built on LLVM that takes an ONNX model as input and generates a .so library, exposed to Python through a pybind-based front end. I have an early implementation that I am starting to test here. I am curious whether the project would be willing to accept a custom artifact that supports an ONNX-MLIR .so, and what thoughts you might have around this. This is obviously a somewhat different approach than a full runtime, but it has some very interesting capabilities and the output it produces is quite streamlined.
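To give a sense of what I have in mind, a rough sketch of such an artifact under the pre-1.0 artifact abstraction might look like the code below. The import path follows BentoML 0.13-era conventions, and the `PyRuntime.ExecutionSession` usage is an assumption whose constructor signature varies between ONNX-MLIR releases:

```python
# Rough sketch (assumptions, not a reviewed implementation): a custom artifact
# that packs an ONNX-MLIR compiled .so and loads it through the PyRuntime
# bindings shipped with ONNX-MLIR.
import os
import shutil

from bentoml.service.artifacts import BentoServiceArtifact


class OnnxMlirModelArtifact(BentoServiceArtifact):
    """Stores the compiled model .so alongside the BentoService bundle."""

    def __init__(self, name):
        super().__init__(name)
        self._model_so_path = None
        self._session = None

    def _file_path(self, base_path):
        return os.path.join(base_path, self.name + ".so")

    def pack(self, model_so_path, metadata=None):
        # model_so_path: path to the .so produced by onnx-mlir for the model.
        self._model_so_path = model_so_path
        return self

    def save(self, dst):
        # Copy the compiled library into the saved bundle directory.
        shutil.copyfile(self._model_so_path, self._file_path(dst))

    def load(self, path):
        return self.pack(self._file_path(path))

    def get(self):
        if self._session is None:
            # PyRuntime is the pybind front end generated by ONNX-MLIR; older
            # releases also required the entry point name, e.g.
            # ExecutionSession(path, "run_main_graph").
            from PyRuntime import ExecutionSession

            self._session = ExecutionSession(self._model_so_path)
        return self._session
```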
-
Hello!
We are in the very early stages of exploring porting BentoML for use on our platform.
One of our requirements is support for a custom inference backend (based on ONNX-MLIR). Is there any guidance that outlines what is involved in adding support for a new inference service? (I did look, but didn't quite find anything.) If so, it would help speed up our exploration efforts. If no such reference exists and we go forward, we will be happy to write something up and contribute.
Thank you