feature: CTranslate2 Framework + Adaptive batching for custom runner #4851
-
Feature request

Hi, it would be nice to enable CTranslate2 inference within BentoML (https://github.com/OpenNMT/CTranslate2). This library implements a custom runtime that applies many performance optimization techniques, such as weight quantization, layer fusion, and batch reordering, to accelerate inference and reduce the memory usage of Transformer models on CPU and GPU. For example, for the MarianMT Transformer model the following code is used (https://opennmt.net/CTranslate2/guides/transformers.html#marianmt):
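Reproduced from the linked guide, lightly commented (the `ct2-transformers-converter` step produces the `opus-mt-en-de` directory loaded below):

```python
import ctranslate2
import transformers

# The model is converted ahead of time with:
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de --output_dir opus-mt-en-de
translator = ctranslate2.Translator("opus-mt-en-de")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# CTranslate2 consumes token strings, so tokenize before translating.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

# Detokenize the best hypothesis back into text.
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```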
Maybe this is already possible with a custom runner?

Motivation

CTranslate2 is a C++ and Python library for efficient inference with Transformer models.

Other

No response
Replies: 2 comments
-
Hi, I did manage to get a BentoML service working on my local machine with a custom runner. However, how can I use adaptive batching, since I cannot enable batching for the target model signature during …? The converted model consists of a directory with two files:

The tokenizer is used from a call to … I tried to specify …
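For context, a minimal sketch of a batch-enabled custom runner in BentoML's 1.x runner API, assuming the converted model lives in a local `ct2_model` directory (the class, method, and path names here are illustrative, not from the original post). With a custom `Runnable`, adaptive batching is opted into via the `batchable` flag on the method decorator rather than via the saved model's signatures:

```python
import bentoml
import ctranslate2

class CTranslate2Runnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu", "nvidia.com/gpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        # "ct2_model" is a placeholder for the converted model directory.
        self.translator = ctranslate2.Translator("ct2_model")

    # batchable=True opts this signature into adaptive batching;
    # batch_dim=0 merges concurrent requests along the list axis.
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def translate(self, token_batch: list[list[str]]) -> list[list[str]]:
        results = self.translator.translate_batch(token_batch)
        return [res.hypotheses[0] for res in results]

ct2_runner = bentoml.Runner(CTranslate2Runnable, name="ct2_runner")
svc = bentoml.Service("ct2_translation", runners=[ct2_runner])
```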
-
This is possible, and you can use the new service APIs to build a BentoML service; read the latest docs for how to do it.
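A minimal sketch with the newer service API (BentoML 1.2+), again with placeholder model and tokenizer names; `@bentoml.api(batchable=True)` is what enables adaptive batching on the endpoint:

```python
import bentoml
import ctranslate2
import transformers

@bentoml.service(resources={"cpu": "4"})
class Translation:
    def __init__(self) -> None:
        # Placeholder paths/names; substitute your converted model and tokenizer.
        self.translator = ctranslate2.Translator("ct2_model")
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            "Helsinki-NLP/opus-mt-en-de"
        )

    # batchable=True lets the server group concurrent requests into a
    # single translate_batch call (adaptive batching).
    @bentoml.api(batchable=True)
    def translate(self, texts: list[str]) -> list[str]:
        sources = [
            self.tokenizer.convert_ids_to_tokens(self.tokenizer.encode(t))
            for t in texts
        ]
        results = self.translator.translate_batch(sources)
        return [
            self.tokenizer.decode(
                self.tokenizer.convert_tokens_to_ids(r.hypotheses[0])
            )
            for r in results
        ]
```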