Hi,

```python
dummy_token = np.zeros((1, 512))
```

My problem is that I cannot find a way for the BentoML Keras runner to serve this model. I do the following:

```python
keras_runner = bentoml.keras.load_runner("same_model:latest")
```

This runs into a problem: the runner interprets the input as `[None, 2, 512]` instead of the required `[[None, 512], [None, 512]]`. I can understand how this happens due to BentoML's adaptive batching, but I have not yet been able to come up with a solution. Are multi-input models just not supported yet, or is there a specific way I can tell BentoML that I need it to do adaptive batching on both inputs?
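To see where the `[None, 2, 512]` shape comes from: when the two `(1, 512)` inputs are passed together as one list, stacking them produces a leading dimension of 2. A minimal numpy sketch of just the shape behavior (illustrative only, not BentoML itself):

```python
import numpy as np

# Two model inputs, each with batch size 1 and sequence length 512.
dummy_token = np.zeros((1, 512))
dummy_mask = np.zeros((1, 512))

# Passing them together as one list and stacking them yields shape
# (2, 1, 512) - the leading 2 is why the runner reports [None, 2, 512]
# rather than two separate [None, 512] inputs.
stacked = np.array([dummy_token, dummy_mask])
print(stacked.shape)  # (2, 1, 512)
```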
Hi @cchrkoc - multi-input models are well supported in BentoML.

If you don't need adaptive batching for your use case, you can save your model again with a `batchable=False` signature (with the new 1.0.0rc1 release):

```python
bentoml.keras.save_model(
    "same_model:latest",
    model_inst,
    signatures={"predict": {"batchable": False}},
)

keras_runner = bentoml.keras.get("same_model:latest").to_runner()
keras_runner.init_local()

dummy_token = np.zeros((1, 512))
dummy_mask = np.zeros((1, 512))
res = keras_runner.predict.run([dummy_token, dummy_mask])
```

It is also possible to do adaptive batching on both inputs. In your case, the input batch dimension should be set to 1, since the first dimension is always 2 (containing the token array and the mask array). You can learn more about model/runner signatures here: https://docs.bentoml.org/en/latest/concepts/model.html#model-signatures

```python
bentoml.keras.save_model(
    "same_model:latest",
    model_inst,
    signatures={"predict": {"batchable": True, "batch_dim": (1, 0)}},
)

keras_runner = bentoml.keras.get("same_model:latest").to_runner()
keras_runner.init_local()

dummy_token = np.zeros((1, 512))
dummy_mask = np.zeros((1, 512))
res = keras_runner.predict.run([dummy_token, dummy_mask])
```

Note that in both cases, the …
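To make `batch_dim=(1, 0)` concrete, here is a hedged numpy sketch of what the adaptive batcher conceptually does with this setting: incoming requests are concatenated along input axis 1 (so the leading token/mask dimension of 2 is preserved), and model outputs are split back per request along axis 0. This is illustrative only, not BentoML's internal implementation:

```python
import numpy as np

# Each request carries [token, mask], i.e. an array of shape (2, batch, 512).
req_a = np.zeros((2, 1, 512))  # a request with batch size 1
req_b = np.ones((2, 2, 512))   # a request with batch size 2

# Input batch dim = 1: requests are joined along axis 1, leaving the
# leading "2" (token/mask) dimension intact.
batched = np.concatenate([req_a, req_b], axis=1)
print(batched.shape)  # (2, 3, 512)

# Output batch dim = 0: the model's combined output is split back per
# request along axis 0, according to each request's batch size.
fake_output = np.zeros((3, 10))  # e.g. 3 rows of logits (hypothetical)
out_a, out_b = np.split(fake_output, [1], axis=0)
print(out_a.shape, out_b.shape)  # (1, 10) (2, 10)
```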