Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Binary tensor data extension docs #419

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
274 changes: 274 additions & 0 deletions docs/modelserving/data_plane/binary_tensor_data_extension.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,274 @@
# Binary Tensor Data Extension

The Binary Tensor Data Extension allows clients to send and receive tensor data in a binary format in
the body of an HTTP/REST request. This extension is particularly useful for sending and receiving FP16 data as
there is no specific data type for a 16-bit float type in the Open Inference Protocol and large tensors
for high-throughput scenarios.

## Overview

Tensor data represented as binary data is organized in little-endian byte order, row major, without stride or
padding between elements. All tensor data types are representable as binary data in the native size of the data type.
For BOOL type element true is a single byte with value 1 and false is a single byte with value 0.
For BYTES type an element is represented by a 4-byte unsigned integer giving the length followed by the actual bytes.
The binary data for a tensor is delivered in the HTTP body after the JSON object (see Examples).

The binary tensor data extension uses parameters to indicate that an input or output tensor is communicated as binary data.

The `binary_data_size` parameter is used in `$request_input` and `$response_output` to indicate that the input or output tensor is communicated as binary data:

- "binary_data_size" : int64 parameter indicating the size of the tensor binary data, in bytes.

The `binary_data` parameter is used in `$request_output` to indicate that the output should be returned from KServe runtime
as binary data.

- "binary_data" : bool parameter that is true if the output should be returned as binary data and false (or not given) if the
tensor should be returned as JSON.

The `binary_data_output` parameter is used in `$inference_request` to indicate that all outputs should be returned from KServe runtime as binary data, unless overridden by "binary_data" on a specific output.

- "binary_data_output" : bool parameter that is true if all outputs should be returned as binary data and false
(or not given) if the outputs should be returned as JSON. If "binary_data" is specified on an output it overrides this setting.

When one or more tensors are communicated as binary data, the HTTP body of the request or response
will contain the JSON inference request or response object followed by the binary tensor data in the same order as the
order of the input or output tensors are specified in the JSON.

- If any binary data is present in the request or response the `Inference-Header-Content-Length` header must be provided to
give the length of the JSON object, and Content-Length continues to give the full body length (as HTTP requires).


## Examples

### Sending and Receiving Binary Data

For the following request the input tensors `input0` and `input2` are sent as binary data while `input1` is sent as non-binary data. Note that the `input0` and `input2` input tensors have a parameter `binary_data_size` which represents the size of the binary data.

The output tensor `output0` must be returned as binary data as that is what is requested by setting the `binary_data` parameter to true. Also note that the size of the JSON part is provided in the `Inference-Header-Content-Length` and the total size of the binary data is reflected in the `Content-Length` header.

```shell
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/octet-stream
Inference-Header-Content-Length: <xx> # Json length
Content-Length: <xx+19> # Json length + binary data length (In this case 16 + 3 = 19)
{
"model_name" : "mymodel",
"inputs" : [
{
"name" : "input0",
"shape" : [ 2, 2 ],
"datatype" : "FP16",
"parameters" : {
"binary_data_size" : 16
}
},
{
"name" : "input1",
"shape" : [ 2, 2 ],
"datatype" : "UINT32",
"data": [[1, 2], [3, 4]]
},
{
"name" : "input2",
"shape" : [ 3 ],
"datatype" : "BOOL",
"parameters" : {
"binary_data_size" : 3
}
}
],
"outputs" : [
{
"name" : "output0",
"parameters" : {
"binary_data" : true
}
},
{
"name" : "output1"
}
]
}
<16 bytes of data for input0 tensor>
<3 bytes of data for input2 tensor>
```

Assuming the model returns a [ 3, 2 ] tensor of data type FP16 and a [2, 2] tensor of data type FP32 the following response would be returned.

```shell
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Inference-Header-Content-Length: <yy> # Json length
Content-Length: <yy+16> # Json length + binary data length (In this case 16)
{
"outputs" : [
{
"name" : "output0",
"shape" : [ 3, 2 ],
"datatype" : "FP16",
"parameters" : {
"binary_data_size" : 16
}
},
{
"name" : "output1",
"shape" : [ 2, 2 ],
"datatype" : "FP32",
"data" : [[1.203, 5.403], [3.434, 34.234]]
}
]
}
<16 bytes of data for output0 tensor>
```

=== "Inference Client Example"

```python
from kserve import ModelServer, InferenceRESTClient, InferRequest, InferInput
from kserve.protocol.infer_type import RequestedOutput
from kserve.inference_client import RESTConfig

fp16_data = np.array([[1.1, 2.22], [3.345, 4.34343]], dtype=np.float16)
uint32_data = np.array([[1, 2], [3, 4]], dtype=np.uint32)
bool_data = np.array([True, False, True], dtype=np.bool)

# Create input tensor with binary data
input_0 = InferInput(name="input_0", datatype="FP16", shape=[2, 2])
input_0.set_data_from_numpy(fp16_data, binary_data=True)
input_1 = InferInput(name="input_1", datatype="UINT32", shape=[2, 2])
input_1.set_data_from_numpy(uint32_data, binary_data=False)
input_2 = InferInput(name="input_2", datatype="BOOL", shape=[3])
input_2.set_data_from_numpy(bool_data, binary_data=True)

# Create request output
output_0 = RequestedOutput(name="output_0", binary_data=True)
output_1 = RequestedOutput(name="output_1", binary_data=False)

# Create inference request
infer_request = InferRequest(
model_name="mymodel",
request_id="2ja0ls9j1309",
infer_inputs=[input_0, input_1, input_2],
requested_outputs=[output_0, output_1],
)

# Create the REST client
config = RESTConfig(verbose=True, protocol="v2")
rest_client = InferenceRESTClient(config=config)

# Send the request
infer_response = await rest_client.infer(
"http://localhost:8000",
model_name="TestModel",
data=infer_request,
headers={"Host": "test-server.com"},
timeout=2,
)

# Read the binary data from the response
output_0 = infer_response.outputs[0]
fp16_output = output_0.as_numpy()

# Read the non-binary data from the response
output_1 = infer_response.outputs[1]
fp32_output = output_1.data # This will return the data as a list
fp32_output_arr = output_1.as_numpy()
```

### Requesting All The Outputs To Be In Binary Format

For the following request, `binary_data_output` is set to true to receive all the outputs as binary data. Note that the
`binary_data_output` is set in the `$inference_request` parameters field, not in the `$inference_input` parameters field. This parameter can be overridden for a specific output by setting `binary_data` parameter to false in the `$request_output`.

```shell
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: 75
{
"model_name": "my_model",
"inputs": [
{
"name": "input_tensor",
"datatype": "FP32",
"shape": [1, 2],
"data": [[32.045, 399.043]],
}
],
"parameters": {
"binary_data_output": true
}
}
```
Assuming the model returns a [ 3, 2 ] tensor of data type FP16 and a [2, 2] tensor of data type FP32 the following response would be returned.

```shell
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Inference-Header-Content-Length: <yy> # Json length
Content-Length: <yy+48> # Json length + binary data length (In this case 16 + 32)
{
"outputs" : [
{
"name" : "output_tensor0",
"shape" : [ 3, 2 ],
"datatype" : "FP16",
"parameters" : {
"binary_data_size" : 16
}
},
{
"name" : "output_tensor1",
"shape" : [ 2, 2 ],
"datatype" : "FP32",
"parameters": {
"binary_data_size": 32
}
}
]
}
<16 bytes of data for output_tensor0 tensor>
<32 bytes of data for output_tensor1 tensor>
```

=== "Inference Client Example"

```python
from kserve import ModelServer, InferenceRESTClient, InferRequest, InferInput
from kserve.protocol.infer_type import RequestedOutput
from kserve.inference_client import RESTConfig

fp32_data = np.array([[32.045, 399.043]], dtype=np.float32)

# Create the input tensor
input_0 = InferInput(name="input_0", datatype="FP32", shape=[1, 2])
input_0.set_data_from_numpy(fp16_data, binary_data=False)

# Create inference request with binary_data_output set to True
infer_request = InferRequest(
model_name="mymodel",
request_id="2ja0ls9j1309",
infer_inputs=[input_0],
parameters={"binary_data_output": True}
)

# Create the REST client
config = RESTConfig(verbose=True, protocol="v2")
rest_client = InferenceRESTClient(config=config)

# Send the request
infer_response = await rest_client.infer(
"http://localhost:8000",
model_name="TestModel",
data=infer_request,
headers={"Host": "test-server.com"},
timeout=2,
)

# Read the binary data from the response
output_0 = infer_response.outputs[0]
fp16_output = output_0.as_numpy()
output_1 = infer_response.outputs[1]
fp32_output_arr = output_1.as_numpy()
```
2 changes: 2 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ nav:
- Model Serving Data Plane: modelserving/data_plane/data_plane.md
- V1 Inference Protocol: modelserving/data_plane/v1_protocol.md
- Open Inference Protocol (V2 Inference Protocol): modelserving/data_plane/v2_protocol.md
- Open Inference Protocol Extensions:
- Binary Tensor Data Extension: modelserving/data_plane/binary_tensor_data_extension.md
- Serving Runtimes: modelserving/servingruntimes.md
- Model Serving Runtimes:
- Supported Model Frameworks/Formats:
Expand Down