Mlserver example #1110
base: main
**Example README** (new file, +75 lines):

# **Step 1: Installation**

Install DeepSparse and MLServer.

```bash
pip install -r requirements.txt
```

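Not part of the PR, but a quick way to confirm the install before moving on; this one-liner assumes both packages expose a `__version__` attribute:

```bash
python3 -c "import mlserver, deepsparse; print(mlserver.__version__, deepsparse.__version__)"
```
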
# **Step 2: Write Custom Runtime**

We need to write a [Custom Inference Runtime](https://mlserver.readthedocs.io/en/stable/user-guide/custom.html) to use DeepSparse within MLServer.

### Implement `load()` and `predict()`

First, we implement the `load()` and `predict()` methods in `models/text-classification-model/models.py`. Note that your implementation of `load()` and `predict()` will vary by the task that you choose.

Here's an example for text classification:

```python
from mlserver import MLModel
from mlserver.codecs import decode_args
from typing import List
from deepsparse import Pipeline


class DeepSparseRuntime(MLModel):
    async def load(self) -> bool:
        # compiles the pipeline
        self._pipeline = Pipeline.create(
            task=self._settings.parameters.task,                        # from model-settings.json
            model_path=self._settings.parameters.model_path,            # from model-settings.json
            batch_size=self._settings.parameters.batch_size,            # from model-settings.json
            sequence_length=self._settings.parameters.sequence_length,  # from model-settings.json
        )
        return True

    @decode_args
    async def predict(self, sequences: List[str]) -> List[str]:
        # runs the inference
        prediction = self._pipeline(sequences=sequences)
        return prediction.labels
```

### Create `model-settings.json`

Second, we create a config at `models/text-classification-model/model-settings.json`. In this file, we specify the location of the custom runtime implementation as well as the parameters of the DeepSparse inference session.

```json
{
    "name": "text-classification-model",
    "implementation": "models.DeepSparseRuntime",
    "parameters": {
        "task": "text-classification",
        "model_path": "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none",
        "batch_size": 1,
        "sequence_length": 128
    }
}
```

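Putting the pieces together, this is the layout implied by the paths in this example (a sketch assembled from the commands above, not a listing taken from the PR):

```
.
├── requirements.txt
├── client.py                        # sample request script (Step 4)
└── models/
    └── text-classification-model/
        ├── models.py                # DeepSparseRuntime implementation
        └── model-settings.json      # runtime location + pipeline parameters
```
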
# **Step 3: Launch MLServer**

Launch the server with the CLI:

```bash
mlserver start ./models/text-classification-model/
```

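Not in the PR, but once the server is up you can sanity-check readiness before sending traffic; this assumes MLServer's default HTTP port (8080) and the standard V2 health endpoints:

```bash
# server readiness
curl -s http://localhost:8080/v2/health/ready
# model readiness
curl -s http://localhost:8080/v2/models/text-classification-model/ready
```
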
# **Step 4: Send Inference Requests**

Now, an inference endpoint is exposed at `http://localhost:8080/v2/models/text-classification-model/infer`. `client.py` is a sample script for requesting the endpoint.

Run the following:

```bash
python3 client.py
```

**`client.py`** (new file, +27 lines):

```python
import requests, threading

NUM_THREADS = 2
URL = "http://localhost:8080/v2/models/text-classification-model/infer"
sentences = ["I hate using GPUs for inference", "I love using DeepSparse on CPUs"] * 100


def tfunc(text):
    # build a V2 inference request for a single sequence
    inference_request = {
        "inputs": [
            {
                "name": "sequences",
                "shape": [1],
                "datatype": "BYTES",
                "data": [text],
            },
        ]
    }
    # POST to the MLServer endpoint and print each returned output
    resp = requests.post(URL, json=inference_request).json()
    for output in resp["outputs"]:
        print(output["data"])


threads = [threading.Thread(target=tfunc, args=(sentence,)) for sentence in sentences[:NUM_THREADS]]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Review comments:

- On the imports: would suggest a few inline comments for self-documentation.
- On the `sentences` list: "Should …"; @rsnm2 see suggestion below.
- On `def tfunc(text)`: would rename to something more descriptive, like …
- On lines +19 to +20 (printing inside the thread): executing a list printout while multithreaded may cause a race condition; any reason to not return the value and print in sequence at the end? (i.e. if thread 1 and thread 2 happen to execute at exactly the same time, they will print their lines at the same time and you might not tell which is which.)
- On lines +23 to +27 (thread creation and joining): it looks like this creates …; you can do this out of the box with … (suggested change).

**`model-settings.json`** (new file, +10 lines):

```json
{
    "name": "text-classification-model",
    "implementation": "models.DeepSparseRuntime",
    "parameters": {
        "task": "text-classification",
        "model_path": "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none",
        "batch_size": 1,
        "sequence_length": 128
    }
}
```

**`models.py`** (new file, +19 lines):

```python
# serve with: mlserver start ./models/text-classification-model/
from mlserver import MLModel
from mlserver.codecs import decode_args
from typing import List
from deepsparse import Pipeline


class DeepSparseRuntime(MLModel):
    async def load(self) -> bool:
        self._pipeline = Pipeline.create(
            task=self._settings.parameters.task,
            model_path=self._settings.parameters.model_path,
            batch_size=self._settings.parameters.batch_size,
            sequence_length=self._settings.parameters.sequence_length,
        )
        return True

    @decode_args
    async def predict(self, sequences: List[str]) -> List[str]:
        prediction = self._pipeline(sequences=sequences)
        return prediction.labels
```

Review comments:

- On the imports: this is great, love that it works out of the box - let's throw in the serving command as a comment just for convenience.
- On the `Pipeline.create(...)` arguments: is there a place for generic kwargs in the settings? Would be cool if we could use that instead to dump extra pipeline args so we can get full generic pipeline support out of the box.

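On the generic-kwargs question: the runtime already reads non-standard fields (`task`, `model_path`, ...) off `self._settings.parameters`, so one option could be a single extra dict that gets splatted into `Pipeline.create()`. The `pipeline_kwargs` field below is hypothetical, not an existing MLServer or DeepSparse convention; a sketch under that assumption:

```python
from typing import List

from deepsparse import Pipeline
from mlserver import MLModel
from mlserver.codecs import decode_args


class DeepSparseRuntime(MLModel):
    async def load(self) -> bool:
        params = self._settings.parameters
        # "pipeline_kwargs" would be an extra dict in model-settings.json, e.g.
        # "parameters": {"task": "...", "model_path": "...", "pipeline_kwargs": {"sequence_length": 128}}
        extra_kwargs = getattr(params, "pipeline_kwargs", None) or {}
        self._pipeline = Pipeline.create(
            task=params.task,
            model_path=params.model_path,
            **extra_kwargs,  # forward any additional pipeline arguments generically
        )
        return True

    @decode_args
    async def predict(self, sequences: List[str]) -> List[str]:
        prediction = self._pipeline(sequences=sequences)
        return prediction.labels
```
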
**`requirements.txt`** (new file, +2 lines):

```
mlserver
deepsparse[transformers]
```

Review comment on the example as a whole: best to add an intro paragraph to give users a heads up of what this example does.
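One possible intro paragraph for the README, drafted only from what the example already covers (the wording is a suggestion, not part of the PR):

```markdown
This example shows how to serve a sparse text-classification model with
[MLServer](https://mlserver.readthedocs.io/) using a custom inference runtime
backed by a DeepSparse `Pipeline`. It walks through installing the
dependencies, implementing the runtime, configuring `model-settings.json`,
launching the server, and sending requests to the V2 inference endpoint with
`client.py`.
```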