We fine-tune an entailment (NLI) model with 12 million parameters and use it in an intent classification service.
The core classifier service is located in the `server` folder.
The preferred way to run the service is in a container, using your favorite container tool.
An example invocation to run the service locally is in the `justfile` and, if you have the `just` tool installed, can be run with something like

```sh
# modify container_tool variable if you use Docker
just serve 8080
```
Alternatively, the script `server/server.py` can be run directly from an appropriate Python environment (it can be set up from `server/requirements.txt`).
This should also pick up the GPU device by default.
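A minimal sketch of the direct route (the virtual environment layout and the argument-less invocation of `server.py` are assumptions):

```sh
# create and activate an isolated environment (optional, assumed layout)
python -m venv .venv && . .venv/bin/activate
pip install -r server/requirements.txt

# run the service directly from the repository root
python server/server.py
```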
The demo version of the classifier is deployed to my personal cluster at intents.cluster.megaver.se. It's an economically built cluster, so the performance isn't great, but you can test it with a request tool of your choice.
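For instance, with `curl` (a sketch only: the JSON POST body with a `text` key follows the endpoint description below, while the HTTPS scheme is an assumption):

```sh
# hypothetical query against the demo deployment; assumes a JSON POST body on /predict
curl -s -X POST https://intents.cluster.megaver.se/predict \
  -H 'Content-Type: application/json' \
  -d '{"text": "show me morning flights from boston to denver"}'
```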
The Kubernetes manifest for deploying the service is shown at docs.cluster.megaver.se.
There is a benchmarking client provided in the `client` folder.
It computes the average accuracy and F1 scores for the label classes.
You can invoke it on the test part of the ATIS dataset with

```sh
# pip install -r client/requirements.txt -- if necessary
just benchmark
```
The following endpoints are implemented.

- The query is provided as the `text` key.
- You get back an array of the most probable intents (maximum 3).
- The `/predict` endpoint also accepts the `requested_model` key, which can select a specific model (see the example request after this list).
- Several models can be specified as an argument or using the `MODEL` environment variable. The first model is the default one.
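A hypothetical `/predict` request against a locally running instance (the port comes from the `just serve 8080` example above; the JSON POST format and the placeholder model name are assumptions, not the documented request schema):

```sh
# hypothetical request; replace <model-name> with one of the models the service reports
curl -s -X POST http://localhost:8080/predict \
  -H 'Content-Type: application/json' \
  -d '{"text": "how many flights does united have into denver", "requested_model": "<model-name>"}'
```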
A readiness endpoint returns the string `OK` with code 200 when all the models are ready for inference.
Another endpoint reports information about the service, such as its version (when packaging with the Docker image workflow it is derived from a tag name) and the available models.
We run the local ATIS benchmark (`just benchmark`) as discussed above.
The model obtains an accuracy of almost 98% on the test data. Among the 18 model errors we have

- 2 rows with the unknown label `day_name`
- 1 row that seems to be correctly classified by our model as `airfare`
- 1 row that seems to be correctly classified by our model as `flight+airfare`
- 5 rows that seem to be correctly classified by our model as `quantity`
- 6 rows where `flight` and `flight+airfare` are mixed up
- 1 row similarly about `flight_no`
- 1 cut-off phrase
- 1 genuine mistake (`airport` instead of a `flight`)
Overall, we are quite happy with the model's classification performance.
The service in the cluster isn't able to handle the `--jobs 64` parallelism used for the local test above, so we use `--jobs 3` to test it.

We have 1 failure, but otherwise the classification results are, as expected, the same. Our service has an average request time of 0.5 seconds and a throughput of 6 requests per second, which is consistent with 3 concurrent requests taking 0.5 seconds each.
One could improve the throughput by performing inference on a GPU instance (we get about 100 requests per second on an NVIDIA 3090 card even with a naive implementation) and by batching requests using a message queue.
See docs/README.md to learn how the model was selected and fine-tuned.
See CONTRIBUTING.md for the code style and workflow information.