The Triton Inference Server is available as buildable source code, but the easiest way to install and run Triton is to use the pre-built Docker image available from the NVIDIA GPU Cloud (NGC).
Before you can use the Triton Docker image, you must install Docker. If you plan on using a GPU for inference, you must also install the NVIDIA Container Toolkit. DGX users should follow the instructions in Preparing to use NVIDIA Containers.
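To confirm that Docker and the NVIDIA Container Toolkit can see your GPU(s), you can run nvidia-smi inside a throwaway container before pulling Triton. This is only an optional sanity check; the ubuntu image used here is an arbitrary example, since the toolkit injects nvidia-smi into any container started with the --gpus flag.
$ docker --version
$ docker run --rm --gpus=all ubuntu nvidia-smi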
Pull the image using the following command.
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
Where <xx.yy> is the version of Triton that you want to pull.
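To confirm that the pull succeeded, you can list the Triton images now available locally (an optional check):
$ docker images nvcr.io/nvidia/tritonserver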
The model repository is the directory where you place the models that you want Triton to serve. An example model repository is included in docs/examples/model_repository. Before using the repository, you must fetch any missing model definition files from their public model zoos by running the provided script.
$ cd docs/examples
$ ./fetch_models.sh
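After the script completes, the example repository follows Triton's standard layout: one directory per model containing a config.pbtxt and a numbered version subdirectory holding the model file. The sketch below shows the densenet_onnx model used later in this guide; the exact set of example models and files may vary between releases.
model_repository/
  densenet_onnx/
    config.pbtxt
    1/
      model.onnx
  ...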
Triton is optimized to provide the best inferencing performance by using GPUs, but it can also work on CPU-only systems. In both cases you can use the same Triton Docker image.
Use the following command to run Triton with the example model repository you just created. The NVIDIA Container Toolkit must be installed for Docker to recognize the GPU(s). The --gpus=1 flag indicates that 1 system GPU should be made available to Triton for inferencing.
$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
Where <xx.yy> is the version of Triton that you want to use (and that you pulled above). After you start Triton you will see output on the console showing the server starting up and loading the models. When you see output like the following, Triton is ready to accept inference requests.
+----------------------+---------+--------+
| Model | Version | Status |
+----------------------+---------+--------+
| <model_name> | <v> | READY |
| .. | . | .. |
| .. | . | .. |
+----------------------+---------+--------+
...
...
...
I1002 21:58:57.891440 62 grpc_server.cc:3914] Started GRPCInferenceService at 0.0.0.0:8001
I1002 21:58:57.893177 62 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000
I1002 21:58:57.935518 62 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
All the models should show "READY" status to indicate that they loaded correctly. If a model fails to load, the status will report the failure and a reason for the failure. If your model is not displayed in the table, check the path to the model repository and your CUDA drivers.
On a system without GPUs, run Triton without the --gpus flag to Docker; the command is otherwise identical to the one described above.
$ docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
Because the --gpus flag is not used, a GPU is not available and Triton will therefore be unable to load any model configuration that requires a GPU.
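If a model in the example repository is configured for GPU execution, you can try forcing it onto the CPU by editing the instance_group setting in its config.pbtxt. The snippet below is a minimal sketch of such a setting; whether a given model actually supports CPU execution depends on its backend.
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]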
Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.
$ curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
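You can check an individual model and the metrics endpoint in the same way. The densenet_onnx name below matches the example model used later in this guide; substitute your own model name as needed.
$ curl -v localhost:8000/v2/models/densenet_onnx/ready
$ curl localhost:8002/metrics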
Use docker pull to get the client libraries and examples image from NGC.
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
Where <xx.yy> is the version that you want to pull. Run the client image.
$ docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
From within the nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk image, run the example image_client application to perform image classification using the example densenet_onnx model.
To send a request for the densenet_onnx model, use an image from the /workspace/images directory. In this case we ask for the top 3 classifications.
$ /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
15.346230 (504) = COFFEE MUG
13.224326 (968) = CUP
10.422965 (505) = COFFEEPOT
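The example client can also send the request over gRPC instead of HTTP. The invocation below is a sketch that assumes the image_client build in the SDK image accepts -i (protocol) and -u (server URL) options; the option names may differ between releases, so check the client's usage output if it does not accept them.
$ /workspace/install/bin/image_client -i grpc -u localhost:8001 -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg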