Triton provides Prometheus metrics that report GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. The metrics are only available by accessing the endpoint; they are not pushed or published to any remote server. The metrics are exposed in Prometheus plain-text format, so you can view them directly, for example:
```
$ curl localhost:8002/metrics
```
The tritonserver `--allow-metrics=false` option can be used to disable all metric reporting, and `--allow-gpu-metrics=false` can be used to disable just the GPU Utilization and GPU Memory metrics. The `--metrics-port` option can be used to select a different port.
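The endpoint can also be read programmatically with any HTTP client. The minimal sketch below assumes the default metrics port (8002) and filters for GPU utilization samples; the `nv_gpu_utilization` metric name matches recent Triton releases, but verify the exact names against your own /metrics output.

```python
import urllib.request

# Metrics endpoint of a local Triton instance. Assumes the default port;
# adjust if the server was started with --metrics-port.
METRICS_URL = "http://localhost:8002/metrics"

with urllib.request.urlopen(METRICS_URL) as response:
    metrics_text = response.read().decode("utf-8")

# Print only the GPU utilization samples. "nv_gpu_utilization" is the name
# used by recent Triton releases; confirm it against your own endpoint.
for line in metrics_text.splitlines():
    if line.startswith("nv_gpu_utilization"):
        print(line)
```

Any Prometheus-compatible monitoring stack can scrape the same endpoint on its normal polling interval.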
The following table describes the available metrics.
| Category | Metric | Description | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| | Energy Consumption | GPU energy consumption in joules since Triton started | Per GPU | Per second |
| | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests received by Triton (each request is counted as 1, even if the request contains a batch) | Per model | Per request |
| | Inference Count | Number of inferences performed (a batch of "n" is counted as "n" inferences) | Per model | Per request |
| | Execution Count | Number of inference batch executions (see Count Metrics) | Per model | Per request |
| Latency | Request Time | Cumulative end-to-end inference request handling time | Per model | Per request |
| | Queue Time | Cumulative time requests spend waiting in the scheduling queue | Per model | Per request |
| | Compute Input Time | Cumulative time requests spend processing inference inputs (in the framework backend) | Per model | Per request |
| | Compute Time | Cumulative time requests spend executing the inference model (in the framework backend) | Per model | Per request |
| | Compute Output Time | Cumulative time requests spend processing inference outputs (in the framework backend) | Per model | Per request |
For models that do not support batching, Request Count, Inference Count and Execution Count will be equal, indicating that each inference request is executed separately.
For models that support batching, the count metrics can be interpreted to determine average batch size as Inference Count / Execution Count (a small script that computes this ratio from the metrics endpoint is sketched after the examples). The count metrics are illustrated by the following examples:
- Client sends a single batch-1 inference request. Request Count = 1, Inference Count = 1, Execution Count = 1.
- Client sends a single batch-8 inference request. Request Count = 1, Inference Count = 8, Execution Count = 1.
- Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is not enabled for the model. Request Count = 2, Inference Count = 9, Execution Count = 2.
- Client sends 2 requests: batch-1 and batch-1. Dynamic batcher is enabled for the model and the 2 requests are dynamically batched by the server. Request Count = 2, Inference Count = 2, Execution Count = 1.
- Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is enabled for the model and the 2 requests are dynamically batched by the server. Request Count = 2, Inference Count = 9, Execution Count = 1.
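As a rough sketch of how these counters might be consumed, the script below scrapes the endpoint, sums each metric across its label sets, and reports the average batch size as Inference Count / Execution Count. The metric names used here (`nv_inference_count`, `nv_inference_exec_count`, `nv_inference_request_success`) are those exposed by recent Triton releases for Inference Count, Execution Count, and Request Count; treat them as assumptions and check them against your own /metrics output. For brevity the per-model breakdown is dropped; in practice you would also key the totals by the model label.

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # default port; see --metrics-port

def scrape_totals(url=METRICS_URL):
    """Sum every metric across all of its label sets (per-model breakdown dropped)."""
    totals = {}
    with urllib.request.urlopen(url) as response:
        for line in response.read().decode("utf-8").splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and HELP/TYPE comment lines
            # Prometheus text format: metric_name{labels} value (labels optional)
            match = re.match(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)", line)
            if match:
                name, value = match.group(1), float(match.group(3))
                totals[name] = totals.get(name, 0.0) + value
    return totals

totals = scrape_totals()

# Metric names below match recent Triton releases; confirm against your server.
inferences = totals.get("nv_inference_count", 0.0)          # Inference Count
executions = totals.get("nv_inference_exec_count", 0.0)     # Execution Count
requests = totals.get("nv_inference_request_success", 0.0)  # Request Count

if executions:
    print(f"average batch size: {inferences / executions:.2f}")
if requests:
    print(f"inferences per request: {inferences / requests:.2f}")
```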