Separation of data acquisition by GPU
The previous template fetched data for all GPUs with one request, which
works only as long as no GPU has a problem. If one GPU fails, the
resulting data error affects all GPUs at once.

This update splits metrics retrieval across individual GPUs, while
still retrieving all metrics for a given GPU with a single query.
giDai7ja committed Jul 15, 2023
1 parent d1ceb37 commit 271085f
Showing 3 changed files with 98 additions and 100 deletions.
40 changes: 21 additions & 19 deletions Server_Hardware/Other/template_nvidia-smi_multigpu/6.0/README.md
@@ -4,11 +4,11 @@

This template is for Zabbix to monitor multiple NVidia GPUs

This template uses only one user parameter, receives all parameters in one request and requires no additional scripts
This template uses two user parameters. One to search for graphic cards and one to retrieve metrics from each card

### Features

* Low-level discovery of all the graphics Nvidia cards
* Low-level discovery of all the graphics NVidia cards
* Prototype items and triggers for the most important parameters
* General status panel

@@ -18,7 +18,7 @@ This template uses only one user parameter, receives all parameters in one reque
* Restart the zabbix-agent
* Import template zbx_NVidia_GPUs.yaml and link this template to the monitored host

This template is set up and tested on a server with nine Nvidia graphics cards. Comments, suggestions and help to improve this template are welcome
This template is set up and tested on a server with nine NVidia graphics cards. Comments, suggestions and help to improve this template are welcome

## Author

@@ -36,29 +36,31 @@ There are no template links in this template.

|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|GPU Data|<p>Data collection by GPUs</p>|`SNMP agent`|gpu.data<p>Update: 1m</p>|
|GPU Card|<p>GPU detection</p>|`Dependent item`|gpu.id|

## Items collected

Common Items
|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|GPU Count|<p>Number of GPUs detected</p>|`Dependent items`|gpu.count|
|GPU Driver Version|<p>GPU driver version</p>|`Dependent items`|gpu.driver_version|
|GPU Power Total|<p>Power consumption of all GPUs</p>|`Dependent items`|gpu.power_total|
|GPUs Maximum Temperature|<p>Temperature of the hottest GPU</p>|`Dependent items`|gpu.temp_max|
|GPU Utilization Total|<p>Total GPU utilisation</p>|`Dependent items`|gpu.utilization_total|

Items for each GPU found
|GPU Discovery|<p>GPU detection data</p>|'Zabbix agent'|gpu.discovery<p>Update: 1h</p>|
|GPU Count|<p>Number of GPUs detected</p>|`Dependent item`|gpu.county|
|GPU Driver Version|<p>GPU driver version</p>|`Dependent item`|gpu.driver_versiony|
|GPU Power Total|<p>Power consumption of all GPUs</p>|`Calculated item`|gpu.power_totaly<p>Update: 1m</p>|
|GPUs Maximum Temperature|<p>Temperature of the hottest GPU</p>|`Calculated item`|gpu.temp_maxy<p>Update: 1m</p>|
|GPU Utilization Total|<p>Total GPU utilisation</p>|`Calculated item`|gpu.utilization_totaly<p>Update: 1m</p>|

Item prototypes for each GPU found
|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|GPU Power|Power consumption of the GPU|`Dependent items`|gpu.power|
|GPU Total Memory|GPU memory capacity|`Dependent items`|gpu.mtotal|
|GPU Used Memory|The amount of GPU memory used|`Dependent items`|gpu.mused|
|GPU Free Memory|Amount of free GPU memory|`Dependent items`|gpu.mfree|
|GPU Utilisation|GPU utilisation|`Dependent items`|gpu.utilization|
|GPU Temperature|GPU Temperature|`Dependent items`|gpu.temperature|
|GPU Fan Speed|GPU Fan Speed|`Dependent items`|gpu.fan|
|GPU {#GPUID} Data|<p>Data collection</p>|'Zabbix agent'|gpu.card[{#GPUID}]<p>Update: 1m</p>|
|GPU {#GPUID} Power - {#NAME}|<p>Power consumption of the GPU</p>|`Dependent items`|gpu.power.[{#GPUID}]|
|GPU {#GPUID} Memory Total - {#NAME}|<p>GPU memory capacity</p>|`Dependent items`|gpu.mtotal.[{#GPUID}]|
|GPU {#GPUID} Memory Used - {#NAME}|<p>The amount of GPU memory used</p>|`Dependent items`|gpu.mused.[{#GPUID}]|
|GPU {#GPUID} Memory Free - {#NAME}|<p>Amount of free GPU memory</p>|`Dependent items`|gpu.mfree.[{#GPUID}]|
|GPU {#GPUID} Utilization - {#NAME}|<p>GPU utilisation</p>|`Dependent items`|gpu.utilization.[{#GPUID}]|
|GPU {#GPUID} Temperature - {#NAME}|<p>GPU Temperature</p>|`Dependent items`|gpu.temperature.[{#GPUID}]|
|GPU {#GPUID} Fan Speed - {#NAME}|<p>GPU Fan Speed</p>|`Dependent items`|gpu.fan.[{#GPUID}]|

## Triggers

@@ -68,4 +70,4 @@ Items for each GPU found
|GPU {#GPUID} Temperature is extremely high|The temperature of the GPU is very high. Possibility of failure|last(/Nvidia Multi-GPU/gpu.temperature.[{#GPUID}])>=80|`High`|
|GPU {#GPUID} Temperature is high|Temperature of the graphics processor is high|<p>last(/Nvidia Multi-GPU/gpu.temperature.[{#GPUID}])>=65</p><p>**Dependencies**: GPU {#GPUID} Temperature is extremely high</p>|`Average`|
|Problem with the fan|Fan does not spin when GPU is hot|last(/Nvidia Multi-GPU/gpu.fan.[{#GPUID}])=0 and last(/Nvidia Multi-GPU/gpu.temperature.[{#GPUID}])>60|`High`|
|Data retrieval error|Problem with data retrieval|nodata(/Nvidia Multi-GPU/gpu.driver_version,3m)=1|`Disaster`|
|Error receiving data for GPU {#GPUID}|Problem with data retrieval|nodata(/Nvidia Multi-GPU/gpu.utilization.[{#GPUID}],3m)=1|`Disaster`|
@@ -1 +1,2 @@
UserParameter=gpu.data,nvidia-smi --query-gpu=gpu_bus_id,temperature.gpu,memory.total,memory.used,memory.free,fan.speed,utilization.gpu,power.draw,name,driver_version --format=csv,noheader,nounits | sed -e 's/, /,/g'
UserParameter=gpu.discovery,nvidia-smi --query-gpu=gpu_bus_id,name,driver_version --format=csv,noheader,nounits | sed -e 's/, /,/g'
UserParameter=gpu.card[*],nvidia-smi --query-gpu=temperature.gpu,memory.total,memory.used,memory.free,fan.speed,utilization.gpu,power.draw --format=csv,noheader,nounits -i $1 | sed -e 's/, /,/g'
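
The `sed` filter shared by both user parameters can be checked without a GPU. The following is a minimal sketch, not part of the commit: the sample line is invented to resemble `nvidia-smi --format=csv,noheader,nounits` output for the seven fields queried by `gpu.card[*]`, and `normalize_csv` is a hypothetical helper name. The field extraction with `cut` only illustrates what a dependent item would need to do; the template's actual preprocessing steps are not shown in this diff.

```shell
#!/bin/sh
# Hypothetical sample resembling one line of nvidia-smi CSV output, in the
# field order used by gpu.card[*]: temperature, memory total/used/free,
# fan speed, utilization, power draw. The values are invented.
sample='45, 24576, 1024, 23552, 30, 12, 75.50'

# Same substitution as in the UserParameter lines: drop the space that
# follows each comma so the fields split cleanly on ",".
normalize_csv() {
    sed -e 's/, /,/g'
}

normalized=$(printf '%s\n' "$sample" | normalize_csv)
printf '%s\n' "$normalized"

# Pick single fields by position (1-based), as an illustration only.
temperature=$(printf '%s\n' "$normalized" | cut -d, -f1)
power=$(printf '%s\n' "$normalized" | cut -d, -f7)
printf 'temperature=%s power=%s\n' "$temperature" "$power"
```

On a host where the agent already has these user parameters loaded, `zabbix_agentd -t 'gpu.card[0]'` should return a line of the same shape, assuming `0` is a valid `-i` argument for `nvidia-smi` on that machine.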