Slurm Monitor

Introduction

By default, DeepOps deploys a monitoring stack alongside Slurm. This can be disabled by setting slurm_enable_monitoring to false.
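If you need to redeploy Slurm without the monitoring stack, the variable can also be overridden on the command line. The following is a minimal sketch; the playbook path and host limit assume a standard DeepOps checkout and may differ in your deployment.

# Illustrative invocation: re-run the Slurm playbook with monitoring disabled
$ ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml -e '{"slurm_enable_monitoring": false}'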

DeepOps runs a dcgm-exporter container on all DGX nodes. This container exports GPU utilization metrics and metadata to a Prometheus database running on the slurm-metric nodes. The slurm-metric nodes also run a Grafana server that connects to the Prometheus database to visualize the data. Ad-hoc metric queries can be made against the Prometheus server, but monitoring is typically done through the dashboards in Grafana.

Exporters

A node-exporter and dcgm-exporter should be running on every node listed under slurm-node:

# View the DCGM Exporter container on a DGX node
$ sudo docker ps
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS
PORTS                    NAMES
139fe402640f        quay.io/prometheus/node-exporter   "/bin/node_exporter …"   56 minutes ago      Up 56 minutes                                docker.node-exporter.service
0da9e3f1a7c8        nvidia/dcgm-exporter               "/usr/bin/dcgm-expor…"   56 minutes ago      Up 56 minutes       0.0.0.0:9400->9400/tcp   docker.dcgm-exporter.service
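The exporters can also be checked directly over HTTP from a compute node. This is a minimal sketch assuming the default exporter ports (9100 for node-exporter, and 9400 for dcgm-exporter, as shown in the container listing above).

# Verify the exporters are serving metrics on their default ports
$ curl -s localhost:9100/metrics | head
$ curl -s localhost:9400/metrics | grep -i dcgm | head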

Grafana

The default configuration of Grafana comes with two dashboards: one to monitor Slurm usage and jobs, and a GPU dashboard to monitor GPU utilization, power, and other metrics.

To access Grafana, visit http://<slurm-metric ip>:3000/. Once the web page loads, the dashboards can be accessed by clicking the dashboards icon on the left and selecting Manage. This brings you to a page listing all available dashboards. Click a dashboard's name to view it.

Grafana Home

Grafana Select Dashboards

Grafana Dashboards
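Before opening the dashboards, you can confirm that the Grafana service is up by querying its health endpoint. A minimal check, assuming the default port 3000 used above:

# Check that Grafana is running and responding (default port 3000)
$ curl -s http://<slurm-metric ip>:3000/api/health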

The GPU dashboard provides the following metrics:

  • System Load
  • GPU Power Usage (by GPU)
  • GPU Power Total
  • System Memory Usage
  • GPU Temperature (by GPU)
  • GPU Average Temperature
  • PCIe Throughput
  • Disk Usage
  • GPU Memory Usage
  • GPU Utilization (by GPU)
  • GPU Total Utilization
  • Ethernet Throughput
  • Disk Throughput
  • GPU Mem Copy Utilization
  • GPU Average Memory Copy Utilization
  • Kernel
  • Hostname
  • System Total Power Draw
  • GPU SM Clocks
  • GPU Memory Clocks

Metrics are displayed per-node. To view utilization across nodes, click the server selection drop-down in the top left and type or select the desired node names.

Grafana GPU Dashboard

Grafana GPU Node Selection

The Slurm dashboard provides the following information about running jobs and the Slurm job queue:

  • Nodes & Node Status
  • Slurm Jobs & Job State
  • Slurm Agent Queue Size
  • Scheduler Threads & Backfill Depth Mean & Cycles

Grafana Slurm Dashboard

Grafana allows you to view custom time slices, adjust the polling interval, and make many other dashboard customizations.

For more information on creating custom dashboards and the GPU metrics available from Prometheus, refer to the GPU Monitoring Tools README or the Grafana home page.

Prometheus

Prometheus can be used to make ad-hoc queries. Visit Prometheus at http://<slurm-metric ip>:9090/graph and select one or more metrics from the drop-down. Metrics can be visualized using the Graph button, or shown as text by selecting the Console button. Advanced queries can be constructed using the Prometheus query syntax. After constructing a query, click the Execute button and inspect the results.

Prometheus Query
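Queries can also be issued against the Prometheus HTTP API, which is convenient for scripting. The example below is a sketch: the /api/v1/query endpoint is standard Prometheus, but the DCGM_FI_DEV_GPU_UTIL metric name depends on the dcgm-exporter version deployed (older releases used lowercase dcgm_* names) and may need to be adjusted.

# Ad-hoc query via the Prometheus HTTP API: average GPU utilization per node
$ curl -s 'http://<slurm-metric ip>:9090/api/v1/query' \
    --data-urlencode 'query=avg by (instance) (DCGM_FI_DEV_GPU_UTIL)'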

For more information on constructing Prometheus queries, see the official Prometheus getting started guide.

Alertmanager

Alertmanager handles alerts sent by client applications such as the Prometheus server. After adding alert rules to Prometheus, you can receive alerts through receivers such as email or Slack via Alertmanager.

To configure Alertmanager:

  • Configure the desired receiver (Slack, email, etc.) in the Alertmanager configuration YAML.
  • Configure the desired rules in the Prometheus alert rules YAML, as sketched below.
    • Alerting rules let you define alert conditions using Prometheus expression language and send notifications about firing alerts to an external service.
    • Visit http://<slurm-metric ip>:9090/alerts to check the status of the Prometheus alert rules.
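The following is a minimal sketch of an alerting rule, not the rule set DeepOps ships: the file path, rule group name, and node-exporter job label are assumptions and should be adapted to the Prometheus configuration in your deployment. Prometheus must also load the rules file (via rule_files in prometheus.yml) and be reloaded or restarted afterwards.

# Illustrative alert rule: fire when a node-exporter target has been down for 5 minutes
# The path and job label below are hypothetical; adjust them to your Prometheus setup
$ cat <<'EOF' | sudo tee /etc/prometheus/rules/node-down.yml
groups:
  - name: deepops-example
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node-exporter on {{ $labels.instance }} is unreachable"
EOF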

Visit http://<slurm-metric ip>:9093 to check firing alerts, silences, and the Alertmanager configuration.

Alertmanager
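The Alertmanager API can also be queried from the command line, for example to confirm the service is up and to list currently firing alerts. A minimal sketch, assuming the default Alertmanager port 9093:

# Check Alertmanager status and list active alerts via its v2 API
$ curl -s http://<slurm-metric ip>:9093/api/v2/status
$ curl -s http://<slurm-metric ip>:9093/api/v2/alerts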

For more information on configuring alerting, see the official Alertmanager documentation.