
Add drm collector to start having metrics from AMD GPUs #190

Merged: 1 commit into canonical:main on Oct 18, 2024

Conversation

gabrielcocenza
Member

Issue

Start collecting GPU metrics for monitoring data centers that use AMD GPUs.

Solution

Enable the drm collector, which is disabled by default.

Context

Even though not all servers have AMD GPUs, the impact on metric collection time was not perceptible.

On a machine that does not have an AMD GPU, it barely increases the number of metrics:

# HELP node_drm_card_info Card information
# TYPE node_drm_card_info gauge
node_drm_card_info{card="card1",memory_vendor="",power_performance_level="",unique_id="",vendor="amd"} 1
node_drm_card_info{card="card2",memory_vendor="",power_performance_level="",unique_id="",vendor="amd"} 1
# HELP node_drm_gpu_busy_percent How busy the GPU is as a percentage.
# TYPE node_drm_gpu_busy_percent gauge
node_drm_gpu_busy_percent{card="card1"} 0
node_drm_gpu_busy_percent{card="card2"} 0
# HELP node_drm_memory_gtt_size_bytes The size of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_size_bytes gauge
node_drm_memory_gtt_size_bytes{card="card1"} 0
node_drm_memory_gtt_size_bytes{card="card2"} 0
# HELP node_drm_memory_gtt_used_bytes The used amount of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_used_bytes gauge
node_drm_memory_gtt_used_bytes{card="card1"} 0
node_drm_memory_gtt_used_bytes{card="card2"} 0
# HELP node_drm_memory_vis_vram_size_bytes The size of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_size_bytes gauge
node_drm_memory_vis_vram_size_bytes{card="card1"} 0
node_drm_memory_vis_vram_size_bytes{card="card2"} 0
# HELP node_drm_memory_vis_vram_used_bytes The used amount of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_used_bytes gauge
node_drm_memory_vis_vram_used_bytes{card="card1"} 0
node_drm_memory_vis_vram_used_bytes{card="card2"} 0
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes.
# TYPE node_drm_memory_vram_size_bytes gauge
node_drm_memory_vram_size_bytes{card="card1"} 0
node_drm_memory_vram_size_bytes{card="card2"} 0
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes.
# TYPE node_drm_memory_vram_used_bytes gauge
node_drm_memory_vram_used_bytes{card="card1"} 0
node_drm_memory_vram_used_bytes{card="card2"} 0
node_scrape_collector_duration_seconds{collector="drm"} 7.0185e-05
node_scrape_collector_success{collector="drm"} 1

Even though not all servers have AMD GPUs, the performance impact on metric collection time is not considerable.
@gabrielcocenza gabrielcocenza requested a review from a team as a code owner September 25, 2024 19:17
@sed-i
Contributor

sed-i commented Sep 26, 2024

Thanks @gabrielcocenza!
Could you please confirm the collector behaves as you expect when run inside a VM?
grafana-agent-operator is a subordinate charm, so it runs in a VM. Would your GPU setup be reflected inside the VM?

@aieri

aieri commented Sep 26, 2024

On a physical host with an AMD GPU and node_exporter 1.8.2:

$ ./node_exporter-1.8.2.linux-amd64/node_exporter --collector.drm

$ curl -s localhost:9100/metrics | grep -vE '^#' | grep drm
node_drm_card_info{card="card0",memory_vendor="",power_performance_level="auto",unique_id="",vendor="amd"} 1
node_drm_gpu_busy_percent{card="card0"} 1
node_drm_memory_gtt_size_bytes{card="card0"} 3.1410860032e+10
node_drm_memory_gtt_used_bytes{card="card0"} 3.54484224e+08
node_drm_memory_vis_vram_size_bytes{card="card0"} 4.294967296e+09
node_drm_memory_vis_vram_used_bytes{card="card0"} 3.274203136e+09
node_drm_memory_vram_size_bytes{card="card0"} 4.294967296e+09
node_drm_memory_vram_used_bytes{card="card0"} 3.274203136e+09
node_scrape_collector_duration_seconds{collector="drm"} 0.000275032
node_scrape_collector_success{collector="drm"} 1

On a VM with a plain Virtio GPU and no /sys/class/drm directory:

$ curl -s 10.142.199.123:9100/metrics | grep -vE '^#' | grep drm
node_scrape_collector_duration_seconds{collector="drm"} 7.927e-06
node_scrape_collector_success{collector="drm"} 1

@Deezzir

Deezzir commented Oct 2, 2024

Can someone review it? Thanks.

@sed-i
Contributor

sed-i commented Oct 7, 2024

@Deezzir is the behavior reported by @aieri what you expected / planned on using?

Depending on scale, an extra 18 metrics per node could become non-negligible, and it's difficult to draw the line.

There are many "disabled by default" collectors that we don't want to enable on our side. An easy solution is:

  • "enabled by default" collectors are always enabled.
  • "disabled by default" collectors to have a bool config option added on demand, set to true by default.

@simskij you had a strong opinion about conditional collectors in the past. Wdyt?

@Deezzir

Deezzir commented Oct 7, 2024

Yes, it's actually what I need. You can see an example of the metrics used here: canonical/hardware-observer-operator#332. I agree that this amount of metrics will not affect the overall performance.

@aieri

aieri commented Oct 7, 2024

A few remarks:

  • in the absence of a GPU using an open source driver (e.g. a server using an NVIDIA GPU), the collector produces (almost) no metrics - so there wouldn't be much value in autodetecting on the charm side when to enable the drm collector
  • the collector is simply reading files in sysfs, so the queries on the producing side are very lightweight (see the sketch after this list)
  • if you do have an AMD GPU in your deployment I imagine you probably do want to monitor it
  • we looked at the commit history of node_exporter but couldn't find a clear reason why the drm collector is not enabled by default
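
To illustrate how lightweight those reads are, here is a minimal sketch that pulls the same figures straight from sysfs (assuming an amdgpu device at card0; the attribute names correspond to the metrics shown above):

from pathlib import Path

# Minimal sketch: read the amdgpu sysfs attributes behind the drm metrics (card0 assumed).
device = Path("/sys/class/drm/card0/device")
for attr in ("gpu_busy_percent", "mem_info_vram_total", "mem_info_vram_used"):
    path = device / attr
    if path.exists():
        print(attr, path.read_text().strip())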

@sed-i
Contributor

sed-i commented Oct 17, 2024

Hey @gabrielcocenza
Just had a chat with @simskij and we concluded that we'd like to go with the config option approach.
Could you please add a bool config option with a default "false" for the drm collector?

@aieri

aieri commented Oct 17, 2024

Are you sure adding a config option is the right approach here? I think that if we were to do so we'd instantly create the need for a juju-lint rule that checks whether you have it set to true when you are also deploying hardware-observer.

If you have concerns about enabling the drm collector by default, I would propose two alternative paths:

  1. enable the drm collector only if an AMD GPU is detected on the system (rough detection sketch after this list)
  2. extend the cos_agent relation to allow principal charms to pass a list of node_exporter collectors they would want to see enabled, then make hardware-observer request the drm collector when it detects an AMD GPU on the system
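
For option 1, detection could be as simple as checking the PCI vendor ID each DRM card exposes in sysfs (0x1002 is AMD). A minimal sketch, not the proposed charm code:

from pathlib import Path

AMD_PCI_VENDOR_ID = "0x1002"

def has_amd_gpu() -> bool:
    # Each DRM card exposes its PCI vendor ID under .../device/vendor.
    for vendor_file in Path("/sys/class/drm").glob("card*/device/vendor"):
        if vendor_file.read_text().strip().lower() == AMD_PCI_VENDOR_ID:
            return True
    return False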

We are happy to propose changes for either, depending on what you'd prefer.

Obviously I am aware that this charm is, or soon will be, deprecated, so if the idea is that the config option would only be an acceptable shortcut in this specific case, then sure, we can do that. But I don't think it would be the best approach if grafana-agent were here to stay.

@sed-i
Contributor

sed-i commented Oct 18, 2024

Had a chat with @gabrielcocenza and the team and decided to always enable it.

@sed-i sed-i merged commit 51df185 into canonical:main Oct 18, 2024
12 checks passed