
Add drm collector to start having metrics from AMD GPUs #190

Merged: 1 commit into canonical:main on Oct 18, 2024

Conversation

gabrielcocenza
Member

Issue

Start collecting GPU metrics for monitoring data centers that use AMD GPUs.

Solution

Enable the drm collector, which is disabled by default.

Context

Even though not all servers have AMD GPUs, the impact on metric collection time was not perceptible.

On a machine that does not have an AMD GPU, it barely increases the number of metrics:

# HELP node_drm_card_info Card information
# TYPE node_drm_card_info gauge
node_drm_card_info{card="card1",memory_vendor="",power_performance_level="",unique_id="",vendor="amd"} 1
node_drm_card_info{card="card2",memory_vendor="",power_performance_level="",unique_id="",vendor="amd"} 1
# HELP node_drm_gpu_busy_percent How busy the GPU is as a percentage.
# TYPE node_drm_gpu_busy_percent gauge
node_drm_gpu_busy_percent{card="card1"} 0
node_drm_gpu_busy_percent{card="card2"} 0
# HELP node_drm_memory_gtt_size_bytes The size of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_size_bytes gauge
node_drm_memory_gtt_size_bytes{card="card1"} 0
node_drm_memory_gtt_size_bytes{card="card2"} 0
# HELP node_drm_memory_gtt_used_bytes The used amount of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_used_bytes gauge
node_drm_memory_gtt_used_bytes{card="card1"} 0
node_drm_memory_gtt_used_bytes{card="card2"} 0
# HELP node_drm_memory_vis_vram_size_bytes The size of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_size_bytes gauge
node_drm_memory_vis_vram_size_bytes{card="card1"} 0
node_drm_memory_vis_vram_size_bytes{card="card2"} 0
# HELP node_drm_memory_vis_vram_used_bytes The used amount of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_used_bytes gauge
node_drm_memory_vis_vram_used_bytes{card="card1"} 0
node_drm_memory_vis_vram_used_bytes{card="card2"} 0
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes.
# TYPE node_drm_memory_vram_size_bytes gauge
node_drm_memory_vram_size_bytes{card="card1"} 0
node_drm_memory_vram_size_bytes{card="card2"} 0
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes.
# TYPE node_drm_memory_vram_used_bytes gauge
node_drm_memory_vram_used_bytes{card="card1"} 0
node_drm_memory_vram_used_bytes{card="card2"} 0
node_scrape_collector_duration_seconds{collector="drm"} 7.0185e-05
node_scrape_collector_success{collector="drm"} 1

Even though not all servers have AMD GPUs, the performance impact on metric collection time is not considerable.
@gabrielcocenza gabrielcocenza requested a review from a team as a code owner September 25, 2024 19:17
@sed-i
Contributor

sed-i commented Sep 26, 2024

Thanks @gabrielcocenza!
Could you please confirm the collector behaves as you expect when run inside a VM?
grafana-agent-operator is a subordinate charm, so it runs in a VM. Would your GPU setup be reflected inside the VM?

@aieri

aieri commented Sep 26, 2024

On a physical host with an AMD GPU and node_exporter 1.8.2:

$ ./node_exporter-1.8.2.linux-amd64/node_exporter --collector.drm

$ curl -s localhost:9100/metrics | grep -vE '^#' | grep drm
node_drm_card_info{card="card0",memory_vendor="",power_performance_level="auto",unique_id="",vendor="amd"} 1
node_drm_gpu_busy_percent{card="card0"} 1
node_drm_memory_gtt_size_bytes{card="card0"} 3.1410860032e+10
node_drm_memory_gtt_used_bytes{card="card0"} 3.54484224e+08
node_drm_memory_vis_vram_size_bytes{card="card0"} 4.294967296e+09
node_drm_memory_vis_vram_used_bytes{card="card0"} 3.274203136e+09
node_drm_memory_vram_size_bytes{card="card0"} 4.294967296e+09
node_drm_memory_vram_used_bytes{card="card0"} 3.274203136e+09
node_scrape_collector_duration_seconds{collector="drm"} 0.000275032
node_scrape_collector_success{collector="drm"} 1

On a VM with a plain Virtio GPU and no /sys/class/drm directory:

$ curl -s 10.142.199.123:9100/metrics | grep -vE '^#' | grep drm
node_scrape_collector_duration_seconds{collector="drm"} 7.927e-06
node_scrape_collector_success{collector="drm"} 1

@Deezzir

Deezzir commented Oct 2, 2024

Can someone review it? Thanks.

@sed-i
Contributor

sed-i commented Oct 7, 2024

@Deezzir is the behavior reported by @aieri what you expected / planned on using?

Depending on scale, an extra 18 metrics per node could become non-negligible, and it's difficult to draw the line.

There are many "disabled by default" collectors that we don't want to enable on our side. An easy solution is:

  • "enabled by default" collectors are always enabled.
  • "disabled by default" collectors to have a bool config option added on demand, set to true by default.

@simskij you had a strong opinion about conditional collectors in the past. Wdyt?

@Deezzir

Deezzir commented Oct 7, 2024

Yes, it's actually what I need. You can see an example of the metrics used here: canonical/hardware-observer-operator#332. I agree that this amount of metrics will not affect the overall performance.

@aieri

aieri commented Oct 7, 2024

A few remarks:

  • in the absence of a GPU using an open source driver (e.g. a server using an NVIDIA GPU), the collector produces (almost) no metrics - so there wouldn't be much value in autodetecting on the charm side when to enable the drm collector
  • the collector is simply reading files in sysfs, so the queries on the producing side are very lightweight (see the sketch after this list)
  • if you do have an AMD GPU in your deployment I imagine you probably do want to monitor it
  • we looked at the commit history of node_exporter but couldn't find a clear reason why the drm collector is not enabled by default
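
To illustrate how lightweight those reads are, here is a minimal sketch that pulls the same figures straight from sysfs (assuming an amdgpu device at card0; the attribute names correspond to the metrics shown above):

from pathlib import Path

# Minimal sketch: read the amdgpu sysfs attributes behind the drm metrics (card0 assumed).
device = Path("/sys/class/drm/card0/device")
for attr in ("gpu_busy_percent", "mem_info_vram_total", "mem_info_vram_used"):
    path = device / attr
    if path.exists():
        print(attr, path.read_text().strip())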

@sed-i
Contributor

sed-i commented Oct 17, 2024

Hey @gabrielcocenza
Just had a chat with @simskij and we concluded that we'd like to go with the config option approach.
Could you please add a bool config option with a default "false" for the drm collector?

@aieri

aieri commented Oct 17, 2024

Are you sure adding a config option is the right approach here? I think that if we were to do so we'd instantly create the need for a juju-lint rule that checks whether you have it set to true when you are also deploying hardware-observer.

If you have concerns about enabling the drm collector by default, I would propose two alternative paths:

  1. enable the drm collector only if an AMD GPU is detected on the system (rough detection sketch after this list)
  2. extend the cos_agent relation to allow principal charms to pass a list of node_exporter collectors they would want to see enabled, then make hardware-observer request the drm collector when it detects an AMD GPU on the system
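
For option 1, detection could be as simple as checking the PCI vendor ID each DRM card exposes in sysfs (0x1002 is AMD). A minimal sketch, not the proposed charm code:

from pathlib import Path

AMD_PCI_VENDOR_ID = "0x1002"

def has_amd_gpu() -> bool:
    # Each DRM card exposes its PCI vendor ID under .../device/vendor.
    for vendor_file in Path("/sys/class/drm").glob("card*/device/vendor"):
        if vendor_file.read_text().strip().lower() == AMD_PCI_VENDOR_ID:
            return True
    return False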

We are happy to propose changes for either, depending on what you'd prefer.

Obviously I am aware that this charm is, or soon will be, deprecated, so if the idea is that the config option would only be an acceptable shortcut in this specific case, then sure, we can do that. But I don't think it would be the best approach if grafana-agent were here to stay.

@sed-i
Contributor

sed-i commented Oct 18, 2024

Had a chat with @gabrielcocenza and the team and decided to always enable it.

@sed-i sed-i merged commit 51df185 into canonical:main Oct 18, 2024
12 checks passed