Add drm collector to start having metrics from AMD GPUs #190
Even though not all servers have AMD GPUs, the performance impact on metric collection time is not significant.
Thanks @gabrielcocenza!
On a physical host with an AMD GPU and node_exporter 1.8.2:
On a VM with a plain Virtio GPU and no
Can someone review it? Thanks.
@Deezzir is the behavior reported by @aieri what you expected / planned on using? Depending on scale, an extra 18 metrics per node could become non-negligible, and it's difficult to draw the line. There are many "disabled by default" collectors that we don't want to enable on our side. An easy solution is:
@simskij you had a strong opinion about conditional collectors in the past. Wdyt?
Yes, it's actually what I need. You can see an example of the metrics used here: canonical/hardware-observer-operator#332. I agree that this number of metrics will not affect overall performance.
A few remarks:
Hey @gabrielcocenza
Are you sure adding a config option is the right approach here? I think that if we were to do so we'd instantly create the need for a juju-lint rule that checks whether you have it set to true when you are also deploying hardware-observer. If you have concerns about enabling the drm collector by default I would propose two alternative paths:
We are happy to propose changes for either, depending on what you'd prefer. Obviously I am aware that this charm is, or soon will be, deprecated, so if the idea is that the config option would only be an OK shortcut in this specific case, then cool, we can do that. But I don't think it would be the best approach if grafana-agent were here to stay.
Had a chat with @gabrielcocenza and the team and decided to always enable it.
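For illustration only, a minimal Python sketch of what "always enable it" could look like when the agent configuration is built as a dictionary. The helper name `node_exporter_integration` is hypothetical, and the `enabled` / `enable_collectors` keys are assumed to follow the Grafana Agent static-mode `node_exporter` integration; this is not the charm's actual code.

```python
def node_exporter_integration(extra_collectors=("drm",)):
    """Build a node_exporter integration section (hypothetical helper).

    Assumes the Grafana Agent static-mode keys `enabled` and
    `enable_collectors`; the collectors listed here are turned on in
    addition to the exporter's defaults.
    """
    return {
        "enabled": True,
        # De-duplicate and sort so the rendered config is stable.
        "enable_collectors": sorted(set(extra_collectors)),
    }


# The drm collector is always included, with no config option to toggle it.
config = {"integrations": {"node_exporter": node_exporter_integration()}}
print(config)
```

Keeping the collector list in one place avoids a separate charm config option, which is the trade-off discussed above.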
Issue
Start collecting GPU metrics for monitoring data centers that use AMD GPUs.
Solution
Enable the `drm` collector, which is disabled by default.
Context
Even though not all servers have AMD GPUs, the impact on metric collection time was not perceptible.
On a machine that does not have AMD GPUs it will barely increase the number of metrics:
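One way to check the count on a given host is to scan the exporter's text output for samples from the collector. The sketch below assumes the collector's metrics share the `node_drm_` prefix; the `SAMPLE` text is made up for illustration and is not output from a real machine.

```python
def count_samples(metrics_text, prefix="node_drm_"):
    """Count sample lines whose metric name starts with `prefix`.

    Comment lines (# HELP / # TYPE) and blank lines are ignored.
    """
    count = 0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            count += 1
    return count


# Made-up exposition text, only to exercise the function.
SAMPLE = """\
# HELP node_drm_card_info Card information
# TYPE node_drm_card_info gauge
node_drm_card_info{card="card0"} 1
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

print(count_samples(SAMPLE))
```

Running the same function on the scraped metrics before and after enabling the collector gives the per-node difference.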