
Grafana customizable GPU uuid #124

Open
Cyria7 opened this issue Oct 10, 2023 · 3 comments

Comments

@Cyria7

Cyria7 commented Oct 10, 2023

First of all, I would like to say this repo is great work, and it partly solves my requirements.
Since my lab's compute cards are distributed across different hosts with different IPs, I can't easily tell which GPU belongs to which server by UUID alone. So I'm wondering: is the name shown in the GPU switcher in the top left corner of the dashboard customizable?

@abbottjlu

The command nvidia-smi -L can output:

GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
GPU 2: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-aAc30fe0e-d40c-2fb3-a4d5-1136a76066X7)
GPU 3: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-8Adeb867-b3ee-37b9-b81b-2c432e01cbX4)

To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial.

Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity. A per-host command to build that mapping is sketched below.
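
For what it's worth, the index-to-UUID mapping can also be dumped in a machine-readable form via nvidia-smi's query interface (index, name, and uuid are standard --query-gpu fields):

nvidia-smi --query-gpu=index,name,uuid --format=csv,noheader

On the box above this prints lines like 0, NVIDIA GeForce RTX 3080 Ti, GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe. Running it on each host would give exactly the GPU X@hostname mapping suggested here.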

@utkuozdemir
Owner

> To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial. Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity.

Those indices are probably not consistent, i.e., they can come back in a different order. Even if they don't most of the time, swap the PCI slots of two GPUs and I bet they'd appear in a different order.

So I'm not sure, tbh, whether this exporter should address this. I think you can achieve what you need with Prometheus relabel_configs.
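
For example, something along these lines in your Prometheus scrape config should attach the server name to every GPU series. This is a rough, untested sketch; the target hostnames and exporter port are placeholders for your setup:

scrape_configs:
  - job_name: nvidia_gpu
    static_configs:
      - targets: ["gpu01:9835", "gpu02:9835"]  # one exporter per host
    relabel_configs:
      # copy the scrape target address into a "hostname" label, minus the port
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: hostname
        replacement: '$1'

Grafana can then use the hostname label (e.g., in the legend or a dashboard variable) to show GPU X@hostname style names. Note that Prometheus already attaches an instance label containing the scrape address, so depending on how your targets are named, that may even be enough on its own.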

@abbottjlu

> Those indices are probably not consistent, i.e., they can come back in a different order. Even if they don't most of the time, swap the PCI slots of two GPUs and I bet they'd appear in a different order.

When running the command sinfo -N -O NodeList,CPUsState,CPUsLoad,Memory,FreeMem,AllocMem,Partition,StateCompact,Gres:25,GresUsed:40 | column -t,
I consistently see the following output:

NODELIST  CPUS(A/I/O/T)  CPU_LOAD  MEMORY   FREE_MEM  ALLOCMEM  PARTITION  STATE  GRES                  GRES_USED
gpu01     0/48/0/48      0.08      192676   181531    0         dev        idle   gpu:geforce:8(S:0-1)  gpu:geforce:0(IDX:N/A)
gpu02     31/17/0/48     8.29      192676   159145    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu03     48/0/0/48      2.26      192676   162526    96000     gpu*       alloc  gpu:geforce:8(S:0-1)  gpu:geforce:2(IDX:0,4)
gpu04     31/17/0/48     8.17      192676   160330    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu05     47/1/0/48      10.02     192676   157138    105552    gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0,4-7)
gpu06     8/40/0/48      8.18      192676   169780    16000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu07     13/35/0/48     13.20     192676   169248    37552     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu08     38/10/0/48     15.18     192676   159122    99104     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0-2,4,7)

I hypothesized that the IDX value would remain unchanged across reboots.
Given the stability of my hardware setup, I believe the IDX should be deterministic.

I understand that the database currently uses GPU UUIDs as primary keys.
Would it be feasible to retain UUID-based storage while dynamically resolving the corresponding IDX at query time?
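
(If the exporter ever exposed that mapping as an info-style metric, say gpu_index_info{uuid="GPU-...", index="0"} 1, where both the metric name and its labels are hypothetical, the resolution could happen in PromQL at query time, e.g.:

some_gpu_metric * on (uuid) group_left (index) gpu_index_info

where some_gpu_metric stands for any UUID-labelled series from the exporter.)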

> I think you can achieve what you need with Prometheus relabel_configs.

Thanks for the suggestion. I'm still relatively new to Grafana/Prometheus (about three days in), so I'll need to do some more research on relabel_configs to figure out how to implement it.
