
Grafana customizable GPU uuid #124

Open
Cyria7 opened this issue Oct 10, 2023 · 3 comments

Comments

@Cyria7

Cyria7 commented Oct 10, 2023

First of all, I would like to say this repo is great work, and it partly solves my requirements.
Since my lab's compute cards are distributed across different hosts with different IPs, I can't easily tell which GPU belongs to which server by UUID alone. So I'm wondering: is the name shown in the GPU switcher in the top left corner of the dashboard customizable?

@abbottjlu

The command nvidia-smi -L can output:

GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-1A61a9a-19f7-9670-bcc2-c0e3a6074cX8)
GPU 2: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-aAc30fe0e-d40c-2fb3-a4d5-1136a76066X7)
GPU 3: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-8Adeb867-b3ee-37b9-b81b-2c432e01cbX4)

To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial.

Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity. A per-host command to build that mapping is sketched below.
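
For what it's worth, the index-to-UUID mapping can also be dumped in a machine-readable form via nvidia-smi's query interface (index, name, and uuid are standard --query-gpu fields):

nvidia-smi --query-gpu=index,name,uuid --format=csv,noheader

On the box above this prints lines like 0, NVIDIA GeForce RTX 3080 Ti, GPU-0Af501e0-4fa6-bc45-5993-92f20a1f76Xe. Running it on each host would give exactly the GPU X@hostname mapping suggested here.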

@utkuozdemir
Owner

> To enhance readability, mapping the UUID to GPU indices (e.g., GPU 0, GPU 1) would be beneficial. Furthermore, considering that GPUs might be distributed across multiple servers, a notation like GPU X@hostname would provide more context and clarity.

Those indices are probably not consistent, i.e., they can come back in a different order. Even if they don't most of the time, swap the PCI slots of two GPUs and I bet they'd appear in a different order.

So I'm not sure, tbh, whether this exporter should address this. I think you can achieve what you need with Prometheus relabel_configs.
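
For example, something along these lines in your Prometheus scrape config should attach the server name to every GPU series. This is a rough, untested sketch; the target hostnames and exporter port are placeholders for your setup:

scrape_configs:
  - job_name: nvidia_gpu
    static_configs:
      - targets: ["gpu01:9835", "gpu02:9835"]  # one exporter per host
    relabel_configs:
      # copy the scrape target address into a "hostname" label, minus the port
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: hostname
        replacement: '$1'

Grafana can then use the hostname label (e.g., in the legend or a dashboard variable) to show GPU X@hostname style names. Note that Prometheus already attaches an instance label containing the scrape address, so depending on how your targets are named, that may even be enough on its own.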

@abbottjlu

> Those indices are probably not consistent, i.e., they can come back in a different order. Even if they don't most of the time, swap the PCI slots of two GPUs and I bet they'd appear in a different order.

When running the command sinfo -N -O NodeList,CPUsState,CPUsLoad,Memory,FreeMem,AllocMem,Partition,StateCompact,Gres:25,GresUsed:40 | column -t,
I consistently see the following output:

NODELIST  CPUS(A/I/O/T)  CPU_LOAD  MEMORY   FREE_MEM  ALLOCMEM  PARTITION  STATE  GRES                  GRES_USED
gpu01     0/48/0/48      0.08      192676   181531    0         dev        idle   gpu:geforce:8(S:0-1)  gpu:geforce:0(IDX:N/A)
gpu02     31/17/0/48     8.29      192676   159145    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu03     48/0/0/48      2.26      192676   162526    96000     gpu*       alloc  gpu:geforce:8(S:0-1)  gpu:geforce:2(IDX:0,4)
gpu04     31/17/0/48     8.17      192676   160330    62000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu05     47/1/0/48      10.02     192676   157138    105552    gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0,4-7)
gpu06     8/40/0/48      8.18      192676   169780    16000     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu07     13/35/0/48     13.20     192676   169248    37552     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:8(IDX:0-7)
gpu08     38/10/0/48     15.18     192676   159122    99104     gpu*       mix    gpu:geforce:8(S:0-1)  gpu:geforce:5(IDX:0-2,4,7)

I hypothesized that the IDX value would remain unchanged across reboots.
Given the stability of my hardware setup, I believe the IDX should be deterministic.

I understand that the database currently uses GPU UUIDs as primary keys.
Would it be feasible to retain UUID-based storage while dynamically resolving the corresponding IDX at query time?
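
(If the exporter ever exposed that mapping as an info-style metric, say gpu_index_info{uuid="GPU-...", index="0"} 1, where both the metric name and its labels are hypothetical, the resolution could happen in PromQL at query time, e.g.:

some_gpu_metric * on (uuid) group_left (index) gpu_index_info

where some_gpu_metric stands for any UUID-labelled series from the exporter.)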

> I think you can achieve what you need with Prometheus relabel_configs.

Thanks for the suggestion. I'm still relatively new to Grafana/Prometheus (about three days in), so I'll need to do some more research on relabel_configs to figure out how to implement it.
