Failed to initialize NVML: could not load NVML library. #36

zbjjyy · 2024-03-27T08:01:07Z

ENV :

K8s : v1.23.10
Runtime: docker 20.10.8
NVIDIA System Management Interface -- v535.161.07
Image: 4pdosc/k8s-device-plugin:v0.10.0.4-ubuntu20.04

Issue:

After deploying the plugin DaemonSet, the logs show:

2024/03/27 15:41:13 Loading PciInfo

 0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

 1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

 2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]

 3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)

 4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

 5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446

 6 = 00:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 7 = 00:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 8 = 00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device

 9 = 00:06.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)

 10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

 11 = 00:08.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

 found 00:08.0

 12 = 00:09.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

 13 = 

 pcibusstr= 00:08.0


 2024/03/27 15:41:13 Loading NVML

 2024/03/27 15:41:13 Failed to initialize NVML: could not load NVML library.

 2024/03/27 15:41:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?

 2024/03/27 15:41:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites

 2024/03/27 15:41:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

 2024/03/27 15:41:13 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

I have checked the environment, and nvidia-smi works on the VM:

root@master:/usr/local/vgpu# nvidia-smi 
Wed Mar 27 15:46:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0              23W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
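Note that nvidia-smi succeeding on the VM only shows that the host driver is healthy; the plugin loads NVML from inside its own container. The following is a minimal sketch of that check, assuming Python is available in a container started the same way as the plugin, and assuming the library is looked up under the name `libnvidia-ml.so.1` (the usual NVML soname, not something confirmed in this thread). The helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(library_name="libnvidia-ml.so.1"):
    """Return True if the dynamic loader can open the named shared library.

    The default name is an assumption: it is the usual NVML soname.
    """
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # The loader could not find or open the library in this environment.
        return False


# Run this on the host and then inside a container; a different result
# between the two suggests the container runtime is not exposing the driver.
print("NVML loadable:", can_load())
```

If this reports that the library loads on the host but not in the container, the container runtime configuration is the more likely culprit than the driver installation.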


zbjjyy · 2024-03-28T03:01:25Z

Solved!
The cause was Docker's daemon.json: "runtimes" had been set, but "default-runtime" had not.
Add it alongside the existing "runtimes" entry, like this:

{
"default-runtime": "nvidia"
}

Then restart Docker.
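For anyone who wants to confirm the same condition before restarting, here is a minimal sketch that inspects the contents of a daemon.json and reports the two things described above: whether an "nvidia" entry exists under "runtimes", and whether "default-runtime" points at it. The function name `check_default_runtime` and the sample `path` value in the demo are assumptions for illustration; on most Linux installs the real file lives at /etc/docker/daemon.json, but verify that on your node.

```python
import json


def check_default_runtime(config_text, wanted="nvidia"):
    """Return a list of problems found in a daemon.json document (empty if none)."""
    config = json.loads(config_text)
    problems = []
    if wanted not in config.get("runtimes", {}):
        problems.append(f'"runtimes" has no "{wanted}" entry')
    if config.get("default-runtime") != wanted:
        problems.append(f'"default-runtime" is not "{wanted}"')
    return problems


# Demo on in-memory samples. The "path" value is an assumed, typical entry.
before_fix = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
after_fix = (
    '{"default-runtime": "nvidia",'
    ' "runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
)

print("before:", check_default_runtime(before_fix))
print("after: ", check_default_runtime(after_fix))
```

To check a real node, read the file and pass its text to `check_default_runtime`; an empty list means both settings are present.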
