Failed to initialize NVML: Unknown Error #110

Open
jangrewe opened this issue May 24, 2023 · 4 comments

Comments

@jangrewe

Describe the bug
I'm running the current version of your Docker image, and it works most of the time - but sometimes it starts to fail and I need to restart the container.
It sometimes runs for a whole day, and sometimes only for a couple of minutes.

To Reproduce
Steps to reproduce the behavior:

  1. Systemd Unit ExecStart:
/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --gpus all \
  -p 9835:9835 \
  -v /dev/nvidiactl:/dev/nvidiactl \
  -v /dev/nvidia0:/dev/nvidia0 \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
  -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
  utkuozdemir/nvidia_gpu_exporter:1.2.0

Expected behavior
I'd expect the exporter to not start throwing errors ;-)

Console output
(Disregard the mismatched timestamps; I copy-pasted the error first and then added the initial log from starting the container.)

May 24 19:01:22 hades systemd[1]: Stopped Prometheus Nvidia GPU Exporter.
May 24 19:01:22 hades systemd[1]: Starting Prometheus Nvidia GPU Exporter...
May 24 19:01:22 hades docker[1915038]: prometheus-nvidia-gpu-exporter
May 24 19:01:23 hades docker[1915048]: 1.2.0: Pulling from utkuozdemir/nvidia_gpu_exporter
May 24 19:01:23 hades docker[1915048]: Digest: sha256:cc407f77ab017101ce233a0185875ebc75d2a0911381741b20ad91f695e488c7
May 24 19:01:23 hades docker[1915048]: Status: Image is up to date for utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades docker[1915048]: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades systemd[1]: Started Prometheus Nvidia GPU Exporter.
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:232 level=info msg="Listening on" address=[::]:9835
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:235 level=info msg="TLS is disabled." http2=false address=[::]:9835
[...]
May 24 19:00:45 hades docker[1903720]: ts=2023-05-24T17:00:45.428Z caller=exporter.go:184 level=error error="error running command: exit status 255: command failed. code: 255 | command: nvidia-smi --query-gpu=timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.
mode.pending,fabric.state,fabric.status --format=csv | stdout: Failed to initialize NVML: Unknown Error\n | stderr: "

(The error from the title is at the end of this very long last line.)

Model and Version

  • GPU Model: RTX 4070 Ti
  • App version: 1.2.0 amd64
  • Installation method: Docker image
  • Operating System: Debian 11/bullseye
  • Nvidia GPU driver version: 525.116.04 (see nvidia-smi output below)

Running on Docker with Nvidia Container Toolkit:

$ docker info
Client: Docker Engine - Community
 Version:    24.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 84
  Running: 83
  Paused: 0
  Stopped: 1
 Images: 87
 Server Version: 24.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-23-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.7GiB
 Docker Root Dir: /srv/docker
 Debug Mode: false
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ dpkg -l | grep nvidia
ii  libnvidia-container-tools             1.13.1-1                                                                   amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.13.1-1                                                                   amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit              1.13.1-1                                                                   amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.13.1-1                                                                   amd64        NVIDIA Container Toolkit Base
$ nvidia-smi
Wed May 24 19:10:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:42:00.0 Off |                  N/A |
|  0%   56C    P2    34W / 285W |   5122MiB / 12282MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    937698      C   /usr/bin/zmc                      225MiB |
|    0   N/A  N/A   3332933      C   python3                          1838MiB |
|    0   N/A  N/A   3469008      C   python                           3056MiB |
+-----------------------------------------------------------------------------+
@utkuozdemir
Owner

It seems the error on stdout is

Failed to initialize NVML: Unknown Error

When I Google the error, I find very similar issues such as:

Can you have a look at them? I don't think this is an issue with the exporter, because the exporter is just a dumb tool that runs the nvidia-smi command each time it is probed.
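
One quick way to confirm that (just a suggestion, using the container name from your unit file) is to run nvidia-smi inside the already-running exporter container; if it prints the same error, the exporter itself is out of the picture:

$ docker exec prometheus-nvidia-gpu-exporter nvidia-smi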

@nicklausbrown

I've tried getting a number of Nvidia tools working on Docker before, and I think I see something in your docker info output that could be the problem, @jangrewe: while you have the nvidia runtime installed, it is not your default runtime. Perhaps that is the issue?

Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
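
If that turns out to be it, one way to test is to make nvidia the default runtime in /etc/docker/daemon.json (a sketch on my end, not from the exporter docs; restart dockerd afterwards):

$ cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}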

@jangrewe
Author

Thanks @nicklausbrown, I'll try running with --runtime nvidia --privileged to see if that fixes the intermittent errors - maybe using the proper runtime keeps nvidia-smi and/or the exporter from tripping up. 🙂
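
For reference, the adjusted ExecStart I plan to try (my own adaptation of the unit above; with the nvidia runtime the explicit library and device bind mounts should hopefully no longer be needed, but I haven't verified that yet):

/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --runtime nvidia \
  --privileged \
  --gpus all \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter:1.2.0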

@y3ti

y3ti commented Oct 2, 2023

Here is a very good explanation of this issue: NVIDIA/nvidia-docker#1730
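
The short version, as I understand it (my summary, so take it with a grain of salt): with cgroup v2 and the systemd cgroup driver, a systemd daemon-reload re-applies the container's device cgroup rules and revokes access to the /dev/nvidia* nodes that the nvidia runtime injected behind Docker's back, so nvidia-smi inside a previously healthy container suddenly starts failing with "Failed to initialize NVML: Unknown Error" until the container is restarted. One workaround discussed there is to request the device nodes explicitly so Docker tracks them itself, roughly:

/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --gpus all \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter:1.2.0

(The device paths are the usual defaults; adjust to whatever actually exists on the host.)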
