RTX4090 - Fedora 41 (LLM AI) Unable to determine the device handle for GPU0000:01:00.0: Unknown Error #846

Open
kagaho opened this issue Jan 6, 2025 · 0 comments

Hi team, I am having an issue with an RTX 4090 on Fedora 41. It had been working fine since it was set up. While a container running on the GPU was doing embedding-model work for document inference, the GPU failed as shown below. The fan speed was quite high, but the overall temperature did not exceed 65 °C (that temperature was only seen on this system at the time of the issue; it normally sits at 24 °C).
The container runs a small embedding model that embeds documents into a vector database. The same type of load runs normally on a T4 or an A10G.
No monitor is attached.

root@fedora41:~# nvidia-smi

Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

root@fedora41:~# nvidia-debugdump --dumpall

ERROR: GetCaptureBufferSize failed, GPU is lost, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0xf
ERROR: internal_dumpSystemComponent() failed, return code: 0xf
ERROR: GetCaptureBufferSize failed, GPU is lost, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0xf
ERROR: internal_dumpSystemComponent() failed, return code: 0xf
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
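
For triage, a "GPU is lost" state like the one above is often accompanied by an NVRM Xid message in the kernel log. The following is a minimal sketch for pulling those lines out, not output from this system. The helper name `xid_lines` is hypothetical, and the usage line assumes the kernel log is readable through `journalctl`.

```shell
# Hypothetical helper: keep only NVIDIA Xid lines from kernel log text on stdin.
# "|| true" keeps the exit status at 0 when no Xid line is present.
xid_lines() {
    grep -E 'NVRM: Xid' || true
}

# Usage (assumption: systemd journal is available; may need root):
#   journalctl -k -b --no-pager | xid_lines
```

If the machine was rebooted after the failure, `journalctl -k -b -1` reads the kernel log of the previous boot instead.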

/etc/modprobe.d# cat nvidia.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1
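
To confirm that the options in this file were actually picked up by the loaded module, the driver's registry keys can be read back. This is a sketch under assumptions: the helper name `nvreg_value` is hypothetical, and it assumes the driver exposes its keys as `Name: value` lines in `/proc/driver/nvidia/params` (without the `NVreg_` prefix).

```shell
# Hypothetical helper: print the value of one driver registry key.
# $1 = key name without the "NVreg_" prefix
# $2 = params file (defaults to the path assumed for the loaded NVIDIA driver)
nvreg_value() {
    key="$1"
    file="${2:-/proc/driver/nvidia/params}"
    # Lines are expected to look like "PreserveVideoMemoryAllocations: 1"
    awk -F': ' -v k="$key" '$1 == k { print $2 }' "$file"
}

# Usage on the affected host:
#   nvreg_value PreserveVideoMemoryAllocations
```

An empty result would suggest the key name is wrong or the module is not loaded with that option.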

nvidia-bug-report.log.gz (608.3 KB)

cdi-spec.yaml.tgz

user@fedora41:~$ nvidia-ctk cdi generate --device-name-strategy=uuid --output cdi-spec.yaml

INFO[0000] Using /usr/lib64/libnvidia-ml.so.565.77
INFO[0000] Using /usr/lib64/libnvidia-sandboxutils.so.565.77
INFO[0000] Auto-detected mode as 'nvml'
INFO[0000] Using driver version 565.77
WARN[0000] Could not locate /dev/nvidia-modeset: pattern /dev/nvidia-modeset not found
INFO[0000] Selecting /dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools
INFO[0000] Selecting /dev/nvidia-uvm as /dev/nvidia-uvm
INFO[0000] Selecting /dev/nvidiactl as /dev/nvidiactl
INFO[0000] Selecting /usr/lib64/libnvidia-egl-gbm.so.1.1.2 as /usr/lib64/libnvidia-egl-gbm.so.1.1.2
INFO[0000] Selecting /usr/lib64/libnvidia-egl-wayland.so.1.1.17 as /usr/lib64/libnvidia-egl-wayland.so.1.1.17
INFO[0000] Selecting /usr/lib64/libnvidia-allocator.so.565.77 as /usr/lib64/libnvidia-allocator.so.565.77
WARN[0000] Could not locate libnvidia-vulkan-producer.so.565.77: pattern libnvidia-vulkan-producer.so.565.77 not found
libnvidia-vulkan-producer.so.565.77: not found
INFO[0000] Selecting /usr/lib64/xorg/modules/drivers/nvidia_drv.so as /usr/lib64/xorg/modules/drivers/nvidia_drv.so
INFO[0000] Selecting /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.565.77 as /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.565.77
INFO[0000] Selecting /usr/share/glvnd/egl_vendor.d/10_nvidia.json as /usr/share/glvnd/egl_vendor.d/10_nvidia.json
INFO[0000] Selecting /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json as /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
INFO[0000] Selecting /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json as /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
INFO[0000] Selecting /usr/share/nvidia/nvoptix.bin as /usr/share/nvidia/nvoptix.bin
WARN[0000] Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found
INFO[0000] Selecting /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf as /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf
INFO[0000] Selecting /etc/vulkan/icd.d/nvidia_icd.json as /etc/vulkan/icd.d/nvidia_icd.json
WARN[0000] Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found
pattern vulkan/icd.d/nvidia_layers.json not found
INFO[0000] Selecting /etc/vulkan/implicit_layer.d/nvidia_layers.json as /etc/vulkan/implicit_layer.d/nvidia_layers.json
INFO[0000] Selecting /usr/lib64/libEGL_nvidia.so.565.77 as /usr/lib64/libEGL_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLESv1_CM_nvidia.so.565.77 as /usr/lib64/libGLESv1_CM_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLESv2_nvidia.so.565.77 as /usr/lib64/libGLESv2_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLX_nvidia.so.565.77 as /usr/lib64/libGLX_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libcuda.so.565.77 as /usr/lib64/libcuda.so.565.77
INFO[0000] Selecting /usr/lib64/libcudadebugger.so.565.77 as /usr/lib64/libcudadebugger.so.565.77
INFO[0000] Selecting /usr/lib64/libnvcuvid.so.565.77 as /usr/lib64/libnvcuvid.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-allocator.so.565.77 as /usr/lib64/libnvidia-allocator.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-cfg.so.565.77 as /usr/lib64/libnvidia-cfg.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-eglcore.so.565.77 as /usr/lib64/libnvidia-eglcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-encode.so.565.77 as /usr/lib64/libnvidia-encode.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-fbc.so.565.77 as /usr/lib64/libnvidia-fbc.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glcore.so.565.77 as /usr/lib64/libnvidia-glcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glsi.so.565.77 as /usr/lib64/libnvidia-glsi.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glvkspirv.so.565.77 as /usr/lib64/libnvidia-glvkspirv.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gpucomp.so.565.77 as /usr/lib64/libnvidia-gpucomp.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gtk2.so.565.77 as /usr/lib64/libnvidia-gtk2.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gtk3.so.565.77 as /usr/lib64/libnvidia-gtk3.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ml.so.565.77 as /usr/lib64/libnvidia-ml.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ngx.so.565.77 as /usr/lib64/libnvidia-ngx.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-nvvm.so.565.77 as /usr/lib64/libnvidia-nvvm.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-opencl.so.565.77 as /usr/lib64/libnvidia-opencl.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-opticalflow.so.565.77 as /usr/lib64/libnvidia-opticalflow.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-pkcs11-openssl3.so.565.77 as /usr/lib64/libnvidia-pkcs11-openssl3.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-pkcs11.so.565.77 as /usr/lib64/libnvidia-pkcs11.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ptxjitcompiler.so.565.77 as /usr/lib64/libnvidia-ptxjitcompiler.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-rtcore.so.565.77 as /usr/lib64/libnvidia-rtcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-sandboxutils.so.565.77 as /usr/lib64/libnvidia-sandboxutils.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-tls.so.565.77 as /usr/lib64/libnvidia-tls.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-vksc-core.so.565.77 as /usr/lib64/libnvidia-vksc-core.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-wayland-client.so.565.77 as /usr/lib64/libnvidia-wayland-client.so.565.77
INFO[0000] Selecting /usr/lib64/libnvoptix.so.565.77 as /usr/lib64/libnvoptix.so.565.77
INFO[0000] Selecting /usr/lib64/vdpau/libvdpau_nvidia.so.565.77 as /usr/lib64/vdpau/libvdpau_nvidia.so.565.77
WARN[0000] Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found
WARN[0000] Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found
WARN[0000] Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found
INFO[0000] Selecting /lib/firmware/nvidia/565.77/gsp_ga10x.bin as /lib/firmware/nvidia/565.77/gsp_ga10x.bin
INFO[0000] Selecting /lib/firmware/nvidia/565.77/gsp_tu10x.bin as /lib/firmware/nvidia/565.77/gsp_tu10x.bin
INFO[0000] Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi
INFO[0000] Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump
INFO[0000] Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced
INFO[0000] Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control
INFO[0000] Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server
INFO[0000] Generated CDI spec with version 0.8.0
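
The log above is long, and the only non-INFO entries are the "Could not locate" warnings. A small sketch for listing just the patterns that were not found is given here. The helper name `missing_patterns` is hypothetical, and the usage line assumes `nvidia-ctk` writes its log to stderr.

```shell
# Hypothetical helper: from nvidia-ctk log text on stdin, print the pattern
# named in each "WARN[...] Could not locate <pattern>: ..." line.
missing_patterns() {
    sed -n 's/^WARN\[[0-9]*] Could not locate \([^:]*\):.*/\1/p'
}

# Usage (assumption: the log goes to stderr):
#   nvidia-ctk cdi generate --device-name-strategy=uuid --output cdi-spec.yaml 2>&1 | missing_patterns
```

Continuation lines that repeat part of a warning without the `WARN[...]` prefix are ignored, so each missing pattern is printed once.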

Thanks!
