Error: OCI runtime error: --device nvidia.com/gpu=all podman nvidia container #239

Closed
viprob-ai opened this issue Nov 8, 2024 · 4 comments

viprob-ai commented Nov 8, 2024

Hi @elezar

Client:       Podman Engine
Version:      5.3.0-dev
API Version:  5.3.0-dev
Go Version:   go1.23.2
Git Commit:   e0cd12ea8c71f369f02545b2cb2b1eb851762433
Built:        Thu Nov  7 00:21:46 2024
OS/Arch:      linux/amd64

cat /etc/containers/containers.conf

[engine]
helper_binaries_dir = ["/usr/local/bin/"]
runtime = "/usr/local/bin/crun"
podman info:
host:
  arch: amd64
  buildahVersion: 1.38.0-dev
  cgroupControllers:
  - memory
  - pids
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon_2.0.25+ds1-1.1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpuUtilization:
    idlePercent: 97.32
    systemPercent: 1.1
    userPercent: 1.58
  cpus: 28
  databaseBackend: boltdb
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: file
  freeLocks: 2040
  hostname: 01HW2485848
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.8.0-48-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 3786375168
  memTotal: 33319747584
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
    package: Unknown
    path: /usr/local/bin/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: /usr/local/bin/crun
    package: Unknown
    path: /usr/local/bin/crun
    version: |-
      crun version 1.18.2.0.0.0.1-0183
      commit: 01830cb038fe970fbd86856fe746fbea0eabfe28
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: /usr/local/bin/pasta
    package: Unknown
    version: |
      pasta 2024_10_30.ee7d0b6-3-g5e93bcd
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.0.1-2_amd64
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 2045767680
  swapTotal: 2046816256
  uptime: 2h 5m 26.00s (Approximately 0.08 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /home/user/.config/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 1
    stopped: 3
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/user/.local/share/containers/storage
  graphRootAllocated: 981132795904
  graphRootUsed: 174910357504
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 33
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/user/.local/share/containers/storage/volumes
version:
  APIVersion: 5.3.0-dev
  Built: 1730919106
  BuiltTime: Thu Nov  7 00:21:46 2024
  GitCommit: e0cd12ea8c71f369f02545b2cb2b1eb851762433
  GoVersion: go1.23.2
  Os: linux
  OsArch: linux/amd64
  Version: 5.3.0-dev

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04 AS base
RUN apt-get update \
 && apt-get install -y -q --no-install-recommends \
  libglvnd0 \
  libgl1 \
  libglx0 \
  libgl1-mesa-dev mesa-utils libgl1-mesa-glx \
  #### Clean up
  && apt-get autoremove -y \
  && apt-get clean -y \
  && rm -rf /var/lib/apt/lists/* 

I built an image from the Dockerfile above and tagged it image:dev.

Now I have two images locally:
1. nvidia/cuda:11.8.0-runtime-ubuntu22.04
2. image:dev

NOTE: I followed the NVIDIA CDI instructions for Podman.

If I run both containers with the same settings, the first case (nvidia/cuda:11.8.0-runtime-ubuntu22.04) works fine, but the built image (image:dev) fails with an NVIDIA hook error.

podman run -it --device nvidia.com/gpu=all --security-opt=label=disable nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi -L

CASE 1: Output
==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX 2000 Ada Generation Laptop GPU (UUID: GPU-b9873a9a-922f-f01b-8de9-9c4812a5e96f)

podman run -it --device nvidia.com/gpu=all --security-opt=label=disable image:dev nvidia-smi -L

CASE 2: Output

Error: OCI runtime error: /usr/local/bin/crun: {"msg":"error executing hook `/usr/bin/nvidia-cdi-hook` (exit code: 1)","level":"error","time":"2024-11-08T05:39:18.170391Z"}

But if I comment out libgl1, libglx0, and the libgl1-mesa-dev, mesa-utils, and libgl1-mesa-glx packages and rebuild the image, it works just fine:

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04 AS base
RUN apt-get update \
 && apt-get install -y -q --no-install-recommends \
  libglvnd0 \
  # comment these 
  #libgl1 \
  #libglx0 \
  #libgl1-mesa-dev mesa-utils libgl1-mesa-glx \
  #### Clean up
  && apt-get autoremove -y \
  && apt-get clean -y \
  && rm -rf /var/lib/apt/lists/* 

elezar (Contributor) commented Nov 8, 2024

@viprob-ai there is a known issue in the NVIDIA Container Toolkit v1.17.0 that would trigger this behaviour. We have a fix in progress and will release an update early next week.

For now it is recommended to downgrade to the v1.16.2 nvidia-container-toolkit-base package.
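
The downgrade suggested above could be sketched as follows for an apt-based host such as the Ubuntu 22.04 system in the report. This is a sketch under assumptions, not a command confirmed in this thread: it assumes the NVIDIA apt repository is configured, that the package version string is `1.16.2-1` (worth checking with `apt-cache madison nvidia-container-toolkit-base`), and that the related toolkit packages are pinned to the same version so apt can satisfy their dependencies. `pin_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Assumed version string; verify against what the repository actually offers.
TOOLKIT_VERSION="1.16.2-1"

# Hypothetical helper: print (do not run) an apt-get command that pins
# the toolkit packages to the version given as $1.
pin_cmd() {
  printf 'sudo apt-get install -y --allow-downgrades'
  for pkg in nvidia-container-toolkit nvidia-container-toolkit-base \
             libnvidia-container-tools libnvidia-container1; do
    printf ' %s=%s' "$pkg" "$1"
  done
  printf '\n'
}

pin_cmd "$TOOLKIT_VERSION"
```

After a version change, regenerating the CDI specification with `nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` may also be needed, although this thread does not confirm that step.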

elezar self-assigned this Nov 8, 2024
elezar (Contributor) commented Nov 8, 2024

viprob-ai (Author) commented

@elezar

Thanks for the quick response.
Downgrading to the v1.16.2 nvidia-container-toolkit-base package resolved the issue.

elezar (Contributor) commented Nov 14, 2024

Thanks @viprob-ai. Note that we have also released v1.17.1 which should also address this issue.

Please open an issue against the NVIDIA Container Toolkit if you have additional problems.
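
To check whether a host is on the release discussed in this thread, the installed package version can be compared against v1.17.0. The sketch below is illustrative only: `toolkit_status` is a hypothetical helper, it knows nothing beyond what this thread reports (v1.17.0 affected), and the `dpkg-query` usage assumes a Debian or Ubuntu host with the package installed via apt.

```shell
# Hypothetical helper: classify a toolkit version using only what this
# thread reports. Versions other than 1.17.0 are not assessed here.
toolkit_status() {
  # Strip a Debian revision suffix such as "-1" before comparing.
  case "${1%%-*}" in
    1.17.0) echo "affected" ;;
    *)      echo "not-reported" ;;
  esac
}

# On a Debian/Ubuntu host (assumption), the installed version could be read with:
#   toolkit_status "$(dpkg-query -W -f='${Version}' nvidia-container-toolkit-base)"
toolkit_status "1.17.0-1"
```

A result of `not-reported` only means the version is not the one named in this thread; it says nothing about other issues.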

elezar closed this as completed Nov 14, 2024