[Bug]: DinD doesn't allow passing --gpus flag #1855

Open
mtaran opened this issue Oct 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@mtaran
Contributor

mtaran commented Oct 17, 2024

Steps to reproduce

  1. make a repro.dstack.yml with:
type: task
name: my-repro-task
image: dstackai/dind:latest
privileged: true
commands:
  - start-dockerd
  - sleep infinity
resources:
  cpu: 4..
  memory: 6GB..
  gpu:
    count: 1
  2. dstack apply -f repro.dstack.yml -y
  3. in another terminal: dstack attach my-repro-task
  4. in yet another terminal: ssh my-repro-task
  5. in the ssh session, try to run docker run --rm --gpus=all hello-world
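
For reference, the steps above as a single shell sequence; this is only a consolidation of the commands already listed (my-repro-task is the run name from the config):

# terminal 1: submit the task
dstack apply -f repro.dstack.yml -y

# terminal 2: attach to the run
dstack attach my-repro-task

# terminal 3: SSH in and try to use the GPUs from inside DinD
ssh my-repro-task
docker run --rm --gpus=all hello-world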

Actual behaviour

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.
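
The error suggests that the NVIDIA container runtime hook, auto-detected here as 'legacy' mode, tries to bind-mount host driver utilities such as /usr/bin/nvidia-smi into the new container; in a DinD setup the "host" seen by the inner dockerd is the outer dstackai/dind container, so if the binary isn't present there the mount fails. A hedged diagnostic sketch, to be run in the same SSH session (libnvidia-container may or may not be installed in the image):

# check whether the utilities the hook tries to mount exist where it looks for them
ls -l /usr/bin/nvidia-smi /usr/bin/nvidia-persistenced /usr/bin/nv-fabricmanager

# if libnvidia-container is available, show what it detects and would mount
nvidia-container-cli info || true
nvidia-container-cli list --binaries || true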

Expected behaviour

The container should run, with access to all the GPUs of the host.

dstack version

0.18.18

Server logs

No response

Additional information

No response

mtaran added the bug label Oct 17, 2024
@peterschmidt85
Contributor

@mtaran Is the issue still relevant? You mentioned you made it work.

@un-def
Collaborator

un-def commented Oct 18, 2024

As TensorDock is “a marketplace of independent hosts”, the setups are not consistent at all.

In addition to

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.

I've got

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nv-fabricmanager: no such file or directory: unknown

and

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-persistenced: no such file or directory: unknown

when requesting instances with the same resources configuration.

NVIDIA/CUDA driver versions also vary:

NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

and

NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4   

on two NVIDIA RTX A4000 instances in the same region.
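
A hedged sketch of commands one could run on each instance to capture these host-to-host differences (assumes nvidia-smi is present on the host):

# record driver and GPU details for comparison across hosts
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# check which NVIDIA userspace utilities the host actually ships
which nvidia-smi nvidia-persistenced nv-fabricmanager || true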
