[Bug]: DinD doesn't allow passing --gpus flag #1855

Open
mtaran opened this issue Oct 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@mtaran
Contributor

mtaran commented Oct 17, 2024

Steps to reproduce

  1. make a repro.dstack.yml with:
type: task
name: my-repro-task
image: dstackai/dind:latest
privileged: true
commands:
  - start-dockerd
  - sleep infinity
resources:
  cpu: 4..
  memory: 6GB..
  gpu:
    count: 1
  2. dstack apply -f repro.dstack.yml -y
  3. in another terminal: dstack attach my-repro-task
  4. in yet another terminal: ssh my-repro-task
  5. in the ssh session, try to run docker run --rm --gpus=all hello-world
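
For reference, the steps above as a single shell sequence; this is only a consolidation of the commands already listed (my-repro-task is the run name from the config):

# terminal 1: submit the task
dstack apply -f repro.dstack.yml -y

# terminal 2: attach to the run
dstack attach my-repro-task

# terminal 3: SSH in and try to use the GPUs from inside DinD
ssh my-repro-task
docker run --rm --gpus=all hello-world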

Actual behaviour

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.
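
The error suggests that the NVIDIA container runtime hook, auto-detected here as 'legacy' mode, tries to bind-mount host driver utilities such as /usr/bin/nvidia-smi into the new container; in a DinD setup the "host" seen by the inner dockerd is the outer dstackai/dind container, so if the binary isn't present there the mount fails. A hedged diagnostic sketch, to be run in the same SSH session (libnvidia-container may or may not be installed in the image):

# check whether the utilities the hook tries to mount exist where it looks for them
ls -l /usr/bin/nvidia-smi /usr/bin/nvidia-persistenced /usr/bin/nv-fabricmanager

# if libnvidia-container is available, show what it detects and would mount
nvidia-container-cli info || true
nvidia-container-cli list --binaries || true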

Expected behaviour

The container should run, with access to all the GPUs of the host.

dstack version

0.18.18

Server logs

No response

Additional information

No response

mtaran added the bug label Oct 17, 2024
@peterschmidt85
Contributor

@mtaran Is the issue still relevant? You mentioned you made it work.

@un-def
Collaborator

un-def commented Oct 18, 2024

As TensorDock is “a marketplace of independent hosts”, the setups are not consistent at all.

In addition to

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-smi: no such file or directory: unknown.

I've got

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nv-fabricmanager: no such file or directory: unknown

and

nvidia-container-cli: mount error: mount operation failed: /usr/bin/nvidia-persistenced: no such file or directory: unknown

when requesting instances with the same resources configuration.

NVIDIA/CUDA driver versions also vary:

NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

and

NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4   

on two NVIDIA RTX A4000 instances in the same region.
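
A hedged sketch of commands one could run on each instance to capture these host-to-host differences (assumes nvidia-smi is present on the host):

# record driver and GPU details for comparison across hosts
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# check which NVIDIA userspace utilities the host actually ships
which nvidia-smi nvidia-persistenced nv-fabricmanager || true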
