Using nvidia-container-cli with rootless gVisor #288

Closed
sfc-gh-lshi opened this issue Nov 8, 2024 · 1 comment

sfc-gh-lshi commented Nov 8, 2024

I'm attempting to use GPUs with gVisor in rootless mode (for now, the container just runs nvidia-smi), and I am running into issues originating from nvidia-container-cli configure. Various sources suggest that using GPUs in rootless mode is possible with Docker and Podman, though I've also found issues like #49 where it hasn't worked with runc.

nvidia-container-cli configure is invoked in gVisor at https://github.com/google/gvisor/blob/54359c5b5fbb354f52866e0ff745b09543af2fc9/runsc/container/container.go#L2023-L2031; you can see that --no-cgroups is already passed.

To get more information for debugging, I built gVisor myself and prefixed the nvidia-container-cli configure invocation with /usr/bin/strace -f.

I started with this command to run a container.

$ unshare -Ur runsc --nvproxy --strace --debug --debug-log=/tmp/rootless-logs/runsc.log \
    --network=host --host-uds=all --rootless --ignore-cgroups run "container"
[pid 347931] setgroups(1, [65534])      = -1 EPERM (Operation not permitted)

The same error was observed in #104 and resolved by making perm_drop_privileges return 0. I believe a simpler route is to add the "--user=root:root" flag when invoking nvidia-container-cli. I tried both, and in both cases the error then became this for me:

[pid 351446] setns(3, CLONE_NEWNS)      = -1 EPERM (Operation not permitted)
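
For reference, the --user=root:root variant mentioned above could look roughly like the sketch below. It only prints a command line rather than running anything. The rootfs path is a hypothetical placeholder, and the other arguments gVisor passes to configure are omitted, so this is not the exact invocation gVisor builds.

```shell
# Sketch only: prints the command line instead of executing it.
# $1 is the container rootfs (hypothetical placeholder in the example call).
build_configure_cmd() {
    # --user is a global nvidia-container-cli option, so it goes before the
    # "configure" subcommand; --no-cgroups is the flag gVisor already passes.
    printf '%s\n' "nvidia-container-cli --user=root:root configure --no-cgroups $1"
}

build_configure_cmd /path/to/rootfs
```

Keeping the user as root:root means nvidia-container-cli has no reason to switch groups, which is presumably why the setgroups failure goes away.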

To resolve this, I added -m to unshare, ending up with an error in one of the last steps of configure:

$ unshare -mUr runsc --nvproxy --strace --debug --debug-log=/tmp/rootless-logs/runsc.log \
    --network=host --host-uds=all --rootless --ignore-cgroups run "container"
[pid 351749] mount(NULL, "/proc", "proc", MS_RDONLY, NULL) = -1 EPERM (Operation not permitted)

The code flow to this mount is as follows:

Unfortunately, I'm stuck here and out of ideas. Is there anything else I can try to make this work? Were any of my previous steps incorrect?

Additional Info

For specifics about how I set up the container, please see google/gvisor#11069.

sfc-gh-lshi (author) commented

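It seems we can avoid the final mount error by not using the host ldconfig. In other words, gVisor should pass '/sbin/ldconfig.real' instead of '@/sbin/ldconfig.real' (the leading @ tells nvidia-container-cli to run the host's ldconfig rather than the one inside the container's root filesystem).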

With that, this works!

$ unshare -mUr runsc --nvproxy --strace --debug --debug-log=/tmp/rootless-logs/runsc.log --network=host --host-uds=all --rootless --ignore-cgroups run "container"
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4fc4b8a1-42c0-3b68-1802-afa12592d554)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-311ad3a2-9f0f-d3b4-e5ee-05fb4938f421)
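
The ldconfig change described above amounts to dropping the leading @ from the path before it is handed to nvidia-container-cli. A minimal sketch of that transformation follows. The helper name container_ldconfig is hypothetical, and in gVisor itself the change would be made in the Go code that builds the arguments, not in shell.

```shell
# Strip a leading '@' (host-ldconfig marker) from an ldconfig path, if present.
# Paths without the marker are returned unchanged.
container_ldconfig() {
    printf '%s\n' "${1#@}"
}

container_ldconfig '@/sbin/ldconfig.real'
```

This only works if the container image ships its own /sbin/ldconfig.real, since the path is then looked up inside the container's root filesystem.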
