containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991

Open
chiragjn opened this issue Sep 13, 2024 · 2 comments

chiragjn commented Sep 13, 2024

Original context and journalctl logs here: containerd/containerd#10437

As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd for the patched containerd config to take effect. Unfortunately, because gpu-operator schedules its daemonsets all at once, we have noticed that our GPU discovery and nvidia device plugin pods get stuck in Pending forever. This is primarily because the config-manager-init container gets stuck in the Created state and never transitions to Running due to the containerd restart.

Timeline of the race condition:

  • nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
  • nvidia-device-plugin waits for the toolkit-ready file via an init container
  • nvidia-container-toolkit patches the containerd config to add the nvidia runtime
  • It sends SIGHUP to containerd and writes the toolkit-ready file
  • The config-manager-init container of the nvidia-device-plugin pod enters the Created state
  • containerd restarts
  • config-manager-init is stuck in Created forever, so the device plugin never gets to start (a quick check for this symptom is sketched below)
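
For reference, a quick way to see the stuck state (the namespace and label below are assumptions for a default gpu-operator install; adjust for your setup):

    # List device-plugin pods together with the state of their init containers;
    # affected pods sit with the init container in a Waiting state indefinitely.
    kubectl -n gpu-operator get pods -l app=nvidia-device-plugin-daemonset \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.initContainerStatuses[*].state}{"\n"}{end}'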

Today the only way for us to recover is to manually delete the stuck daemonset pods.
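
Roughly, the manual recovery looks like this (namespace and label are again assumptions; adjust for your install):

    # Delete the stuck Pending pods so the daemonset controller recreates them
    # once containerd is back up.
    kubectl -n gpu-operator delete pod \
      -l app=nvidia-device-plugin-daemonset \
      --field-selector=status.phase=Pending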

While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking at entrypoint and node label hacks. We are willing to take a solution that allows us to modify the entrypoint configmaps of the daemonsets managed by ClusterPolicy.
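
To sketch the kind of entrypoint hack we mean (purely illustrative; the socket path and wait loop are assumptions, not something gpu-operator ships): a wrapper that blocks until the containerd socket is back before exec'ing the original entrypoint.

    #!/bin/sh
    # Hypothetical wrapper entrypoint: wait for the containerd socket to reappear
    # after the restart, then hand off to the original command unchanged.
    SOCK=/run/containerd/containerd.sock
    until [ -S "$SOCK" ]; do
      echo "waiting for containerd socket at $SOCK ..."
      sleep 2
    done
    exec "$@"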

I think something similar was discovered here, with a different effect, and was fixed with a sleep: 963b8dc

P.S. I am aware that container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.

cc: @klueska

ekeih commented Oct 10, 2024

Hi,

We are seeing the same issue with the gpu-operator-validator daemonset.

We found in the logs of the nvidia-container-toolkit-daemonset that it modifies /etc/containerd/config.toml and then sends a SIGHUP to containerd:

nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Successfully signaled containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Completed 'setup' for containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Waiting for signal"

Then, in the middle of the creation of one of the init containers of the gpu-operator-validator daemonset, the kubelet fails to communicate with the containerd socket because containerd restarts.
After a bunch of transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused errors from the kubelet, we see the following in our journald log:

Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopping Kubernetes Kubelet...
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped containerd container runtime.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Starting Load NVIDIA kernel modules...

It looks like systemd also decides to restart containerd after it should already have been restarted by the SIGHUP. We are unsure why this happens.
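
In case it helps correlate the two restarts, this is roughly what we plan to capture on an affected node (assuming a systemd-based host; timestamps taken from the journal excerpt above):

    # Check containerd's configured restart policy and holdoff interval ...
    systemctl show containerd -p Restart -p RestartUSec
    # ... and pull the interleaved containerd/kubelet journal around the SIGHUP.
    journalctl -u containerd -u kubelet -o short-precise \
      --since "2024-10-10 10:45:00" --until "2024-10-10 10:47:00"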

The stuck pod shows Warning Failed 24m kubelet Error: error reading from server: EOF in its events, and the pod status shows the following for the plugin-validation init container:

    State:          Waiting
    Ready:          False
    Restart Count:  0

We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information.

Thanks in advance for any help :)

justinthelaw commented Nov 7, 2024

I am also experiencing something similar when attempting a test/dev deployment on K3d (which uses a K3s-cuda base image).

As part of the nvidia-container-toolkit container's installation of the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster, since containerd.service is restarted at the node's system level.

If we disable the toolkit (toolkit.enabled: false) in the deployment and instead install the toolkit directly on the node, it no longer cycles the entire cluster and everything works fine.
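
For reference, this is roughly how the workaround deployment looks (standard NVIDIA Helm repo; all other chart values omitted here):

    # Deploy gpu-operator without the managed toolkit; the toolkit is instead
    # baked into the node image (e.g. the K3s CUDA image).
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace \
      --set toolkit.enabled=false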
