Following the gpu-operator documentation, the following happens (a sketch of the Helm settings involved follows this list):
gpu-operator will write its containerd config into /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
rke2 will pick it up as a template and render its dedicated containerd config: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
the cluster will not come up after a reboot, since the config provided by gpu-operator does not work with rke2
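Roughly what that deployment looks like, as a sketch only; the env values below come from the paths discussed in this issue and from RKE2's internal containerd socket, and are not quoted verbatim from the NVIDIA doc:

# Sketch: toolkit env pointing gpu-operator at RKE2's containerd paths.
# Values are illustrative, taken from the paths discussed in this issue.
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock'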
The most significant errors in the logs would be:
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Pod for etcd not synced (pod sandbox has changed), retrying"
Sep 13 14:08:23 rke2 rke2[10318]: time="2024-09-13T14:08:23Z" level=info msg="Waiting for API server to become available"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Sep 13 14:08:25 rke2 rke2[10318]: time="2024-09-13T14:08:25Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Following the RKE2 docs and passing only CONTAINERD_SOCKET works, since gpu-operator then writes its config (which does not work with RKE2) into /etc/containerd/config.toml, even though containerd is not installed at the OS level:
root@rke2:~# apt list --installed | grep containerd
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
root@rke2:~#
It looks like the containerd config provided by gpu-operator doesn't matter with RKE2, since RKE2 is able to detect nvidia-container-runtime and configure its own containerd config with the nvidia runtime class:
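Roughly what the rendered /var/lib/rancher/rke2/agent/etc/containerd/config.toml ends up containing (sketch only; exact section names and the runtime binary path depend on the RKE2/containerd version):

# Sketch of the runtime entry RKE2 adds when it detects nvidia-container-runtime.
# Section names and BinaryName may differ between RKE2/containerd versions.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
    BinaryName = "/usr/bin/nvidia-container-runtime"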
I'm fairly confident that using the 560 driver, or any driver covered in the product docs, is OK.
However, I'd like SME input from my teammates. When I followed the RKE2 doc, I found that I need to specify runtimeClassName, like the sample nbody workload does. I can't dictate what other people prefer, but I happen to dislike that approach.
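For anyone following along, "specify runtimeClassName" means something like the workload below; this is illustrative only, and the pod name, image, and args are placeholders rather than the exact sample from the docs:

# Illustrative GPU pod that opts into the nvidia runtime explicitly.
# Name, image, and args are placeholders, not the documented sample workload.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-nbody-test
spec:
  runtimeClassName: nvidia   # needed when nvidia is not containerd's default runtime
  restartPolicy: OnFailure
  containers:
  - name: nbody
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody   # placeholder image tag
    args: ["nbody", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1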
@mikemckiernan I think it's due to gpu-operator setting the nvidia runtime class as the default in containerd. RKE2 just adds another runtime, which in my opinion is the cleaner approach. I don't know why gpu-operator has this option; maybe it's for consistency with Docker? I remember that a long time ago I needed to install the nvidia runtime for Docker and change the default Docker runtime to nvidia to make it work.
If gpu-operator worked properly with RKE2, i.e. created a valid config.toml.tmpl, the nvidia runtime class would be the default when CONTAINERD_SET_AS_DEFAULT=true.
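For reference, "set as default" boils down to roughly this in the rendered containerd config (sketch, assuming the usual CRI plugin section names):

# Sketch: what CONTAINERD_SET_AS_DEFAULT=true amounts to in the rendered config.
# With this in place, pods get the nvidia runtime without setting runtimeClassName.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"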
The RKE2 docs only describe passing CONTAINERD_SOCKET for RKE2's internal containerd: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator
Nvidia's docs also cover CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2
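For comparison, the RKE2-style deployment reduces to something like the sketch below; I'm expressing it as Helm flags here, the actual RKE2 doc may use a HelmChart manifest instead, and only the socket is overridden:

# Sketch of the RKE2-docs approach: only CONTAINERD_SOCKET is set,
# and RKE2 itself adds the nvidia runtime to its own generated config.
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[0].value=/run/k3s/containerd/containerd.sock'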
Steps to reproduce on Ubuntu 22.04:
Following Nvidia's docs breaks the RKE2 cluster after reboot:
Following RKE2's docs works fine:
Could someone verify the docs?