Race issue after node reboot #1221
Comments
Just an update: using `-f` in the copy command looks like it fixes the issue.
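For context on why `-f` can help: a plain `cp` opens the destination for writing, which on Linux fails with `ETXTBSY` ("Text file busy") while the old shim binary is still executing; GNU `cp -f` instead removes the destination and retries when the open fails. A minimal stand-alone sketch (a copied `sleep` binary stands in for `multus-shim`; paths and names are illustrative, not Multus's actual script):

```shell
#!/bin/sh
# Sketch: show that cp -f can replace a binary that is currently
# executing. The copied `sleep` binary stands in for multus-shim.
tmpd=$(mktemp -d)
cp "$(command -v sleep)" "$tmpd/sleep"

"$tmpd/sleep" 3 &               # simulate crio keeping the shim running

# A plain cp would open and truncate the running binary, which can fail
# with ETXTBSY; with -f, cp removes the destination and retries, so the
# replacement succeeds.
cp -f "$(command -v sleep)" "$tmpd/sleep"

wait                            # let the stand-in shim exit
```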
Coincidentally, we also saw this error crop up yesterday with one of our edge clusters after rebooting.
As an FYI, I see that different deployment YAMLs use different ways to copy the CNI binary in the init container, although I'm not sure that copying the file atomically will solve the above issue. See: multus-cni/deployments/multus-daemonset.yml, line 207 in 8e5060b
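The atomic-copy pattern mentioned above amounts to copying the new binary to a temporary name in the target directory and then renaming it over the old one. This is a hypothetical sketch, not Multus's actual entrypoint script; the function name and paths are illustrative:

```shell
#!/bin/sh
# Hypothetical helper: install a CNI binary without ever opening the
# currently-executing one for writing. rename(2) within a single
# filesystem is atomic, so callers see either the old binary or the new
# one, never a partial file.
install_shim() {
    src="$1"        # e.g. a bundled multus-shim binary
    dest_dir="$2"   # e.g. /opt/cni/bin on the host
    cp "$src" "$dest_dir/multus-shim.tmp"
    mv -f "$dest_dir/multus-shim.tmp" "$dest_dir/multus-shim"
}
```

Because `mv` renames rather than writes into the existing file, it avoids `ETXTBSY` even while crio has the old shim running; whether it resolves every variant of the race reported in this issue is, as noted above, not certain.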
This should hopefully be addressed with #1213
Saw this in minikube today. No rebooting, just starting up a new minikube cluster.
I also got a reproduction after rebooting a node and having Multus restart. I mitigated it by deleting the shim binary.
It seems I can make this happen any time I ungracefully restart a node, worker or master: it produces this error and completely stops pod network sandbox recreation on that node. The fix mentioned above does work, but this means a power outage of a node will require manual intervention, whereas without Multus it would not. This error should be handled properly.
+1. This seems like a pretty serious issue. Can we get a fix merged for it soon, please?
Can additionally confirm this behavior. As @dougbtv mentioned, removing the shim binary works.
+1, happened to me as well; the cluster did not come up. Any chance of fixing this soon?
Same here, on a Kubespray 1.29 cluster.
This certainly needs to be fixed right away.
@dougbtv: Hit exactly the same issue. Deleting /opt/cni/bin/multus-shim helps. When could this be fixed?
Hit the same issue with kube-ovn. Already posted it there (kubeovn/kube-ovn#4470)
Hi, it looks like there is a race in Multus after a node reboot that can prevent pods from starting.
The problem is mainly that, after the reboot, crio calls multus-shim to start pods, but the Multus pod itself cannot start because its init container fails to cp the shim into place.
The copy fails because crio has already invoked the shim, which is stuck waiting to communicate with the Multus pod.
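The failure mode can be reproduced outside Kubernetes: on Linux, opening a binary for writing while it is being executed fails with `ETXTBSY` ("Text file busy"), which is what the init container's plain `cp` runs into. A minimal sketch (a copied `sleep` binary stands in for the stuck shim; the names are illustrative):

```shell
#!/bin/sh
# Sketch: while a binary is executing, a plain cp over it fails with
# ETXTBSY, mirroring the init container's failed copy of multus-shim.
tmpd=$(mktemp -d)
cp "$(command -v sleep)" "$tmpd/sleep"

"$tmpd/sleep" 3 &               # stand-in for the shim crio is holding
if cp "$(command -v sleep)" "$tmpd/sleep" 2>/dev/null; then
    result="copy succeeded"
else
    result="copy failed: text file busy"
fi
echo "$result"
wait
```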