Troubleshooting ovs-ovn Pod Crashes in a Performance Test Environment
The Chinese documentation under this Wiki is no longer maintained. Please visit our latest Chinese documentation site for the most recent updates.
In a performance test environment running a large number of Pods (around 10k), checking the Pod list shows many ovs-ovn Pods crashing, while the ovs-ovn Pods on some Nodes remain in the normal Running state.
[root@node10 ~]# kubectl get pod -n kube-system -o wide
ovs-ovn-572sv 1/1 Running 0 16m 10.0.128.224 10.0.128.224 <none> <none>
ovs-ovn-5hbkf 1/1 Running 0 29m 10.0.128.93 10.0.128.93 <none> <none>
ovs-ovn-7dzsz 0/1 CrashLoopBackOff 205 12d 10.0.128.83 10.0.128.83 <none> <none>
ovs-ovn-7tzzm 0/1 CrashLoopBackOff 203 12d 10.0.128.201 10.0.128.201 <none> <none>
ovs-ovn-8tvwk 0/1 Running 123 12d 10.0.128.243 10.0.128.243 <none> <none>
ovs-ovn-cb88d 0/1 Running 136 8d 10.0.128.198 10.0.128.198 <none> <none>
ovs-ovn-ck8s9 0/1 CrashLoopBackOff 199 12d 10.0.128.103 10.0.128.103 <none> <none>
ovs-ovn-cm65q 0/1 CrashLoopBackOff 127 12d 10.0.129.29 10.0.129.29 <none> <none>
ovs-ovn-fqq86 0/1 CrashLoopBackOff 90 7d23h 10.0.128.113 10.0.128.113 <none> <none>
ovs-ovn-kpn6h 1/1 Running 221 12d 10.0.129.12 10.0.129.12 <none> <none>
ovs-ovn-mclpn 0/1 Running 124 12d 10.0.128.35 10.0.128.35 <none> <none>
ovs-ovn-nnwbd 1/1 Running 275 12d 10.0.129.20 10.0.129.20 <none> <none>
ovs-ovn-stxxc 0/1 CrashLoopBackOff 234 12d 10.0.129.77 10.0.129.77 <none> <none>
ovs-ovn-v4sz9 1/1 Running 120 12d 10.0.129.154 10.0.129.154 <none> <none>
ovs-ovn-vhnpn 0/1 Running 225 12d 10.0.129.4 10.0.129.4 <none> <none>
ovs-ovn-xksl6 1/1 Running 0 20m 10.0.129.148 10.0.129.148 <none> <none>
Pick one of the crashing Pods and check its log; it reports a problem connecting to ovn-sb.
[root@node10 ~]# kubectl logs ovs-ovn-7dzsz -n kube-system
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
[root@node10 ~]#
The address 192.170.173.26:6642 is the Service address exposed by ovn-sb.
[root@node10 ~]# kubectl get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 192.170.0.10 <none> 53/UDP,53/TCP,9153/TCP 12d
kube-ovn-cni ClusterIP 192.170.119.10 <none> 10665/TCP 12d
kube-ovn-controller ClusterIP 192.170.164.145 <none> 10660/TCP 12d
kube-ovn-pinger ClusterIP 192.170.22.21 <none> 8080/TCP 12d
kube-prometheus-exporter-coredns ClusterIP None <none> 9153/TCP 12d
kube-prometheus-exporter-kube-controller-manager ClusterIP None <none> 10252/TCP 12d
kube-prometheus-exporter-kube-etcd ClusterIP None <none> 2379/TCP 12d
kube-prometheus-exporter-kube-proxy ClusterIP None <none> 10249/TCP 12d
kube-prometheus-exporter-kube-scheduler ClusterIP None <none> 10251/TCP 12d
kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 12d
ovn-nb ClusterIP 192.170.235.211 <none> 6641/TCP 12d
ovn-sb ClusterIP 192.170.173.26 <none> 6642/TCP 12d
[root@node10 ~]#
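Before digging into individual nodes, it is worth confirming that the ovn-sb Service actually has backends. A minimal check, assuming a standard Kube-OVN deployment in which ovn-central Pods serve ovn-nb/ovn-sb:
# Confirm the ovn-sb Service has endpoints and the backing Pods are Running.
# (That ovn-central backs ovn-sb is an assumption based on a standard
# Kube-OVN deployment.)
kubectl get ep ovn-sb -n kube-system
kubectl get pod -n kube-system -o wide | grep ovn-central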
kubectl describe on the Pod shows a large number of failed probe checks.
2021-01-07T02:22:01Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 8310 Alarm clock ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
Warning Unhealthy 63m kubelet, 10.0.128.83 Liveness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:01Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 8271 Alarm clock ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
Warning Unhealthy 63m kubelet, 10.0.128.83 Readiness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:06Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 8697 Alarm clock ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
Warning Unhealthy 63m kubelet, 10.0.128.83 Liveness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:06Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 8664 Alarm clock ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
Warning Unhealthy 62m (x17385 over 12d) kubelet, 10.0.128.83 (combined from similar events): Readiness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:23:36Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 16888 Alarm clock ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
Normal Pulled 36m (x11 over 63m) kubelet, 10.0.128.83 Container image "platforma-225.alauda.cn:60080/acp/kube-ovn:v1.3.0" already present on machine
Normal Killing 16m (x19 over 63m) kubelet, 10.0.128.83 Container openvswitch failed liveness probe, will be restarted
Warning BackOff 2m (x229 over 59m) kubelet, 10.0.128.83 Back-off restarting failed container
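The probes run /kube-ovn/ovs-healthcheck.sh, which, as the events above show, calls ovn-sbctl against the ovn-sb Service address with a 3-second timeout; signal 14 (Alarm clock) means that call timed out. The same check can be reproduced by hand; a sketch, reusing the container name openvswitch from the Killing event above:
# Re-run the probe's check manually inside the openvswitch container.
# If this also times out (SIGALRM), the Pod still cannot reach the
# ovn-sb Service address from this node.
kubectl exec -n kube-system ovs-ovn-7dzsz -c openvswitch -- \
  ovn-sbctl --db=tcp:[192.170.173.26]:6642 --timeout=3 show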
In this situation, check whether the problem is caused by kube-proxy not processing updates in time.
Filter the Pods on the affected node:
[root@node10 ~]# kubectl get pod -n kube-system -o wide| grep 10.0.128.83
kube-ovn-cni-snv5m 1/1 Running 0 12d 10.0.128.83 10.0.128.83 <none> <none>
kube-ovn-pinger-f27q5 1/1 Running 0 17h 192.172.0.82 10.0.128.83 <none> <none>
kube-proxy-9qdbm 1/1 Running 0 12d 10.0.128.83 10.0.128.83 <none> <none>
ovs-ovn-7dzsz 0/1 CrashLoopBackOff 207 12d 10.0.128.83 10.0.128.83 <none> <none>
[root@node10 ~]#
Check the kube-proxy Pod log; it contains a large number of entries caused by the OVS Pod restarts:
I0107 02:05:57.018826 1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.77:6642
W0107 02:18:20.362225 1 reflector.go:301] k8s.io/client-go/informers/factory.go:134: watch of *v1.Service ended with: too old resource version: 99775911 (100430249)
I0107 02:25:57.022763 1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.20:6642
I0107 02:25:57.022815 1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.12:6642
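The graceful_termination.go messages come from kube-proxy's IPVS proxier, which suggests kube-proxy is running in IPVS mode. Whether the rules on the affected node are current can be checked directly; a sketch, assuming ipvsadm is installed on the node:
# On node 10.0.128.83: list the IPVS real servers programmed for the ovn-sb
# Service VIP and compare them with the actual Service endpoints; stale or
# missing real servers indicate kube-proxy has fallen behind.
ipvsadm -Ln -t 192.170.173.26:6642
kubectl get ep ovn-sb -n kube-system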
Manually delete the kube-proxy and ovs-ovn Pods on this node, wait for them to be recreated, and check the status again.
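A sketch of the deletion step, using the Pod names from the listing above (both Pods are managed by DaemonSets in a typical deployment, so they are recreated automatically):
# Delete the kube-proxy and ovs-ovn Pods on node 10.0.128.83 and let their
# DaemonSets recreate them.
kubectl delete pod -n kube-system kube-proxy-9qdbm ovs-ovn-7dzsz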
[root@node10 ~]# kubectl get pod -n kube-system -o wide|egrep '10\.0\.12[8-9]\.[0-9]{1,3}' | grep 10.0.128.83
kube-ovn-cni-snv5m 1/1 Running 0 12d 10.0.128.83 10.0.128.83 <none> <none>
kube-ovn-pinger-f27q5 1/1 Running 0 17h 192.172.0.82 10.0.128.83 <none> <none>
kube-proxy-r9zqj 1/1 Running 0 33s 10.0.128.83 10.0.128.83 <none> <none>
ovs-ovn-99gbv 1/1 Running 2 111s 10.0.128.83 10.0.128.83 <none> <none>
[root@node10 ~]#
Since some ovs-ovn Pods are healthy while others crash, and every node runs the same image version, a problem in ovs-ovn itself is unlikely; kube-proxy is therefore the component to investigate.
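If the issue recurs, kube-proxy's rule-sync behaviour on the node can be watched over time via its metrics endpoint (port 10249, as shown in the Service listing above); a sketch, assuming the standard kube-proxy metrics are enabled:
# On the affected node: inspect how long kube-proxy takes to sync proxy rules.
# Long sync durations under heavy Service/Endpoint churn match the symptom seen here.
curl -s http://127.0.0.1:10249/metrics | grep sync_proxy_rules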