
Flannel does not handle packets from other node properly #565

Closed
integral424 opened this issue Jun 22, 2019 · 9 comments

Comments

@integral424

Describe the bug
Flannel does not handle packets coming from other nodes properly.
This line of code may be wrong; it should be cni0 instead of cbr0.

To Reproduce

  • Install a k3s cluster (at least 2 nodes)
  • Deploy 2 pods, one on each node
  • Send a ping from one pod to the other pod

Expected behavior
Ping replies should come back from the pod on the other node. (Observed behavior: no reply from the other node.)

Additional context
After some investigation, I came to this conclusion:

  • k3s installs a cni0 bridge into the system
  • the flannel embedded in k3s requires the bridge name to be cbr0 (as it is hard-coded in this file; see the conflist sketch just below)
  • that conflict makes flannel fail to route packets from the other node to the bridge
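For reference, the relevant part of the 10-flannel.conflist that k3s generates looks roughly like this (reproduced from memory, so fields and values may differ slightly between k3s versions); note the hard-coded "name": "cbr0":

$ cat /var/lib/rancher/k3s/agent/etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    { "type": "flannel", "delegate": { "hairpinMode": true, "isDefaultGateway": true } },
    { "type": "portmap", "capabilities": { "portMappings": true } }
  ]
}
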
@vincentmli

Out of curiosity I tested setting up two k3s nodes with a pod running on each, and there was no network connectivity issue between the pods. Are you sure you set it up properly? I also have cni0 as the bridge instead of cbr0. You may need to provide more details on your setup and on how you concluded that it is a cbr0/cni0 bridge-name issue.

@integral424
Author

Thanks for your reply! 👍

Is the trouble only on my side? 😟
OK, let me provide details and the reasoning behind my conclusion.

Details on my setup

My cluster is composed of 3 nodes (1 master, 2 workers).
They are all connected to the same network switch (i.e. on the same L2 segment).

  • master node (agent disabled)
    • raspberry pi 3 model B
    • cpu arch: arm64
    • OS: Arch linux ARM (AArch64)
    • installed packages (related to k3s):
      • ebtables
      • ethtool
    • k3s (server) installation command: curl -sfL https://get.k3s.io | sh -s - server --node-ip 192.168.1.xxx --flannel-iface eth0 --cluster-domain="k8s.local" --data-dir="/var/lib/rancher/k3s" --disable-agent --no-deploy traefik
    • ip link output (veth pairs are truncated):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether b8:27:eb:28:10:45 brd ff:ff:ff:ff:ff:ff
  • worker node 1
    • liva x
    • cpu arch: x64
    • OS: Arch linux
    • installed packages (related to k3s):
      • ebtables
      • ethtool
    • k3s (agent) installation command: curl -sfL https://get.k3s.io | K3S_TOKEN=<token> K3S_URL=https://192.168.1.xxx:6443 sh -s - --node-ip 192.168.1.yyy --flannel-iface enp3s0
    • ip link output (veth pairs are truncated):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether b8:ae:ed:3f:7f:c4 brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 40:e2:30:e9:0a:4d brd ff:ff:ff:ff:ff:ff
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether 7a:bf:f5:9e:12:b3 brd ff:ff:ff:ff:ff:ff
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 5a:db:c3:9c:20:a7 brd ff:ff:ff:ff:ff:ff
  • worker node 2
    • home built computer
    • cpu arch: x64
    • OS: Arch linux
    • installed packages (related to k3s):
      • ebtables
      • ethtool
    • k3s (agent) installation command: curl -sfL https://get.k3s.io | K3S_TOKEN=<token> K3S_URL=https://192.168.1.xxx:6443 sh -s - --node-ip 192.168.1.zzz --flannel-iface enp2s0
    • ip link output (veth pairs are truncated):
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether d0:50:99:70:e1:22 brd ff:ff:ff:ff:ff:ff
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether 4a:7f:70:1d:4e:df brd ff:ff:ff:ff:ff:ff
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 96:50:f2:09:60:db brd ff:ff:ff:ff:ff:ff

Additional note

  • No version info for the OS (Arch linux), as it uses a rolling release model.

Reason for my conclusion

Simply because changing the bridge name from cbr0 to cni0 in /var/lib/rancher/k3s/agent/etc/cni/net.d/10-flannel.conflist resolved my issue.

Detailed description

  • In my cluster, with the setup above, inter-node communication failed (pings get no reply):
$ kubectl get -A po -o wide
NAMESPACE     NAME                                              READY   STATUS      RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
dev           test1                                             1/1     Running     0          18s     10.42.0.49      worker1  <none>           <none>
kube-system   coredns-695688789-nzgmx                           1/1     Running     2          6d20h   10.42.1.85      worker2  <none>           <none>

$ kubectl exec -it test1 -- ash --login
test1:/# ping 10.42.1.85
PING 10.42.1.85 (10.42.1.85): 56 data bytes
^C
--- 10.42.1.85 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
  • Edit /var/lib/rancher/k3s/agent/etc/cni/net.d/10-flannel.conflist to change cbr0 to cni0 (a one-liner for this edit is sketched after this list).
  • After that, the ping test described above succeeded:
    • without rebooting the machine
    • without restarting the k3s-agent service
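In shell form, the edit amounts to something like the following (assuming "cbr0" only appears in the "name" field of the default file; otherwise edit it by hand):

$ sudo sed -i 's/"cbr0"/"cni0"/' /var/lib/rancher/k3s/agent/etc/cni/net.d/10-flannel.conflist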

More verbose description of my investigation (too long, no need to read)

  • I suffered ingress trouble first (the ingress controller is traefik).
    • sometimes it works, sometimes it doesn't
    • in the bad case, accessing the page served by the ingress results in a gateway timeout error
  • I tried rebooting the machines, and found that it matters which node the traefik pod is deployed on.
  • When the traefik pod and the backend pod are deployed on different nodes, they fail to communicate with each other.
  • Using tcpdump, I found that vxlan-encapsulated packets are sent to the other node (the commands I used are sketched after this list).
    • ip route shows the packets are properly routed to flannel.1.
    • It appears flannel.1 is a blackhole.
  • With flannel now in doubt, I ran find /var/lib/rancher/k3s/agent | grep flannel to find files related to flannel.
    • 2 files are found:
      • /var/lib/rancher/k3s/agent/etc/cni/net.d/10-flannel.conflist
      • /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
    • The former has a "name": "cbr0" line. No such bridge exists. Is this line correct?
      • I am not convinced, as I don't understand the file's meaning.
    • The latter looks like there is nothing to fix.
  • As an experiment, I edited 10-flannel.conflist to change cbr0 to cni0.
  • The traefik problem was resolved!
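
For anyone retracing these steps, the commands I used were roughly the following (interface name enp3s0 as on worker node 1; 8472/udp is flannel's default VXLAN port):

# watch the vxlan-encapsulated traffic arriving on the physical NIC
$ sudo tcpdump -ni enp3s0 udp port 8472
# confirm the other node's pod subnet is routed via flannel.1
$ ip route | grep flannel.1
# list flannel-related files under the k3s agent directory
$ find /var/lib/rancher/k3s/agent | grep flannel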

@integral424
Author

It seems I should understand the meaning of 10-flannel.conflist... 😣

@integral424
Author

Sorry. I was wrong. 😞

Without editing 10-flannel.conflist, the problem was resolved by deleting /var/lib/cni/networks/cbr0/lock and executing sudo systemctl restart k3s-agent.
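
In shell form, the workaround boils down to (run on the affected worker node; k3s-agent is the systemd unit created by the install command above):

$ sudo rm /var/lib/cni/networks/cbr0/lock
$ sudo systemctl restart k3s-agent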

It seems changing cbr0 to cni0 in 10-flannel.conflist just worked around the lock-file issue.
(Because, by doing that, a fresh /var/lib/cni/networks/cni0 directory is created.)

Maybe, in my cluster, the lock file was not properly deleted when the system was rebooted.
I will refine my rebooting procedure.

Let me close this issue. Sorry.

@CasparChou

CasparChou commented Aug 2, 2019

Same issue for me! But I still haven't figured out how to solve it.

Should I create /var/lib/cni/networks/cni0 manually?

@icepic1984

icepic1984 commented Apr 12, 2021

@integral424 You are my hero. I am currently trying to set up a local kubernetes cluster (k3s v1.20.4) on two Pi 4s (one node and one master) using Arch Linux. Node and master are set up successfully and can see each other. Pings between pods running on the same node work. However, pings between master and node time out. I spent hours debugging the issue and came to the same conclusion: VXLAN packets are sent over the network, but the flannel.1 interface is a blackhole and does not deliver the packets to the destination pod. I was about to give up (after spending hundreds of dollars on equipment) but found this post by chance. Deleting /var/lib/cni/networks/cbr0/lock and restarting did the trick (I did it on both the master and the node). I would have never figured this one out.

Thank you so much for reporting back.

Unfortunately i don't have enough knowledge of k3s to find the root-cause. If someone has a hint, i am willing to dig deeper into this.

@vincentmli


@icepic1984, I'm not sure why deleting /var/lib/cni/networks/cbr0/lock would resolve your problem. Can you reproduce the issue? I am interested in this kind of silent packet-dropping issue :) I have quite a few videos showing how to track down this kind of network problem:

https://youtu.be/jAlgOcYidrE
https://youtu.be/9HNKRP7x57M

@ghost

ghost commented Jun 14, 2024

This very odd trick worked for me as well! Thanks @integral424

@DamianoSamperi

I have the same problem, but that trick doesn't work for me. Have you solved it somehow?
