We've encountered what we assume is a bug in the ACI CNI. (Please let us know if this is not the correct place to report non-provisioning-related issues.)
We have a native Kubernetes cluster running on VMs on vSphere. During an upgrade of the VM template images we noticed that the ACI CNI doesn't update the SNAT info tables correctly: neither the global table nor the per-node tables are updated with node changes after the template upgrade.
The upgrade procedure that triggers the bug looks as follows (a client-go sketch of the cluster-side steps follows the list):
1. We taint the node we want to upgrade so that all workloads are evicted from it.
2. We delete the node's VM.
3. Kubernetes notices that the node is offline and marks it NotReady.
4. We create a new VM instance from a new VM template image version.
5. The node VM comes back online and announces itself to the cluster. Because it has the same name (and other identifying attributes), Kubernetes recognises it as the same node and marks it Ready again.
6. The ACI CNI detects that the node is back online and checks what needs to be updated.
7. It recognises that the same node is back online and, because of an apparent optimisation, doesn't update any record in the SNAT info tables. As a result, the MAC address in the info tables is never updated.
8. We untaint the node so workloads are scheduled on it again.
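For reference, the cluster-side part of this procedure looks roughly like the following client-go sketch. The node name and taint key are placeholders, not values from our setup:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	node, err := cs.CoreV1().Nodes().Get(ctx, "worker-1", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// NoExecute evicts the running pods, emptying the node before the VM
	// is deleted and recreated from the new template (steps 2-5 above).
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "upgrade.example.com/in-progress", // placeholder taint key
		Effect: corev1.TaintEffectNoExecute,
	})
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	// After the replacement VM rejoins and is Ready, the taint is removed
	// again (step 8) so workloads can be scheduled back onto the node.
}
```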
The result is that SNAT return packets are no longer redirected to pods running on this node: because of the ACI CNI's update optimisation, the other nodes assume the upgraded node still has its old MAC address. The ACI CNI sees the node update but chooses not to act on it. The issue could be fixed by simply always acting on node updates.
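To illustrate the suggestion, here is a minimal sketch of a node informer whose update handler always diffs old against new state instead of skipping nodes it already knows. This is not the actual aci-containers code, and the annotation key is made up:

```go
package main

import (
	"fmt"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode := oldObj.(*corev1.Node)
			newNode := newObj.(*corev1.Node)
			// "example.com/node-mac" is a hypothetical annotation key standing
			// in for wherever the CNI keeps per-node MAC state. The point is
			// to compare old vs. new on every update instead of
			// short-circuiting because a node with this name is already known.
			oldMAC := oldNode.Annotations["example.com/node-mac"]
			newMAC := newNode.Annotations["example.com/node-mac"]
			if oldMAC != newMAC {
				fmt.Printf("node %s: MAC %q -> %q, SNAT info must be rewritten\n",
					newNode.Name, oldMAC, newMAC)
			}
		},
	})
	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}
```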
Luckily, we've found a workaround that gets the correct node MAC address into the SNAT tables. Instead of only tainting the node to remove the workloads, we delete the Node object from the cluster entirely. This causes the ACI CNI to remove the node's entry from the SNAT tables. Once the node comes back online and registers itself, it is treated as a new node and a new entry (with the correct MAC address) is added to the SNAT table.
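In client-go terms the workaround is a plain Node deletion before the VM swap; a minimal sketch, with the node name again a placeholder:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Deleting the Node object (rather than only tainting it) makes the ACI
	// CNI drop the node's SNAT entries; when the replacement VM registers
	// under the same name it is treated as new and gets a fresh entry with
	// the correct MAC. Equivalent to `kubectl delete node worker-1`.
	if err := cs.CoreV1().Nodes().Delete(context.Background(), "worker-1",
		metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
}
```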
SNAT global info table CRD: aci.snat/v1 snatglobalinfos
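To check whether the entries actually carry the new MAC, the CRs can be listed through the dynamic client. We don't reproduce the CR's exact spec layout here, so this sketch just dumps the spec for manual inspection against `ip link` output on each node:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	gvr := schema.GroupVersionResource{
		Group: "aci.snat", Version: "v1", Resource: "snatglobalinfos",
	}
	// List across all namespaces; the ACI CNI typically keeps these CRs in
	// its own system namespace.
	list, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range list.Items {
		fmt.Printf("%s/%s: %v\n", item.GetNamespace(), item.GetName(),
			item.Object["spec"])
	}
}
```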