Node mac address not updated in SNAT info tables after VM image update #1204

Open
robinvalk opened this issue Oct 21, 2024 · 1 comment

Hi all,

We've encountered what we assume is a bug in the ACI CNI. (Please let us know if this is not the correct place to report non-provisioning-related issues.)

We have a native Kubernetes cluster running on vSphere VMs. During one of our VM image upgrade procedures we noticed that the ACI CNI doesn't update the SNAT info tables correctly: neither the global table nor the node tables are updated with the node changes after the template upgrade.

The upgrade procedure that triggers the bug looks as follows:

  • We taint the node we want to upgrade to remove any workloads from it.
  • We delete the node's VM.
  • Kubernetes recognises that the node is offline and marks it as not-ready.
  • We create a new VM instance using a new VM template image version.
  • The node VM comes back online and announces itself to the cluster. As it has the same name etc., the Kubernetes cluster recognises that this node is back online and marks it as ready again.
  • The ACI CNI detects that the node is back online and checks what needs to be updated.
  • It recognises that the same node is back online and, apparently as an optimisation, it doesn't update any record in the SNAT info tables. Because of this, the MAC address in the info table is never updated!
  • We untaint the node so that workloads are scheduled to it again.

The result is that SNAT response packets are no longer redirected to pods running on this node. The other nodes assume that the upgraded node still has the old MAC address (because of the ACI CNI config update optimisation). The ACI CNI recognises that the node has been updated but chooses not to act on it. We think this issue could be fixed by always acting on node updates.

Luckily we've found a workaround that ensures the SNAT table gets the correct node MAC address. Instead of only tainting the node to remove the workloads, we completely delete the node object from the cluster. This causes the ACI CNI to remove the node's entry from the SNAT tables. Once the node comes back online and registers itself, it is simply seen as a new node, and a new entry (with the correct MAC address) is added to the SNAT table.
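
To make the suspected behaviour concrete, here is a minimal toy model of it in Python. It is not the ACI CNI code: the `SnatTable` class, its method names, the node name and the MAC addresses are all hypothetical, and the "skip known nodes" handler is only our assumption about what the optimisation does. It compares three cases: the current upgrade procedure, the delete-the-node workaround, and a handler that compares the MAC address on every node update.

```python
from dataclasses import dataclass, field


@dataclass
class SnatTable:
    """Toy stand-in for a SNAT info table: node name -> MAC address."""
    entries: dict = field(default_factory=dict)

    def on_node_update_skip_known(self, name, mac):
        # Assumed current behaviour: a node name that already has an
        # entry is treated as unchanged, so the MAC is never rewritten.
        if name in self.entries:
            return False
        self.entries[name] = mac
        return True

    def on_node_update_compare_mac(self, name, mac):
        # Suggested behaviour: rewrite the entry whenever the MAC differs.
        if self.entries.get(name) == mac:
            return False
        self.entries[name] = mac
        return True

    def on_node_delete(self, name):
        # Deleting the node object removes its entry from the table.
        self.entries.pop(name, None)


def demo():
    old_mac, new_mac = "00:50:56:aa:00:01", "00:50:56:aa:00:02"
    results = {}

    # 1. Taint-only upgrade: same node name returns with a new MAC.
    table = SnatTable({"worker-1": old_mac})
    table.on_node_update_skip_known("worker-1", new_mac)
    results["taint_only"] = table.entries["worker-1"]

    # 2. Workaround: delete the node object, then let it re-register.
    table = SnatTable({"worker-1": old_mac})
    table.on_node_delete("worker-1")
    table.on_node_update_skip_known("worker-1", new_mac)
    results["delete_then_register"] = table.entries["worker-1"]

    # 3. Possible fix: compare the MAC on every node update.
    table = SnatTable({"worker-1": old_mac})
    table.on_node_update_compare_mac("worker-1", new_mac)
    results["compare_mac"] = table.entries["worker-1"]

    return results


if __name__ == "__main__":
    for case, mac in demo().items():
        print(f"{case}: {mac}")
```

In this model only the taint-only case keeps the old MAC address, which matches what we observe in the cluster.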

SNAT global info table CRD: aci.snat/v1 snatglobalinfos
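
For anyone who wants to check whether they are affected, below is a rough sketch of how stale entries could be detected from a JSON dump of the global info object. The field names `spec.globalInfos` and `macAddress` are our assumption about the CRD layout and may need adjusting, and `find_stale_macs` as well as the sample data are purely illustrative. The actual MAC address per node has to be collected separately, for example from the VMs themselves.

```python
def find_stale_macs(snat_global_info, actual_macs):
    """Return (node, recorded_mac, actual_mac) for every mismatching entry.

    snat_global_info: parsed JSON of one snatglobalinfos object
                      (assumed layout: spec.globalInfos[node] -> list of dicts
                      that each carry a "macAddress" key).
    actual_macs:      dict of node name -> MAC address currently on the node.
    """
    stale = []
    global_infos = snat_global_info.get("spec", {}).get("globalInfos", {})
    for node, infos in sorted(global_infos.items()):
        actual = actual_macs.get(node)
        if actual is None:
            # No ground truth for this node, so nothing to compare against.
            continue
        for info in infos:
            recorded = info.get("macAddress", "")
            # MAC addresses are case-insensitive hex strings.
            if recorded.lower() != actual.lower():
                stale.append((node, recorded, actual))
    return stale


if __name__ == "__main__":
    # Illustrative data only: worker-1 was rebuilt and got a new MAC.
    sample = {
        "spec": {
            "globalInfos": {
                "worker-1": [{"macAddress": "00:50:56:aa:00:01"}],
                "worker-2": [{"macAddress": "00:50:56:AA:00:10"}],
            }
        }
    }
    actual = {
        "worker-1": "00:50:56:aa:00:02",
        "worker-2": "00:50:56:aa:00:10",
    }
    for node, recorded, current in find_stale_macs(sample, actual):
        print(f"{node}: table has {recorded}, node has {current}")
```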

@fwardzic fwardzic self-assigned this Nov 4, 2024
Contributor

fwardzic commented Nov 4, 2024

Hi Robin,
Thanks for submitting this issue with great details. We are looking into it.
