CNI token refresh can take extra 6hrs to happen after expiration #9235

Open
huutomerkki opened this issue Sep 17, 2024 · 0 comments · May be fixed by #9539


Expected Behavior

Expired tokens are refreshed within an acceptable timeframe, e.g. a few minutes.

Current Behavior

Currently, the token refresher sleeps for (exp - now)/4, i.e. up to 6 hrs for a 24-hour token. If the token expires while the refresher is asleep, it can only be refreshed by restarting all pods; otherwise cluster networking is down until the sleep completes. These values are hardcoded in the token refresher.

Technically, other durations could be supplied via the serviceaccount/token file, but that file is only consulted when the token request API is unavailable:

    logrus.WithError(err).Debug("Unable to create token for CNI kubeconfig as token request api is not supported, falling back to local service account token")

Possible Solution

This could be fixed by either:

  1. making the hardcoded values configurable, with a separate configuration option or by (optionally) preferring the durations from the token file, or
  2. refreshing immediately after a failed request.

Steps to Reproduce (for bugs)

  1. Get a cluster with calico installed
  2. Stop the NTP service
  3. Shift the clock forward by more than 24 hrs
  4. Observe that no pods come up or down; the calico plugin connection is unauthorized

From the code we can deduce that calico is sleeping and will refresh the token in roughly 6 hrs.

Context

This issue was noticed in testing, where it causes the tests to take 6 hrs to complete. The same behavior can occur in production if the token for some reason expires earlier than expected.

Your Environment

For the reproduction I set up a fresh VM with Ubuntu 24, installed chrony, and followed https://docs.tigera.io/calico/latest/getting-started/kubernetes/kind to create a Kind cluster with calico.
