Change CNI token refresh retry duration to one minute #9539

huutomerkki · 2024-11-28T11:43:18Z

Description

This PR changes the dynamic timing to constant once per minute refresh rate for the CNI token, as per the discussion in slack about the possible solution for the linked issue.

I have manually verified that following the reproduction steps with these changes included in the calico-node image fixes the issues described in #9235.

Related issues/PRs

Fixes #9235

Todos

[?] Tests
[?] Documentation
[?] Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

docs-pr-required: This change requires a change to the documentation that has not been completed yet.
docs-completed: This change has all necessary documentation completed.
docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

release-note-required: This PR has user-facing changes. Most PRs should have this label.
release-note-not-required: This PR has no user-facing changes.

Other optional labels:

cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Signed-off-by: Minna Färm <[email protected]>

CLAassistant · 2024-11-28T11:43:24Z

All committers have signed the CLA.

caseydavenport · 2024-12-02T22:29:59Z

@huutomerkki thanks for raising this PR!

I want to push a little bit on the aggressive timer here - for some of our larger scale users, this is going to result in a large increase in the number of TokenRequest API calls that Calico makes.

We have users running 5k node clusters (and even larger!), which will result 5k requests per 60 seconds, or 83 requests per second across the entire cluster! I think this is probably just not workable for that scale.

Some thoughts (in no particular order):

I think refreshing after a failed request is probably the ideal approach, but we need to be careful that we don't allow the CNI plugin to indefinitely propagate its own token. It would need to be some sort of privileged communication back from the CNI plugin to calico/node to trigger this, which might be tricky.
Perhaps just a larger timeout (say, every 1h instead of every 6h) would satisfy my concerns, but I think that wouldn't meet your goals IIRC.
Perhaps we just need this to be configurable, so that it can maintain its reasonable default of 6h but allow users with stricter requirements to configure it to be smaller. I don't love the idea of exposing config for this, as it's a fairly low level thing that most users don't need to care about, but sometimes that is required.

I think my preferred option would be to find a way to communicate back to the token refresher that it needs to generate a new token, but I think exposing such a channel is risky and has potential for abuse if we're not careful (imagine an attacker getting access and requesting non-stop token refresh attempts. We'd need a backstop to rate limit this for sure). This trigger would likely need to be something like a restricted UDS or file-based trigger that the CNI plugin can write to when needed in response to a permissions error - WDYT?

huutomerkki · 2024-12-04T10:47:47Z

@huutomerkki thanks for raising this PR!

I want to push a little bit on the aggressive timer here - for some of our larger scale users, this is going to result in a large increase in the number of TokenRequest API calls that Calico makes.

We have users running 5k node clusters (and even larger!), which will result 5k requests per 60 seconds, or 83 requests per second across the entire cluster! I think this is probably just not workable for that scale.

Some thoughts (in no particular order):
* I think refreshing after a failed request is probably the ideal approach, but we need to be careful that we don't allow the CNI plugin to indefinitely propagate its own token. It would need to be some sort of privileged communication back from the CNI plugin to calico/node to trigger this, which might be tricky.

* Perhaps just a larger timeout (say, every 1h instead of every 6h) would satisfy my concerns, but I think that wouldn't meet your goals IIRC.

* Perhaps we just need this to be configurable, so that it can maintain its reasonable default of 6h but allow users with stricter requirements to configure it to be smaller. I don't love the idea of exposing config for this, as it's a fairly low level thing that most users don't need to care about, but sometimes that is required.
I think my preferred option would be to find a way to communicate back to the token refresher that it needs to generate a new token, but I think exposing such a channel is risky and has potential for abuse if we're not careful (imagine an attacker getting access and requesting non-stop token refresh attempts. We'd need a backstop to rate limit this for sure). This trigger would likely need to be something like a restricted UDS or file-based trigger that the CNI plugin can write to when needed in response to a permissions error - WDYT?

I agree, minute as the refresh rate is a lot for clusters of that size. In the case I have, the requested max refresh time is around 5 minutes, which would be around 10x the current refresh rate over the 24h, and I think I also should increase the rate around the expected expiry time if any of this PR would be going in.

Refreshing the token after a failed request would be the ideal solution in my mind as well, but I think implementing that might be out of my league without some major support.

Adding a config for this would fulfill my goal, and I think it would be something that I could implement.

caseydavenport · 2024-12-12T16:14:39Z

Adding a config for this would fulfill my goal, and I think it would be something that I could implement.

Great, lets go with this as an approach for now and I will see about a more proactive refresh token in my spare time if I can manage it :)

For a config option, I suspect it would look something like this:

A new environment variable - e.g., CNI_TOKEN_REFRESH_DURATION
When unset, the existing behavior (1/4 of the token expiry) would be used.
When set, the variable would be respected instead.

Then we'd also want a corresponding change to the Installation API in the tigera/operator repository: https://github.com/tigera/operator/blob/master/api/v1/installation_types.go#L32-L211

This would be the end-user facing API that would configure the environment variable - we can talk about the specifics of what that looks like when we get there :)

Change CNI token refresh retry duration to one minute

98d855a

Signed-off-by: Minna Färm <[email protected]>

huutomerkki requested a review from a team as a code owner November 28, 2024 11:43

marvin-tigera added this to the Calico v3.30.0 milestone Nov 28, 2024

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Nov 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change CNI token refresh retry duration to one minute #9539

Change CNI token refresh retry duration to one minute #9539

huutomerkki commented Nov 28, 2024 •

edited

Loading

CLAassistant commented Nov 28, 2024 •

edited

Loading

caseydavenport commented Dec 2, 2024

huutomerkki commented Dec 4, 2024

caseydavenport commented Dec 12, 2024

Change CNI token refresh retry duration to one minute #9539

Are you sure you want to change the base?

Change CNI token refresh retry duration to one minute #9539

Conversation

huutomerkki commented Nov 28, 2024 • edited Loading

Description

Related issues/PRs

Todos

Release Note

Reminder for the reviewer

CLAassistant commented Nov 28, 2024 • edited Loading

caseydavenport commented Dec 2, 2024

huutomerkki commented Dec 4, 2024

caseydavenport commented Dec 12, 2024

huutomerkki commented Nov 28, 2024 •

edited

Loading

CLAassistant commented Nov 28, 2024 •

edited

Loading