Fixes Kubernetes Service using expired credentials #50074

tigrato · 2024-12-11T17:12:31Z

The Kubernetes service occasionally fails to forward requests to EKS clusters or retrieve the cluster schema due to AWS rejecting the request with an "expired token" error.

EKS access tokens are generated using STS presigned URLs, which include details such as the cluster, backend credentials, and assumed roles. By default, these tokens are valid for 15 minutes, and the Kubernetes service refreshes them every $(15 - 1)/2 = 7$ minutes. However, our cloud SDK caches the underlying `aws.Session`, particularly those with assumed roles, for 15 minutes.

As a result, the token is refreshed a second time at approximately 14 minutes, just before the 15-minute mark, still using the cached session. If the credentials held by that session expire before the next token refresh (they were reused from the previous query and have been cached since), the Kubernetes Service still considers the token valid, because the token is only a Base64-encoded presigned URL and carries no information about the credentials' expiry. The AWS EKS cluster, however, rejects the request because the credentials have expired.

This PR adds an option to disable the session cache for EKS STS token signing, so a new session is created each time a token is signed for an EKS cluster.

Below is the error message EKS returns:

```
2024-12-09T17:00:15Z ERRO [KUBERNETE] Failed to update cluster schema error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:15Z proxy/cluster_details.go:210
2024-12-09T17:00:24Z ERRO [KUBERNETE] Failed to update cluster schema  error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:24Z proxy/cluster_details.go:210
```

Changelog: Fixes an intermittent EKS authentication failure when dealing with EKS auto-discovery.

Fixes #50072

Signed-off-by: Tiago Silva

public-teleport-github-review-bot · 2024-12-12T22:56:36Z

@tigrato See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v15 | Failed |
| branch/v16 | Create PR |
| branch/v17 | Create PR |

tigrato added backport/branch/v15 backport/branch/v16 backport/branch/v17 labels Dec 11, 2024

github-actions bot requested review from kopiczko and zmb3 December 11, 2024 17:13

github-actions bot added kubernetes-access size/sm labels Dec 11, 2024

tigrato requested review from creack and rosstimothy December 12, 2024 16:15

creack approved these changes Dec 12, 2024

rosstimothy approved these changes Dec 12, 2024

public-teleport-github-review-bot bot removed request for kopiczko and zmb3 December 12, 2024 21:07

tigrato added this pull request to the merge queue Dec 12, 2024

Merged via the queue into master with commit 5ab4b69 Dec 12, 2024
42 checks passed

tigrato deleted the tigrato/fix-eks-intermitent-issue branch December 12, 2024 22:54

This was referenced Dec 13, 2024

[v17] Fixes Kubernetes Service using expired credentials #50197

Merged

[v16] Fixes Kubernetes Service using expired credentials #50198

Merged

tigrato mentioned this pull request Jan 8, 2025

Migrate eks discovery to aws sdk v2 #50603

Open
