Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Kubernetes Service using expired credentials #50074

Merged
merged 1 commit into from
Dec 12, 2024

Conversation

tigrato
Copy link
Contributor

@tigrato tigrato commented Dec 11, 2024

The Kubernetes service occasionally fails to forward requests to EKS clusters or retrieve the cluster schema due to AWS rejecting the request with an "expired token" error.

EKS access tokens are generated using STS presigned URLs, which include details such as the cluster, backend credentials, and assumed roles. By default, these tokens are valid for 15 minutes, and the Kubernetes service refreshes them every $(15 - 1) / 2 = 7\text{ }minutes$. However, our cloud SDK caches the underlying aws.Session, particularly those with assumed roles, for 15 minutes.

This leads to a scenario where the token is refreshed a second time at approximately 14 minutes, close to the token's 15-minute validity. If the underlying credentials expire before the next token refresh, given that they were reused from the previous query and cached since then, it results in the Kubernetes Service considering the token valid (since it is a Base64-encoded presigned URL without knowledge about the credentials), but AWS EKS cluster rejects the request, treating the credentials as expired.

This PR adds an option to disable cache for EKS STS token signing which results in creating a session per EKS cluster sign process.

Bellow one can find the error message EKS returns.

2024-12-09T17:00:15Z ERRO [KUBERNETE] Failed to update cluster schema error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:15Z proxy/cluster_details.go:210
2024-12-09T17:00:24Z ERRO [KUBERNETE] Failed to update cluster schema  error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:24Z proxy/cluster_details.go:210

Changelog: Fixes an intermittent EKS authentication failure when dealing with EKS auto-discovery.

Fixes #50072

The Kubernetes service occasionally fails to forward requests to EKS clusters or retrieve the cluster schema due to AWS rejecting the request with an "expired token" error.

EKS access tokens are generated using STS presigned URLs, which include details such as the cluster, backend credentials, and assumed roles. By default, these tokens are valid for 15 minutes, and the Kubernetes service refreshes them every $(15 - 1) / 2 = 7\text{ }minutes$.
However, our cloud SDK caches the underlying `aws.Session`, particularly those with assumed roles, for 15 minutes.

This leads to a scenario where the token is refreshed a second time at approximately 14 minutes, close to the token's 15-minute validity. If the underlying credentials expire before the next token refresh, given that they were reused from the previous query and cached since then, it  results in the Kubernetes Service considering the token valid (since it is a Base64-encoded presigned URL without knowledge about the credentials), but AWS EKS cluster rejects the request, treating the credentials as expired.

This PR adds an option to disable cache for EKS STS token signing which results in creating a session per EKS cluster sign process.

Bellow one can find the error message EKS returns.
```
2024-12-09T17:00:15Z ERRO [KUBERNETE] Failed to update cluster schema error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:15Z proxy/cluster_details.go:210
2024-12-09T17:00:24Z ERRO [KUBERNETE] Failed to update cluster schema  error:[
ERROR REPORT:
Original Error: *errors.StatusError the server has asked for the client to provide credentials
Stack Trace:
	github.com/gravitational/teleport/lib/kube/proxy/scheme.go:140 github.com/gravitational/teleport/lib/kube/proxy.newClusterSchemaBuilder
	github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:193 github.com/gravitational/teleport/lib/kube/proxy.newClusterDetails.func1
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: the server has asked for the client to provide credentials] pid:7.1 start_time:2024-12-09T17:00:24Z proxy/cluster_details.go:210
```

Changelog: Fixes an intermittent EKS authentication failure when dealing with EKS auto-discovery.

Signed-off-by: Tiago Silva <[email protected]>
@tigrato tigrato added this pull request to the merge queue Dec 12, 2024
Merged via the queue into master with commit 5ab4b69 Dec 12, 2024
42 checks passed
@tigrato tigrato deleted the tigrato/fix-eks-intermitent-issue branch December 12, 2024 22:54
@public-teleport-github-review-bot

@tigrato See the table below for backport results.

Branch Result
branch/v15 Failed
branch/v16 Create PR
branch/v17 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discovered EKS clusters intermittently throw the server has asked for the client to provide credentials
3 participants