Context Canceled on all authentication attempts for 10 minutes after an API node has shutdown #184
Comments
Hi @Timothy-Dougherty, this sounds suspiciously similar to #169, so you might try reloading the plugin. Can you turn on trace level logging and post the logs here from a period where authentication attempts are failing?
In addition to the above, could you also share the config you are setting up the k8s auth method with? We do use a single HTTP client for all tokenreview requests, but my expectation would be that its connection gets shut down when the API node is shut down; it sounds like that connection is staying alive longer for some reason.
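For illustration, a minimal Go sketch (not the plugin's actual code; the host URL and values are placeholders) of how a single shared `http.Client` pools keep-alive connections, which is why a connection routed to a now-removed backend can keep being reused:

```go
// Illustrative sketch only: a single shared http.Client reuses pooled TCP/TLS
// connections across requests, so a connection established through a load
// balancer can outlive the backend node it was originally routed to.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		MaxIdleConnsPerHost: 2,
		IdleConnTimeout:     90 * time.Second, // pooled connections linger until this expires
	}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}

	// Each request below reuses an existing keep-alive connection if one is
	// still open, even when the node behind the load balancer has shut down
	// without the connection being torn down.
	for i := 0; i < 3; i++ {
		resp, err := client.Get("https://kube-api.example.com/healthz") // placeholder URL
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		resp.Body.Close()
	}
}
```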
It'll take me a bit to get the trace logs and try the reload, as I need to coordinate with another team to do so; hopefully in the next day or so. As for the config and load balancer: we use an AWS network load balancer which connects to the EC2 instances of the API nodes, and we have 3 API nodes which get replaced one after the other. The kubernetes_host value is a DNS record that resolves to the network load balancer.
It's worth noting that our NLB wasn't using an attribute called "Terminate connections on deregistration", which closes connections after a target has been deregistered. After turning that setting on and using a very low deregistration delay, the period of auth failure dropped from 10 minutes to under 1 minute. Perhaps this is expected behaviour? However, I would have assumed the plugin should drop connections from its cache that have timed out (maybe not immediately, but at least quicker than 10 minutes).
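As a sketch of the client-side behaviour being assumed here, this is how the idle timeout on Go's `http.Transport` governs when cached connections are discarded; the values are hypothetical, and the plugin does not necessarily expose such a knob:

```go
// Hypothetical mitigation sketch, not code from this plugin: IdleConnTimeout
// controls how long a cached keep-alive connection may sit unused before it is
// closed, which bounds how long a connection to a deregistered backend can
// keep being reused after an idle spell.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		// Go's default transport uses 90 seconds; a deliberately short value
		// forces the next request after an idle period to dial the load
		// balancer again and land on a healthy backend.
		IdleConnTimeout:     15 * time.Second,
		MaxIdleConnsPerHost: 2,
	}
	client := &http.Client{Transport: transport, Timeout: 20 * time.Second}

	fmt.Printf("idle connections will be dropped after %s\n", transport.IdleConnTimeout)
	_ = client
}
```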
Thanks, that's really helpful. I don't have a ton of direct experience with AWS NLBs, but it sounds like we're forming a good idea of what's going on. From here:
So it sounds like you've tried out option 1 currently. I'm not clear what the exact disconnection behaviour is for option 2, and option 3 sounds like it's probably left to the responsibility of the client (Vault in this case). During the failure period, I'd be interested to know if Vault sees a 502, 503, 504, or something else, and I think the debug logs should tell us. Perhaps we could trigger a reconnect/close the connection on one or some of those error codes.

EDIT: Also for my own reference, AWS LB TargetGroupAttribute reference docs here.
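A rough sketch of the idea floated above, flushing pooled connections when the load balancer answers with a gateway error; this is an assumption about one possible approach, not the plugin's current behaviour:

```go
// Sketch of a status-aware RoundTripper wrapper (hypothetical): when the load
// balancer returns 502/503/504, or the transport itself errors, drop the cached
// connections so the next attempt dials a healthy backend.
package main

import "net/http"

type gatewayAwareTransport struct {
	base *http.Transport
}

func (g *gatewayAwareTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := g.base.RoundTrip(req)
	if err != nil {
		// Transport-level failures also warrant discarding pooled connections.
		g.base.CloseIdleConnections()
		return nil, err
	}
	switch resp.StatusCode {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		g.base.CloseIdleConnections()
	}
	return resp, nil
}

func main() {
	client := &http.Client{Transport: &gatewayAwareTransport{base: &http.Transport{}}}
	_ = client
}
```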
Sorry to say the trace logs didn't seem to give much more information. There were trace logs reporting that the certificate pool didn't need an update, but the logs related to the auth failures were the ones I've mentioned above.
It would have been nice to see this in the trace logs, but our API servers send an HTTP/2 GOAWAY when they are in the process of shutting down. If the plugin handled those frames and closed the connections that received them, it would also resolve this issue.
That's weird. Go should close the client connection if it receives a GOAWAY frame. It's actually a relatively recent addition to the Go HTTP/2 client.
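For reference, a minimal sketch of the HTTP/2 connection health-check mechanism in golang.org/x/net/http2 that is related to detecting dead connections (this is not the plugin's code, and the timeout values are illustrative):

```go
// Illustrative sketch: golang.org/x/net/http2 supports ping-based health checks
// via ReadIdleTimeout, which closes a connection that stops responding, so a
// dead backend is detected even if no GOAWAY is ever received.
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	transport := &http.Transport{}

	// ConfigureTransports wires HTTP/2 support into the transport and returns
	// the http2.Transport so its health-check knobs can be set.
	h2, err := http2.ConfigureTransports(transport)
	if err != nil {
		log.Fatal(err)
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a ping if the connection has been silent this long
	h2.PingTimeout = 15 * time.Second     // close the connection if the ping gets no reply

	client := &http.Client{Transport: transport}
	_ = client
}
```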
Hmm, I'll admit my understanding of how Go manages its dependencies is not great, but I can see the plugin …
When you got this error, did you also see a huge spike in goroutines? We're seeing something eerily similar to these errors. We're using EKS so we also do not control the load balancer config or management, though I believe AWS uses NLBs under the hood: https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#eks-architecture

As far as we can tell nothing is happening with our control planes and we're getting this error out of nowhere. Debug logging just surfaces the context cancelled errors and nothing else. Once these errors start kicking off, we start seeing failures across almost all of our EKS clusters for some time, and then it'll stop.
I am seeing a very similar issue. More info below.

Context Canceled on All Authentication Randomly

- Vault version: 13.1
- Running Vault in EKS 1.23 (been seeing the issue since 1.22)
- Once auth starts to fail, it fails across all clusters in multiple AWS accounts.
- Fails for 15 minutes and 30 seconds (almost exactly, at least for the last 2 occurrences; I haven't checked the others). It recovers, and then the failures repeat 2-4 times (also for the same 15 min and 30 secs).
- This occurs very randomly: as much as twice a week or as little as once every 2 weeks.
- Failures look like:
- Audit logs show that failures are only with auth/kubernetes but don't show anything else of substance.
- Trace logs were just enabled (the incident hasn't occurred since), but from the code I don't know if they will show much more info.
- While the trigger could likely be an API server having scaled/shut down, we have many accounts/clusters and limited visibility with EKS, so it is hard to determine if that is the exact trigger.

Any help would be appreciated!
@sidewinder12s this dropped off my radar as the "Terminate connections on deregistration" setting on the NLB was enough to resolve our issue.

@alexandersimpson17 Vault 13.1 still referenced v0.15.0 of this plugin; the x/net versions were bumped in v0.15.1. Again, my understanding of Go dependencies is not great, but it may be worth trying a later version of Vault that references v0.15.1 or later of this plugin and seeing if that fixes your issue.
Thanks for the update. Since we're using EKS we're flying a little blind, but we're going to give Vault 1.14 a try since that seemed to include the correct updated plugin version (we had agreed with your assessment of the Go deps).
This issue is still occurring for me (currently on 1.15). I was looking at this plugin and noticed it is using cleanhttp. cleanhttp hasn't been updated in almost 3 years and uses Go 1.13. I was wondering if anyone had thoughts on changing the plugin to just use the latest version of …
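For context, a sketch of roughly what a pooled cleanhttp-style client configures, written directly against net/http; the values mirror common defaults and are not a claim about this plugin's exact settings:

```go
// Illustrative sketch of a pooled HTTP client built with the standard library
// alone; the field values are typical defaults, not the plugin's configuration.
package main

import (
	"net"
	"net/http"
	"runtime"
	"time"
)

func newPooledClient() *http.Client {
	transport := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   runtime.GOMAXPROCS(0) + 1,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
	return &http.Client{Transport: transport}
}

func main() {
	client := newPooledClient()
	_ = client
}
```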
Hi, we're observing authentication requests fail consistently for ~10 minutes after an API node has been shut down.
We run multiple API nodes and bring one down at a time for updates; all other API nodes remain accessible via our load balancer, and we have confirmed the URL used by the plugin operates correctly during this time.
However, the Vault plugin seems to hold onto a bad connection for 10 minutes and constantly fails all authentication attempts; after this period it operates correctly.
This is the response we see:
"https://{kube api url}/tokenreviews": context canceled
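For reference, a minimal Go sketch (independent of Vault and this plugin; the URL is a placeholder) showing how the standard HTTP client surfaces exactly this error string when a request's context is canceled:

```go
// Minimal reproduction of the error shape reported above: a request whose
// context is canceled fails with `Post "<url>": context canceled`.
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel immediately so the request fails with "context canceled"

	req, _ := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://kube-api.example.com/apis/authentication.k8s.io/v1/tokenreviews", // placeholder
		strings.NewReader("{}"))

	_, err := http.DefaultClient.Do(req)
	fmt.Println(err) // Post "https://.../tokenreviews": context canceled
}
```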
We're using v1.12.3 of Vault