docs: RFC for disrupted EBS-Backed StatefulSet delays #6336
Conversation
Thank you @FernandoMiguel for the edits!
designs/statefulset-disruption.md
Outdated
**Problem B. If step 3 doesn't happen before step 4, there will be a 1+ minute delay**
If Karpenter calls EC2 TerminateInstance **before** the EC2 DetachVolume calls from the EBS CSI Driver controller pod finish, then the volumes won't be detached **until the old instance terminates**. This delay depends on how long the underlying instance takes to enter the `terminated` state, which depends on the instance type: typically 1 minute for `m5a.large`, up to 15 minutes for certain Metal/GPU/Windows instances. See [appendix D1](#d1-ec2-termination--ec2-detachvolume-relationship-) for more context.
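To make the quoted delay concrete, here is a minimal sketch of how one could measure it, assuming aws-sdk-go-v2; `measureDetachDelay` is an illustrative helper, not code from the RFC:

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// measureDetachDelay times how long a volume remains unavailable after
// TerminateInstances is called, i.e. the delay described above.
func measureDetachDelay(ctx context.Context, client *ec2.Client, instanceID, volumeID string) (time.Duration, error) {
	if _, err := client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		return 0, err
	}
	start := time.Now()

	// If no DetachVolume call completed beforehand, the volume only returns
	// to the `available` state once the instance reaches `terminated`, so
	// this waiter measures the delay end to end.
	waiter := ec2.NewVolumeAvailableWaiter(client)
	if err := waiter.Wait(ctx, &ec2.DescribeVolumesInput{
		VolumeIds: []string{volumeID},
	}, 20*time.Minute); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}
```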
Mention "From my brief experiments, typically 1 minute for m5a.large..."
This variability is really unfortunate. I wonder if you see more consistent behavior with ec2:StopInstances? I assume you can detach a volume when an instance is stopped? If so, maybe stopping prior to termination would at least put a more reasonable upper bound on the delay in this case.
edit: I see in the appendix:
The guest OS can take a long time to complete shutting down.
But that sounds like it's worth digging into, not sure how/why the OS shutdown sequence could take 15 minutes
Thanks for this @cartermckinnon, two notes for you.
I assume you can detach a volume when an instance is stopped? If so, maybe stopping prior to termination would at least put a more reasonable upper bound on the delay in this case
- Great idea on ec2:StopInstances, but I'm not sure if the extra step is worth it, considering that one can only detach a volume after the instance reaches the `stopped` state (which would take an extra ~8 seconds). I measured the difference between stopping/terminating durations for a couple of different instance types here. Looks like `stopped` is 10-40 seconds faster depending on instance type or EC2 quirks.
- My original document was misleading when talking about GPU/Windows instance termination times. Most GPU/Windows instance types are closer to Linux termination times than metal instances. I have added an instance termination timings section where I tested a few instance types.
The guest OS can take a long time to complete shutting down.
I can try digging into this with the folks at EC2.
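For reference, the stop-before-terminate flow floated in this thread might look like the following sketch (assuming aws-sdk-go-v2); it illustrates the idea the thread ultimately decided against, and `stopThenDetach` is an invented name:

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// stopThenDetach illustrates the flow discussed above: stop the instance,
// wait for `stopped`, detach the volumes, then terminate.
func stopThenDetach(ctx context.Context, client *ec2.Client, instanceID string, volumeIDs []string) error {
	if _, err := client.StopInstances(ctx, &ec2.StopInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		return err
	}

	// Volumes can only be detached cleanly once the instance is `stopped`,
	// which is the extra ~8 seconds mentioned above.
	waiter := ec2.NewInstanceStoppedWaiter(client)
	if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	}, 5*time.Minute); err != nil {
		return err
	}

	for _, vol := range volumeIDs {
		if _, err := client.DetachVolume(ctx, &ec2.DetachVolumeInput{
			VolumeId: aws.String(vol),
		}); err != nil {
			return err
		}
	}

	_, err := client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{instanceID},
	})
	return err
}
```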
designs/statefulset-disruption.md
Outdated
- Configure Kubelet for Graceful Node Shutdown
- Enable Karpenter Spot Instance interruption handling
- Use EBS CSI Driver ≥ `v1.x` in order to use the [PreStop Lifecycle Hook](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-is-the-prestop-lifecycle-hook)
- Set `.node.tolerateAllTaints=false` when deploying the EBS CSI Driver
If graceful node shutdown is configured, wouldn't it be fine for the driver to tolerate Karpenter's disruption taint?
For the 6+ minute delays: in theory yes, but in practice I have needed to turn `tolerateAllTaints` off in order to avoid them. We can trade notes on this.
See [a proof-of-concept implementation of A2 & B1 in PR #1294](https://github.com/kubernetes-sigs/karpenter/pull/1294)
Finally, we should add the following EBS x Karpenter end-to-end test in karpenter-provider-aws to catch regressions between releases of Karpenter or EBS CSI Driver:
Nice! Really like that we are scoping testing into this RFC! This is definitely something that we should be actively monitoring!
This means that our sequence of events will match the ideal diagram from the section [Ideal Graceful Shutdown for Stateful Workloads](#ideal-graceful-shutdown-for-stateful-workloads)
We can use similar logic to [today's proof-of-concept implementation](https://github.com/kubernetes-sigs/karpenter/pull/1294), but move it to karpenter-provider-aws and check for `node.Status.VolumesInUse` instead of listing volumeattachment objects. A 20-second max wait was sufficient to prevent delays with the m5a instance type, but further testing is needed to ensure it is enough for Windows/GPU instance types.
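A rough sketch of what that check could look like, assuming client-go; the helper name and the 2-second poll interval are illustrative, not taken from the proof of concept:

```go
package main

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForEBSDetach polls node.Status.VolumesInUse until no EBS CSI volumes
// remain, up to the 20-second budget mentioned above.
func waitForEBSDetach(ctx context.Context, kube kubernetes.Interface, nodeName string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 20*time.Second, true,
		func(ctx context.Context) (bool, error) {
			node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			for _, v := range node.Status.VolumesInUse {
				// CSI entries look like "kubernetes.io/csi/ebs.csi.aws.com^vol-...".
				if strings.Contains(string(v), "ebs.csi.aws.com") {
					return false, nil // still in use; keep polling
				}
			}
			return true, nil
		})
}
```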
Is there any kind of signal that the EBS CSI driver could give us to know that everything is detached before terminating? One option is that Karpenter does some work from its neutral code; another option is that the EBS CSI driver injects a finalizer and only removes it when it knows that it has finished the work that it needs to do. This, of course, requires us to create a finalizer that we wait on before instance termination. There are similar-ish requirements from ALB, though, with waiting on instance termination before the target group is deregistered, and I'm wondering how much overlap we have here between these two problems.
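To illustrate the finalizer option, a hedged sketch follows; everything here is hypothetical, including the finalizer name, and no such finalizer exists in the EBS CSI driver today:

```go
package main

import corev1 "k8s.io/api/core/v1"

// hypotheticalDetachFinalizer is an invented name for illustration only.
const hypotheticalDetachFinalizer = "ebs.csi.aws.com/volumes-detached"

// csiFinishedDetaching reports whether the driver has removed its
// (hypothetical) finalizer, signaling that its detach work is complete.
func csiFinishedDetaching(node *corev1.Node) bool {
	for _, f := range node.Finalizers {
		if f == hypotheticalDetachFinalizer {
			return false // driver still working; hold off on termination
		}
	}
	return true
}
```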
Wait for volumes to detach before terminating the instance.
We can do this by waiting for all volumes of drain-able nodes to be marked as neither in use nor attached before terminating the node in c.cloudProvider.Delete (up to a maximum of 20 seconds). See [Appendix D3 for the implementation details of this wait](#d)
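A minimal sketch of the gate itself, assuming the node object is available in the deletion path (`volumesQuiesced` is an illustrative name, not actual Karpenter code):

```go
package main

import corev1 "k8s.io/api/core/v1"

// volumesQuiesced reports whether the node's volumes are neither in use
// nor attached, i.e. the condition the Delete path would poll for up to
// 20 seconds before terminating the instance.
func volumesQuiesced(node *corev1.Node) bool {
	if len(node.Status.VolumesInUse) > 0 {
		return false
	}
	return len(node.Status.VolumesAttached) == 0
}
```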
This is kind of weird IMO. This basically means that the Delete call is going to return an error until we are good to actually delete the instance. I wonder if there are other ways that we could hook into the deletion flow without having to hack the cloudprovider delete call so that it orchestrates the flow that we're looking for.
The problem today is that we only call delete once we get a success, which means that you are either going to have to make this synchronous or do some kind of weird erroring so that we back off until we actually do the termination.
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
/remove-lifecycle stale
/hold Edits
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
@jmdeal Since this change ended up being done at the k-sigs/karpenter level, and we don't plan on making it part of the eventual "pre-node-deletion-hooks" plan, should I move an up-to-date and shortened version of this document to that repo?
Sorry for the late response @AndrewSirenko, if you wanted to do that that would be great! I'll close this PR out and we can track over in k-sigs.
Fixes #N/A
Description
Create RFC for disrupted EBS-Backed StatefulSet delays
How was this change tested?
N/A
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.