
🐛 Reconcile etcd members on control plane scale down #265

Merged

Conversation

@Danil-Grigorev (Contributor) commented Feb 14, 2024

kind/bug

What this PR does / why we need it:
This change establishes connectivity to child cluster etcd members and manages membership during cluster scaling. Specifically, this is required when the cluster etcd leader is removed by the scale-down procedure, as this causes the cluster API server to become unavailable and never come back online.

Therefore the etcd leader needs to be moved to another member just before node deletion is requested, and etcd membership has to be adjusted as well.
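To illustrate the mechanics, here is a minimal sketch using the etcd clientv3 API. The function and variable names are hypothetical (not this PR's actual code), and the client is assumed to already be configured with the rke2-generated certificates:

```go
package etcd

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// reconcileMemberRemoval moves the etcd leader off the member that is about
// to be deleted, then removes that member from the cluster.
func reconcileMemberRemoval(ctx context.Context, cli *clientv3.Client, removedNodeName string) error {
	members, err := cli.MemberList(ctx)
	if err != nil {
		return fmt.Errorf("listing etcd members: %w", err)
	}

	var removedID, transfereeID uint64
	for _, m := range members.Members {
		if m.Name == removedNodeName {
			removedID = m.ID
		} else {
			transfereeID = m.ID // any surviving member can take over leadership
		}
	}
	if removedID == 0 {
		return nil // member is already gone, nothing to reconcile
	}

	// Check whether the member being removed currently holds leadership.
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		return fmt.Errorf("reading etcd status: %w", err)
	}
	if status.Leader == removedID && transfereeID != 0 {
		// Transfer leadership before node deletion makes the API server
		// unreachable. Note MoveLeader must reach the current leader, so the
		// client should be pointed at its endpoint.
		if _, err := cli.MoveLeader(ctx, transfereeID); err != nil {
			return fmt.Errorf("moving etcd leader: %w", err)
		}
	}

	_, err = cli.MemberRemove(ctx, removedID)
	return err
}
```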

RKE2 follows a different certificate management model as opposed to CAPI, therefore we can’t provide CAPI certificates on node bootstrap, and instead have to fetch the certificate generated by the rke2 server during bootstrapping.

New clusters will use the regular CAPI certificate management model, where the certificates are provided to the RKE2 agent on initialization and are generated by cabprke2 if missing. Therefore, for the time being there will be two co-existing cluster configurations, which will be reduced to the upstream one by performing certificate rotation in the future.
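As a rough sketch of how the interim model could mirror the rke2-generated certificate into the management cluster (the secret names and the helper below are assumptions for illustration, not the provider's actual identifiers):

```go
package secrets

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mirrorEtcdCertSecret copies the etcd client certificate generated by the
// rke2 server on the child cluster into a secret on the management cluster,
// so the provider can connect to child cluster etcd members later.
func mirrorEtcdCertSecret(ctx context.Context, childClient, mgmtClient client.Client, clusterName, namespace string) error {
	var childSecret corev1.Secret
	// "rke2-etcd-client-cert" in kube-system is a hypothetical child-side name.
	if err := childClient.Get(ctx, client.ObjectKey{Namespace: "kube-system", Name: "rke2-etcd-client-cert"}, &childSecret); err != nil {
		return err
	}

	mirrored := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      clusterName + "-etcd", // hypothetical management-side name
		},
		Data: childSecret.Data,
	}
	// Tolerate re-runs of the reconcile loop: an existing mirror is fine.
	if err := mgmtClient.Create(ctx, mirrored); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```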

Existing clusters will not be required to scale up in order for the fix to take effect; a regular upgrade will do this, even with the MaxSurge=0 setting. For them, if both the local etcd secret and the child cluster’s bootstrapped etcd secret are missing, no etcd operations will be performed. The first upgraded node, however, will act as the “scale up” that populates the child cluster secret. This way it is guaranteed that every cluster will eventually get the fix.
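A minimal sketch of that guard logic, again with hypothetical names rather than the PR's actual code:

```go
package secrets

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// shouldReconcileEtcd reports whether etcd membership operations may run.
// If both the local (management cluster) etcd secret and the child cluster's
// bootstrapped etcd secret are missing, the cluster predates the fix, so etcd
// operations are skipped; the first upgraded node then acts as the "scale up"
// that populates the child cluster secret.
func shouldReconcileEtcd(ctx context.Context, mgmtClient, childClient client.Client, key client.ObjectKey) bool {
	var secret corev1.Secret
	localMissing := apierrors.IsNotFound(mgmtClient.Get(ctx, key, &secret))
	childMissing := apierrors.IsNotFound(childClient.Get(ctx, key, &secret))
	// Non-NotFound errors are treated as "present" in this simplified sketch.
	return !(localMissing && childMissing)
}
```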

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #263

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch 5 times, most recently from 5334096 to d6bc7ea Compare February 19, 2024 10:15
@furkatgofurov7 furkatgofurov7 added this to the v0.3.0 milestone Feb 19, 2024
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 06c1752 to 5288427 Compare February 19, 2024 15:09
@Danil-Grigorev Danil-Grigorev added the kind/bug Something isn't working label Feb 19, 2024
@Danil-Grigorev Danil-Grigorev changed the title [WIP] Reconcile etcd members on control plane scale down 🐛 Reconcile etcd members on control plane scale down Feb 19, 2024
@Danil-Grigorev (Contributor, Author) commented:

The current implementation passes e2e tests covering a single-node upgrade and a scale-to-1 scenario. This will allow preserving existing clusters, but in order to apply the fix, the cluster control plane replicas will have to be scaled up by one node at some point. I will explore a different approach of passing generated certificates to the rke2 server. This may be closer to upstream, but will require all existing clusters to be re-created, or some etcd migration mechanism (like rke2 certificate rotation to manually supplied certificates, if such a thing is supported).

Ran 2 of 2 Specs in 6350.292 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

@Danil-Grigorev Danil-Grigorev added kind/bug Something isn't working and removed kind/bug Something isn't working labels Feb 20, 2024
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 5288427 to 5e6a700 Compare February 20, 2024 15:27
@Danil-Grigorev (Contributor, Author) commented:

@richardcase Presubmit e2e job passed without issues here :) And under 30 minutes.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 5e6a700 to 0e2909f Compare February 20, 2024 17:08
@alexander-demicev (Member) left a comment:


thanks a lot for taking care of this problem

@salasberryfin previously approved these changes Mar 27, 2024
@Danil-Grigorev (Contributor, Author) commented:

The e2e tests are failing after the rebase and the output is very cryptic. I’m also seeing a bug in the (unchanged) RKE2 code related to the PodFailedReason condition: once it is set, it is no longer updated, causing precondition checks to fail.

@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch 3 times, most recently from d48700a to aaa9725 Compare April 8, 2024 19:34
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch 12 times, most recently from e6bda45 to 58bb752 Compare April 11, 2024 15:06
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 58bb752 to 04b620d Compare April 12, 2024 10:37
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 04b620d to 1ad956d Compare April 12, 2024 11:21
- Disk pressure fix for kube-vip

Signed-off-by: Danil Grigorev <[email protected]>
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch 2 times, most recently from 8346abb to 734414e Compare April 12, 2024 14:46
Signed-off-by: Danil Grigorev <[email protected]>
@Danil-Grigorev Danil-Grigorev force-pushed the reconcile-etcd-members-scale-down branch from 734414e to 5289859 Compare April 12, 2024 15:53
@Danil-Grigorev (Contributor, Author) commented:

Tests are green again. There were some issues with the code, but some problems are in the CAPI framework: it appears multiple MachineSets are created per MachineDeployment in the upgrade scenario, and the testing code is unable to distinguish between those.

Kube-vip is not helpful, as its pods are getting evicted due to MemoryPressure on the child cluster nodes, and the default set of tolerations is for some reason ignored there. Tests may sometimes flake because, with no load balancing solution, the RKE2 agent may connect to a non-existing (recently removed) node. Another issue I observed with kube-vip is that leader election is never able to release a lock held by a dead pod on a node where the etcd instance is offline. This is likely a client-go issue.

That being said, the PR is ready to be merged, as the functionality is consistent with the description.

@richardcase (Contributor) left a comment:


Looks ok to me.

@Danil-Grigorev - we should also follow up on the “Kube-vip is not helpful” comment.

@furkatgofurov7 (Contributor) left a comment:


Thanks, LGTM

@Danil-Grigorev Danil-Grigorev merged commit e27bcbd into rancher:main Apr 19, 2024
7 checks passed
@Danil-Grigorev Danil-Grigorev deleted the reconcile-etcd-members-scale-down branch April 19, 2024 10:46
Labels
kind/bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Control plane nodes scale down causes etcd to loose quorum and do not restore
6 participants