Add instruction of workload cluster backup and restore #6783
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main    #6783    +/-   ##
========================================
+ Coverage   71.64%   72.02%   +0.38%
========================================
  Files         556      569      +13
  Lines       43199    44299    +1100
========================================
+ Hits        30948    31908     +960
- Misses      10541    10638     +97
- Partials     1710     1753     +43
```

View the full report in Codecov by Sentry.
> Since cluster failures primarily occur following unsuccessful cluster upgrades, EKS Anywhere takes the proactive step of automatically creating backups for the Cluster API objects that capture the states of both the management cluster and its workload clusters. These backups are stored within the management cluster folder, where the upgrade command is initiated from the Admin machine, and are generated before each upgrade process.
>
> Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:
```diff
- Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:
+ Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects after every upgrade. For example, to create a Cluster API backup of a cluster:
```
Also, we do a backup before every upgrade, right, regardless of success or failure? Should we just be using that backup instead of having the user run this command? My above suggestion applies if we change these instructions to back up the restore folder that gets created.
we do for upgrade, but not for create. I'm including manual backup instructions in case a cluster just fails without an eksa upgrade. This paragraph is not about the auto backup during eksa upgrade.
I see, but is this a new behavior we want users to take on? Wondering if we can recommend backing up before any operation that changes the cluster, or if we want to quantify what that routine basis is. I think this might be worth an issue so that the CLI can create the backup as well, instead of running a manual docker command.
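For reference, the kind of manual backup being discussed might look like the sketch below. This is not the exact command from the docs: the image tag, volume mounts, namespace, and paths are all assumptions here, built around clusterctl's `move --to-directory` flag.

```bash
# Sketch only: dump the Cluster API objects for a cluster to a local
# directory using the EKS Anywhere cli-tools image. The image tag,
# kubeconfig path, namespace, and output directory are assumptions.
MGMT_CLUSTER="mgmt"
MGMT_KUBECONFIG="${MGMT_CLUSTER}/${MGMT_CLUSTER}-eks-a-cluster.kubeconfig"

docker run -i --network host -w "$(pwd)" \
  -v "$(pwd)":"$(pwd)" \
  --entrypoint clusterctl \
  public.ecr.aws/eks-anywhere/cli-tools:latest \
  move \
  --namespace eksa-system \
  --kubeconfig "${MGMT_KUBECONFIG}" \
  --to-directory "${MGMT_CLUSTER}/cluster-api-backup"
```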
> ## Restore a workload cluster
>
> Similar to failed management cluster without infrastructure components change, follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) to restore the workload cluster itself from the backup.
Are you also planning on adding a section here for restoring a workload cluster if infrastructure changes do happen?
For example: during a workload cluster upgrade, if all the CP nodes got rolled out but the upgrade failed during the worker upgrade, a simple etcd restore won't work, since doing a restore would cause the node names, IPs, and potentially other things to revert to a state that is no longer valid.
An etcd restore is really only a good idea if users lose their etcd cluster entirely and want to recover their data, or if they want to revert their own deployments to a previous state and nothing else in the infra (nodes specifically) has changed.
yea I was kinda avoiding talking about this until we have a solution for the case where the workload cluster is completely inaccessible / the infrastructure changed. But you are right, let me add these notes for clarification.
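To make the distinction above concrete, here is a minimal sketch of the snapshot flow the linked etcd pages cover, assuming a kubeadm-style certificate layout; the paths, endpoint, and data directory are assumptions and will differ per OS and per the external-etcd setup EKS Anywhere uses:

```bash
# Sketch only: take an etcd snapshot on an etcd node. Certificate
# paths and the endpoint are assumptions, not the documented values.
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restoring the snapshot rewinds cluster state to the moment it was
# taken, which is only safe if the nodes have not changed since then.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored
```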
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/woof
This reads really well, @jiayiwang7. I just have a few small wording suggestions.
> 1. Validate all clusters are in the desired state.
>
>    ```bash
>    kubectl get clusters -n default --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
>    ```
You could check here that the eks-a clusters are also ready through their status, like with a custom column in this command.
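Something along these lines, for example; the resource name and the condition JSONPath are sketched from the EKS-A Cluster CRD and are assumptions, not a tested command:

```bash
# Sketch: surface each EKS-A cluster's Ready condition as a column.
# The JSONPath filter against .status.conditions is an assumption.
kubectl get clusters.anywhere.eks.amazonaws.com -n default \
  --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig \
  -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status'
```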
> ### Cluster not accessible or infrastructure components changed after etcd backup was taken
>
> If the workload cluster is still accessible, but the infrastructure machines are changed after the etcd backup was taken, you can still try restoring the cluster itself from the etcd backup. Although doing so is risky: it can potentially cause the node names, IPs and other infrastructure configurations to revert to a state that is no longer valid. Restoring etcd effectively takes a cluster back in time and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like Kubernetes controller managers, EKS Anywhere cluster controller manager, and Cluster API controller managers. You may need to manually update the CAPI infrastructure objects, such as the infra VMs and machines to use the existing or latest configurations in order to bring the workload cluster back to a healthy state.
> You may need to manually update the CAPI infrastructure objects, such as the infra VMs and machines to use the existing or latest configurations in order to bring the workload cluster back to a healthy state.
I think I'm missing the scenario where this is needed. Shouldn't this work the same way as the process you outline below (restoring the backup in a new workload cluster)?
yea it does. we can remove this for less confusion
> * [BottleRocket]({{< relref "../etcd-backup-restore/bottlerocket-etcd-backup/#restore-etcd-from-backup" >}})
> * [Ubuntu]({{< relref "../etcd-backup-restore/ubuntu-rhel-etcd-backup/#restore" >}})
>
> You might notice that after restoring the original etcd backup to the new workload cluster `w02`, all the node names have prefix `w01-*` and the nodes go to `NotReady` state. This is because restoring etcd effectively applies the node data from the original cluster which causes a conflicting history and can impact the behavior of watching components like Kubelets, Kubernetes controller managers.
Should we add a step to delete the old node entries? I think some of the providers will leave them hanging and they will need manual cleanup.
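Roughly, that cleanup step might look like the sketch below; the `w01-` prefix mirrors the doc's example, while the kubeconfig path and the awk-based filter are assumptions:

```bash
# Sketch only: delete node objects carried over from the original
# cluster after the etcd restore. The kubeconfig path is an assumption.
KUBECONFIG_W02="w02/w02-eks-a-cluster.kubeconfig"

kubectl get nodes --kubeconfig "${KUBECONFIG_W02}" --no-headers \
  | awk '/^w01-/ {print $1}' \
  | xargs -r -n1 kubectl delete node --kubeconfig "${KUBECONFIG_W02}"
```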
> ```bash
> # SSH into the control plane and worker nodes. You must do this for each node.
> ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
> apiclient exec admin bash
> sheltie
> ```
```diff
- sheltie
+ sudo sheltie
```
Or is the sudo not required here? I think if you SSH to the node and directly run sudo sheltie, that is also enough, not requiring you to run the apiclient command, right?
when you are in the admin container, sudo is not required
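Side by side, the two access paths being discussed would look roughly like this; which one applies depends on which container the SSH session lands in, and that is an assumption either way:

```bash
# Path shown in the doc: enter the admin container first, then drop
# to the host root shell.
ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
apiclient exec admin bash
sheltie

# Alternative raised above: if the SSH session already lands in the
# admin container, this may be enough (assumption).
ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
sudo sheltie
```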
looks good
just nits and typos
/cherry-pick release-0.18
@jiayiwang7: once the present PR merges, I will cherry-pick it on top of release-0.18 in a new PR and assign it to you.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jiayiwang7.
/lgtm
@jiayiwang7: new pull request created: #7445
Issue #, if available:
Part 2 of #6767
Description of changes:
Testing (if applicable):
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.