
Add instruction of workload cluster backup and restore #6783

Merged
1 commit merged into main on Jan 31, 2024

Conversation

@jiayiwang7 jiayiwang7 (Member) commented Oct 6, 2023

Issue #, if available:

Part 2 of #6767

Description of changes:

Testing (if applicable):

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot eks-distro-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Oct 6, 2023
@codecov

codecov bot commented Oct 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7b80274) 71.64% compared to head (db6be7a) 72.02%.
Report is 140 commits behind head on main.

❗ Current head db6be7a differs from pull request most recent head 877f989. Consider uploading reports for the commit 877f989 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6783      +/-   ##
==========================================
+ Coverage   71.64%   72.02%   +0.38%     
==========================================
  Files         556      569      +13     
  Lines       43199    44299    +1100     
==========================================
+ Hits        30948    31908     +960     
- Misses      10541    10638      +97     
- Partials     1710     1753      +43     


@jiayiwang7 jiayiwang7 force-pushed the backup-restore-doc branch 2 times, most recently from 013f4bd to 4395690 Compare October 10, 2023 20:21
@jiayiwang7 jiayiwang7 changed the title from "Add instruction of restoring clusters from backup" to "Add instruction of cluster backup and restore" Oct 10, 2023

Since cluster failures primarily occur following unsuccessful cluster upgrades, EKS Anywhere takes the proactive step of automatically creating backups for the Cluster API objects that capture the states of both the management cluster and its workload clusters. These backups are stored within the management cluster folder, where the upgrade command is initiated from the Admin machine, and are generated before each upgrade process.

Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:
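For context, a minimal sketch of what such a Cluster API backup command could look like (hedged: it assumes clusterctl is available on the Admin machine and that the cluster objects live in the eksa-system namespace; the exact command in the docs may differ):

```bash
# Hypothetical example only: export Cluster API objects from the management cluster to a local folder.
# MGMT_CLUSTER and MGMT_CLUSTER_KUBECONFIG are placeholder variables, not names from the docs.
clusterctl move \
  --kubeconfig "${MGMT_CLUSTER_KUBECONFIG}" \
  --namespace eksa-system \
  --to-directory "${MGMT_CLUSTER}-backup"
```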
Member

Suggested change
Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:
Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects after every upgrade. For example, to create a Cluster API backup of a cluster:

Also, we do a backup before every upgrade, right, regardless of success or failure? Should we just be using that backup instead of having the user run this command? My suggestion above applies if we change these instructions to back up the restore folder that gets created.

Member Author

@jiayiwang7 jiayiwang7 Oct 11, 2023

We do for upgrade, but not for create. I'm including manual backup instructions in case a cluster fails without an eksa upgrade. This paragraph is not about the automatic backup taken during eksa upgrade.

Member

I see, but is this a new behavior we want users to adopt? Wondering if we can recommend backing up before any operation that changes the cluster, or if we want to quantify what that routine basis is. I think this might also be worth an issue to create a backup through our CLI instead of having users run a manual docker command.

Comment on lines 146 to 148
## Restore a workload cluster

Similar to a failed management cluster whose infrastructure components have not changed, follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) guide to restore the workload cluster itself from the backup.
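As a rough, hedged illustration of what the linked guide involves (the authoritative, OS-specific steps, endpoints, and certificate paths are in the linked pages, not here):

```bash
# Generic etcdctl example only; endpoints and certificate paths are placeholders.
# On an external etcd machine, take a snapshot:
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/etcd.crt --key=/path/to/etcd.key

# Later, restore the snapshot into a fresh data directory before restarting etcd:
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored
```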
Member

Are you also planning on adding a section here for restoring a workload cluster if infrastructure changes do happen?
For example: during a workload cluster upgrade, if all the CP got rolled out but the upgrade failed during the worker upgrade, a simple etcd restore won't work, since doing a restore would cause the node names, IPs and potentially other things to revert to a state that is no longer valid.
Etcd restore is really only a good idea if users lose their etcd cluster entirely and want to recover their data, or if they want to revert their own deployments to a previous state and nothing else in the infra (nodes specifically) has changed.

Member Author

Yeah, I was kind of avoiding talking about this until we have a solution for the case where the workload cluster is completely inaccessible or the infrastructure changed. But you are right, let me add these notes for clarification.

Member

@abhinavmpandey08 abhinavmpandey08 left a comment

/lgtm
/woof

@eks-distro-bot
Collaborator

@abhinavmpandey08: dog image

In response to this:

/lgtm
/woof

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@chrisnegus chrisnegus left a comment

This reads really well, @jiayiwang7. I just have a few small wording suggestions.

@jiayiwang7 jiayiwang7 force-pushed the backup-restore-doc branch 2 times, most recently from 8f5d280 to 4c8cc1f Compare October 24, 2023 20:28
1. Validate all clusters are in the desired state.

```bash
kubectl get clusters -n default --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
```
Member

You could also check here that the eks-a clusters are ready through their status, for example with a custom column in this command.
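A hedged sketch of what such a readiness check could look like (it assumes the EKS Anywhere Cluster objects expose a Ready condition under .status.conditions; the exact resource name and JSONPath may differ):

```bash
# Illustrative only: print each EKS Anywhere cluster with the status of its Ready condition.
kubectl get clusters.anywhere.eks.amazonaws.com -n default \
  --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```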


### Cluster not accessible or infrastructure components changed after etcd backup was taken

If the workload cluster is still accessible, but the infrastructure machines have changed since the etcd backup was taken, you can still try restoring the cluster itself from the etcd backup. Doing so is risky, however: it can potentially cause the node names, IPs and other infrastructure configurations to revert to a state that is no longer valid. Restoring etcd effectively takes a cluster back in time, and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like the Kubernetes controller managers, the EKS Anywhere cluster controller manager, and the Cluster API controller managers. You may need to manually update the CAPI infrastructure objects, such as the infra VMs and machines to use the existing or latest configurations in order to bring the workload cluster back to a healthy state.
Member

You may need to manually update the CAPI infrastructure objects, such as the infra VMs and machines to use the existing or latest configurations in order to bring the workload cluster back to a healthy state.

I think I'm missing the scenario where this is needed. Shouldn't this work the same way as the process you outline below (restoring the backup in a new workload cluster)?

Member Author

Yeah, it does. We can remove this to avoid confusion.

* [BottleRocket]({{< relref "../etcd-backup-restore/bottlerocket-etcd-backup/#restore-etcd-from-backup" >}})
* [Ubuntu]({{< relref "../etcd-backup-restore/ubuntu-rhel-etcd-backup/#restore" >}})

You might notice that after restoring the original etcd backup to the new workload cluster `w02`, all the node names have the prefix `w01-*` and the nodes go into the `NotReady` state. This is because restoring etcd effectively applies the node data from the original cluster, which causes a conflicting history and can impact the behavior of watching components like kubelets and Kubernetes controller managers.
Member

Should we add a step to delete the old node entries? I think some of the providers will leave them hanging, and they will need manual cleanup.
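If such a step were added, a minimal hedged sketch could look like this (the `w01-` prefix comes from the example above; the kubeconfig variable is a placeholder, not a name from the docs):

```bash
# Illustrative only: delete stale Node objects left over from the old cluster. Verify the names before deleting.
kubectl get nodes --kubeconfig ${WORKLOAD_CLUSTER_KUBECONFIG} --no-headers \
  | awk '/^w01-/ {print $1}' \
  | xargs -r kubectl delete node --kubeconfig ${WORKLOAD_CLUSTER_KUBECONFIG}
```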


```bash
# SSH into the control plane and worker nodes. You must do this for each node.
ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
apiclient exec admin bash
sheltie
```
Member

Suggested change
sheltie
sudo sheltie

Or is the sudo not required here? I think if you SSH to the node and directly run sudo sheltie, that is also enough, without requiring you to run the apiclient command, right?

Member Author

When you are in the admin container, sudo is not required.
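For reference, a hedged sketch of the two paths discussed in this thread (Bottlerocket behavior can vary by image and configuration):

```bash
# Path shown in the doc: enter the Bottlerocket admin container, where sudo is not required for sheltie.
apiclient exec admin bash
sheltie

# Alternative raised above: run sheltie with sudo directly, without the separate apiclient step.
sudo sheltie
```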

@jiayiwang7 jiayiwang7 force-pushed the backup-restore-doc branch 2 times, most recently from 472f0c9 to 125f9a2 Compare November 2, 2023 18:51
@jiayiwang7 jiayiwang7 changed the title from "Add instruction of cluster backup and restore" to "Add instruction of workload cluster backup and restore" Nov 2, 2023
Member

@g-gaston g-gaston left a comment

looks good
just nits and typos

@jiayiwang7
Member Author

/cherry-pick release-0.18

@eks-distro-pr-bot
Contributor

@jiayiwang7: once the present PR merges, I will cherry-pick it on top of release-0.18 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jiayiwang7
Member Author

/approve

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayiwang7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@g-gaston g-gaston left a comment

/lgtm

@eks-distro-pr-bot
Contributor

@jiayiwang7: new pull request created: #7445

In response to this:

/cherry-pick release-0.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved, area/docs (Documentation), documentation, lgtm, size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)

8 participants