jiayiwang7 committed Oct 11, 2023
1 parent 4395690 commit 74ae834
Showing 3 changed files with 60 additions and 15 deletions.
description: >
How to backup your EKS Anywhere cluster
---

We strongly advise performing regular cluster backups of all the EKS Anywhere clusters. This ensures that you always have an up-to-date cluster state available for restoration in case the cluster experiences issues or becomes unrecoverable. This document outlines the steps for creating the two essential types of backups required for the [EKS Anywhere cluster restore process]({{< relref "./restore-cluster" >}}).

## Etcd backup

For optimal cluster maintenance, it is crucial to perform regular etcd backups on all your EKS Anywhere management and workload clusters. **Always** take an etcd backup before performing an upgrade so it can be used to restore the cluster to a previous state in the event of a cluster upgrade failure. To create an etcd backup for your cluster, follow the guidelines provided in the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) section.
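
For reference, taking an etcd snapshot typically comes down to an `etcdctl snapshot save` invocation similar to the sketch below; the endpoint and certificate paths shown are assumptions for illustration only, so follow the linked guide for the exact procedure for your provider.

```bash
# Minimal sketch of taking an etcd snapshot on an (external) etcd machine.
# The endpoint and certificate paths are illustrative assumptions; use the
# values documented for your cluster in the linked guide.
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/server.crt \
  --key=/etc/etcd/pki/server.key
```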


## Cluster API backup

Since cluster failures primarily occur following unsuccessful cluster upgrades, EKS Anywhere takes the proactive step of automatically creating backups of the Cluster API objects that capture the states of both the management cluster and its workload clusters. These backups are stored in the management cluster folder on the Admin machine from which the upgrade command is run, and are generated before each management cluster upgrade. For example, after executing a cluster upgrade command on `mgmt-cluster`, a backup folder is generated with the naming convention `cluster-state-backup-${timestamp}`:

```bash
mgmt-cluster/
├── cluster-state-backup-2023-10-11T02_55_56 <------ Folder with a backup of the CAPI objects
├── mgmt-cluster-eks-a-cluster.kubeconfig
├── mgmt-cluster-eks-a-cluster.yaml
└── generated
```

Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:

```bash
MGMT_CLUSTER="mgmt"
MGMT_CLUSTER_KUBECONFIG=${MGMT_CLUSTER}/${MGMT_CLUSTER}-eks-a-cluster.kubeconfig
BACKUP_DIRECTORY=backup-mgmt

# Substitute the EKS Anywhere release version with whatever CLI version you are using
EKSA_RELEASE_VERSION=v0.17.3
BUNDLE_MANIFEST_URL=$(curl -s https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")


docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
    --namespace eksa-system \
    --kubeconfig $MGMT_CLUSTER_KUBECONFIG \
    --to-directory ${BACKUP_DIRECTORY}
```
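
Once the `clusterctl move` command finishes, the backup directory should contain the exported Cluster API objects as YAML manifests. A quick sanity check could be as simple as listing the directory (file names will vary by cluster):

```bash
# The directory should contain YAML manifests for the moved Cluster API objects
ls ${BACKUP_DIRECTORY}
```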

---

Always backup your EKS Anywhere cluster.

## Restore a management cluster

As an EKS Anywhere management cluster contains the management components of itself plus all the workload clusters it manages, the restoration process can be more complicated than just restoring all the objects from the etcd backup. To be more specific, all the core EKS Anywhere and Cluster API custom resources that manage the lifecycle (provisioning, upgrading, operating, etc.) of the management cluster and its workload clusters are stored in the management cluster. This includes all the supporting infrastructure, like virtual machines, networks and load balancers. For example, after a failed cluster upgrade, the infrastructure components can change after the etcd backup was taken. Since the backup does not contain the new state of the half-upgraded cluster, simply restoring it can create virtual machine UUID and IP mismatches, rendering EKS Anywhere incapable of healing the cluster.

Depending on whether the infrastructure components have changed since the etcd backup was taken (e.g. machines were rolled out and recreated, or new IP addresses were assigned to the machines), a different strategy needs to be applied in order to restore the management cluster.
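
If the cluster is still reachable, one way to judge which situation you are in is to compare the current Cluster API machines against the state you recorded when the backup was taken. A minimal sketch, assuming `MGMT_CLUSTER_KUBECONFIG` points at the management cluster kubeconfig:

```bash
# List the current machines (names, node names, provider IDs) in the eksa-system
# namespace and compare them against the state captured at backup time.
kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide \
  --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}
```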

### Cluster accessible and the infrastructure components not changed after etcd backup was taken

If the management cluster is still accessible through the API server, and the underlying infrastructure layer (nodes, machines, VMs, etc.) has not changed since the etcd backup was taken, simply follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) to restore the management cluster itself from the backup.

{{% alert title="Warning" color="warning" %}}

Do not apply the etcd restore unless you are certain that the infrastructure layer has not changed since the etcd backup was taken, i.e. the nodes, machines, VMs and their assigned IPs need to be exactly the same as when the backup was taken.

{{% /alert %}}

### Cluster not accessible or infrastructure components changed after etcd backup was taken

```bash
CLUSTER_STATE_BACKUP_LATEST=$(ls -Art ${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD} | grep 'cluster-state-backup' | tail -1)
CLUSTER_STATE_BACKUP_LATEST_PATH=${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD}/${CLUSTER_STATE_BACKUP_LATEST}/
# Substitute the EKS Anywhere release version with whatever CLI version you are using
EKSA_RELEASE_VERSION=v0.17.3
BUNDLE_MANIFEST_URL=$(curl -s https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")
# The clusterctl move command needs to be executed for each workload cluster.
# It will only move the workload cluster resources from the EKS Anywhere backup to the new management cluster.
# If you have multiple workload clusters, you have to run the command for each cluster as shown below.
# Move workload cluster w01 resources to the new management cluster mgmt-new
docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
    --namespace eksa-system \
    --filter-cluster ${WORKLOAD_CLUSTER_1} \
    --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
    --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
# Move workload cluster w02 resources to the new management cluster mgmt-new
docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
    --namespace eksa-system \
    --filter-cluster ${WORKLOAD_CLUSTER_2} \
    --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
    --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```
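
After the moves complete, a quick way to confirm that the workload clusters are now managed by the new management cluster is to list the Cluster API cluster objects against it; a minimal sketch using the same variables as above:

```bash
# The workload clusters should now appear under the new management cluster
kubectl get clusters.cluster.x-k8s.io -n eksa-system \
  --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```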

## Restore a workload cluster

### Cluster accessible and the infrastructure components not changed after etcd backup was taken

Similar to a failed management cluster whose infrastructure components have not changed, follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) to restore the workload cluster itself from the backup.

{{% alert title="Warning" color="warning" %}}

Do not apply the etcd restore unless you are certain that the infrastructure layer has not changed since the etcd backup was taken, i.e. the nodes, machines, VMs and their assigned IPs need to be exactly the same as when the backup was taken.

{{% /alert %}}

### Cluster not accessible or infrastructure components changed after etcd backup was taken

During a workload cluster upgrade, if all the control plane nodes get rolled out but the upgrade fails during the worker node upgrade, a simple etcd restore will not work, since doing a restore would cause the node names, IPs and potentially other infrastructure configurations to revert to a state that is no longer valid. Similarly, when the workload cluster is completely inaccessible, restoring etcd in a newly created workload cluster will not work due to the mismatch between the new and old clusters' node spec.
Restoring etcd effectively takes a cluster back in time, and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components like the Kubernetes controller managers, the EKS Anywhere cluster controller manager, and the Cluster API controller managers. An etcd restore is only suitable if you lose only your etcd cluster and want to recover your data, or want to revert your own deployments to a previous state while nothing else in the infrastructure layer (the nodes specifically) has changed.
Therefore, under this extreme circumstance, you may need to manually update the CAPI infrastructure objects, such as the infrastructure VMs and machines, to use the existing or latest configurations in order to bring the workload cluster back to a healthy state.
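
The exact objects to adjust depend on your infrastructure provider, but the general pattern is to inspect the Cluster API machine objects (and the provider-specific infrastructure objects they reference) in the `eksa-system` namespace and reconcile any fields that no longer match reality. A minimal, provider-agnostic sketch, assuming `MGMT_CLUSTER_KUBECONFIG` is your management cluster kubeconfig and `<machine-name>` is a placeholder:

```bash
# List the Cluster API machines backing the workload cluster
kubectl get machines.cluster.x-k8s.io -n eksa-system --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}

# Inspect a specific machine to see its infrastructure reference and reported
# addresses, then patch or edit the mismatched fields for your provider
kubectl describe machines.cluster.x-k8s.io <machine-name> -n eksa-system \
  --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}
```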

---

This page contains steps for backing up a cluster by taking an ETCD snapshot, and restoring the cluster from a snapshot.

## Use case

EKS-Anywhere clusters use ETCD as the backing store. Taking a snapshot of ETCD backs up the entire cluster data. This can later be used to restore a cluster back to an earlier state if required.

ETCD backups can be taken prior to a cluster upgrade, so if the upgrade does not go as planned, you can restore the cluster from the backup.
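
If you keep snapshot files around for this purpose, it can be worth verifying them when they are taken. As a sketch (the file path is an assumption), `etcdctl` can report basic integrity information about a snapshot; newer etcd releases expose the same subcommand under `etcdutl`:

```bash
# Print the hash, revision, total keys and size of a previously saved snapshot
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db --write-out=table
```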

{{% alert title="Important" color="warning" %}}

Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. It should be considered only when all other options have been exhausted.

If you are able to retrieve data using the Kubernetes API server, then etcd is available and you should not restore using an etcd backup; a quick check is sketched after this note.

{{% /alert %}}
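
As a quick sketch of such a check, assuming `CLUSTER_KUBECONFIG` points at the kubeconfig of the cluster in question:

```bash
# If these calls succeed, the API server can still read from etcd and a restore
# is likely unnecessary. CLUSTER_KUBECONFIG is a placeholder for your kubeconfig.
kubectl get nodes --kubeconfig ${CLUSTER_KUBECONFIG}
kubectl get --raw='/readyz?verbose' --kubeconfig ${CLUSTER_KUBECONFIG}
```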
