Update cluster reboot nodes doc (#7047)

aws · Nov 18, 2023 · 1212f1d
1 parent bb00f67
commit 1212f1d
Showing 1 changed file with 68 additions and 12 deletions.
diff --git a/docs/content/en/docs/clustermgmt/cluster-rebootnode.md b/docs/content/en/docs/clustermgmt/cluster-rebootnode.md
@@ -17,25 +17,81 @@ Rebooting a cluster node as described here is good for all nodes, but is critica
 If it does go down while running the `boots` service, the Bottlerocket node will not be able to boot again until the `boots` service is restored on another machine. This is because Bottlerocket must get its address from a DHCP service.
 {{% /alert %}}
 
-1. Cordon the node so no further workloads are scheduled to run on it:
+1. On your admin machine, set the following environment variables for use in later steps:
+```bash
+export CLUSTER_NAME=mgmt
+export MGMT_KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
+```
 
+1. [Back up the cluster]({{< relref "/docs/clustermgmt/cluster-backup-restore/backup-cluster" >}}).
+
+    This ensures that an up-to-date cluster state is available for restoration if the cluster experiences issues or becomes unrecoverable during the reboot.
+
+1. Verify that the DHCP lease time is longer than the maintenance time, and that node IPs will be the same before and after maintenance.
+
+    This step is critical to ensuring that the cluster is healthy after the reboot. If IPs are not preserved across the reboot, the cluster may not be recoverable.
+
+    {{% alert title="Warning" color="warning" %}}
+If this cannot be verified, do not proceed any further.
+    {{% /alert %}}
+
+1. Pause the reconciliation of the cluster being shut down. 
+
+    This prevents the EKS Anywhere cluster controller from reconciling nodes that are down and trying to remediate them.
+
+    - Add the paused annotation to the EKS Anywhere cluster:
     ```bash
-    kubectl cordon <node-name>
+    kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused=true --kubeconfig=$MGMT_KUBECONFIG
     ```
 
-1. Drain the node of all current workloads:
+    **NOTE**: If you are using the vSphere provider, you must also set `cluster.spec.paused` to `true` on the CAPI cluster. For example:
+    ```bash
+    kubectl edit clusters.cluster.x-k8s.io -n eksa-system $CLUSTER_NAME --kubeconfig=$MGMT_KUBECONFIG
+    ```
+    Add the `paused: true` line under the `spec` section:
+    ```yaml
+    ...
+    spec:
+      paused: true
+    ```
+
+1. For every node in the cluster, perform the following steps, working through the nodes in this order: worker nodes, control plane nodes, and then etcd nodes.
+
+    1. Cordon the node so no further workloads are scheduled to run on it:
 
-   ```bash
-   kubectl drain <node-name>
-   ```
+        ```bash
+        kubectl cordon <node-name>
+        ```
 
-1. Shut down. Using the appropriate method for your provider, shut down the node.
+    1. Drain the node of all current workloads:
 
-1. Perform system maintenance or other task you need to do on the node and boot up the node.
+        ```bash
+        kubectl drain <node-name>
+        ```
 
-1. Uncordon the node so that it can begin receiving workloads again.
+    1. Using the appropriate method for your provider, shut down the node. 
 
-   ```bash
-   kubectl uncordon <node-name>
-   ```
 
+1. Perform system maintenance or any other tasks you need to do on each node. Then boot up the nodes in this order: etcd nodes, control plane nodes, and then worker nodes.
+
+1. Uncordon the nodes so that they can begin receiving workloads again.
+
+    ```bash
+    kubectl uncordon <node-name>
+    ```
+
+1. Remove the paused annotation from the EKS Anywhere cluster:
+    ```bash
+    kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused- --kubeconfig=$MGMT_KUBECONFIG
+    ```
+
+    **NOTE**: If you are using the vSphere provider, you must also set `cluster.spec.paused` to `false` on the CAPI cluster:
+    ```bash
+    kubectl edit clusters.cluster.x-k8s.io -n eksa-system $CLUSTER_NAME --kubeconfig=$MGMT_KUBECONFIG
+    ```
+    Set `paused` in the `spec` section to `false`:
+    ```yaml
+    ...
+    spec:
+      paused: false
+    ```