diff --git a/docs/content/en/docs/clustermgmt/cluster-rebootnode.md b/docs/content/en/docs/clustermgmt/cluster-rebootnode.md
index c389b5df05a3..97432322b985 100755
--- a/docs/content/en/docs/clustermgmt/cluster-rebootnode.md
+++ b/docs/content/en/docs/clustermgmt/cluster-rebootnode.md
@@ -17,25 +17,81 @@ Rebooting a cluster node as described here is good for all nodes, but is critica
 If it does go down while running the `boots` service, the Bottlerocket node will not be able to boot again until the `boots` service is restored on another machine. This is because Bottlerocket must get its address from a DHCP service.
 {{% /alert %}}
 
-1. Cordon the node so no further workloads are scheduled to run on it:
+1. On your admin machine, set the following environment variables, which are used in the steps below:
+
+   ```bash
+   export CLUSTER_NAME=mgmt
+   export MGMT_KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
+   ```
+
+1. [Backup cluster]({{< relref "/docs/clustermgmt/cluster-backup-restore/backup-cluster" >}})
+
+   This ensures that an up-to-date cluster state is available for restoration if the cluster experiences issues or becomes unrecoverable during the reboot.
+
+1. Verify that the DHCP lease time is longer than the planned maintenance window, and that node IPs will be the same before and after maintenance.
+
+   This step is critical to ensuring the cluster is healthy after the reboot. If IPs are not preserved across the reboot, the cluster may not be recoverable.
+
+   {{% alert title="Warning" color="warning" %}}
+   If this cannot be verified, do not proceed any further.
+   {{% /alert %}}
+
+1. Pause reconciliation of the cluster being shut down.
+
+   This ensures that the EKS Anywhere cluster controller does not reconcile the nodes that are down and try to remediate them.
+
+   Add the paused annotation to the EKSA and CAPI clusters:
 
    ```bash
-   kubectl cordon <node name>
+   kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused=true --kubeconfig=$MGMT_KUBECONFIG
    ```
 
-1. Drain the node of all current workloads:
+   **NOTE**: If you are using the vSphere provider, it is also necessary to set `cluster.spec.paused` to `true`. For example:
+
+   ```bash
+   kubectl edit clusters.cluster.x-k8s.io -n eksa-system $CLUSTER_NAME --kubeconfig=$MGMT_KUBECONFIG
+   ```
+
+   Add the `paused: true` line under the `spec` section:
+
+   ```yaml
+   ...
+   spec:
+     paused: true
+   ```
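+
+   For example, an optional check along these lines, using only the variables exported in the first step, can confirm the paused annotation was applied before you shut anything down:
+
+   ```bash
+   # Optional check: the anywhere.eks.amazonaws.com/paused annotation should appear in the output with value "true".
+   kubectl get clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME --kubeconfig=$MGMT_KUBECONFIG -o jsonpath='{.metadata.annotations}'
+   ```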
+
+1. For all of the nodes in the cluster, perform the following steps in this order: worker nodes first, then control plane nodes, then etcd nodes.
+
+   1. Cordon the node so no further workloads are scheduled to run on it:
 
-   ```bash
-   kubectl drain <node name>
-   ```
+      ```bash
+      kubectl cordon <node name>
+      ```
 
-1. Shut down. Using the appropriate method for your provider, shut down the node.
+   1. Drain the node of all current workloads:
 
-1. Perform system maintenance or other task you need to do on the node and boot up the node.
+      ```bash
+      kubectl drain <node name>
+      ```
 
-1. Uncordon the node so that it can begin receiving workloads again.
+   1. Using the appropriate method for your provider, shut down the node.
 
-   ```bash
-   kubectl uncordon <node name>
-   ```
+1. Perform system maintenance or any other tasks you need to do on each node, then boot the nodes back up in this order: etcd nodes first, then control plane nodes, then worker nodes.
+
+1. Uncordon the nodes so that they can begin receiving workloads again.
+
+   ```bash
+   kubectl uncordon <node name>
+   ```
+
+1. Remove the paused annotation from the EKS Anywhere cluster:
+
+   ```bash
+   kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused- --kubeconfig=$MGMT_KUBECONFIG
+   ```
+
+   **NOTE**: If you are using the vSphere provider, it is also necessary to set `cluster.spec.paused` back to `false`. For example:
+
+   ```bash
+   kubectl edit clusters.cluster.x-k8s.io -n eksa-system $CLUSTER_NAME --kubeconfig=$MGMT_KUBECONFIG
+   ```
+
+   Set `paused` under the `spec` section to `false`:
+
+   ```yaml
+   ...
+   spec:
+     paused: false
+   ```
\ No newline at end of file
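
Once the annotation is removed and reconciliation resumes, a quick sanity check along these lines can confirm the cluster recovered from the reboot; this is a minimal sketch that assumes the same `CLUSTER_NAME` and `MGMT_KUBECONFIG` variables exported in the first step:

```bash
# All nodes should report Ready, and all CAPI machines should report Running.
kubectl get nodes --kubeconfig=$MGMT_KUBECONFIG
kubectl get machines -n eksa-system --kubeconfig=$MGMT_KUBECONFIG
```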