
Changing NodePool configuration should do a rolling replacement #255

Closed
nesl247 opened this issue Nov 20, 2019 · 5 comments
Labels
kind/enhancement (Improvements or new features), resolution/by-design (This issue won't be fixed because the functionality is working as designed)

Comments

@nesl247

nesl247 commented Nov 20, 2019

Currently, when changing a NodePool configuration in a way that requires the NodePool to be recreated, such as changing from an n1-standard-1 to an n1-standard-4, the NodePool is recreated in a dangerous manner.

What currently happens is that Pulumi creates the new NodePool and then tells Google to remove the old one. However, Google appears to handle that removal by cordoning all nodes in the old pool and then force-draining them, bypassing PDBs. This is extremely problematic because it removes the guarantee that some pods stay running. For example, when we've made this change, our istio installation has gone down, which means we are entirely down until istio is rescheduled and started.

In our case, we also realized that we had our minNodeCount set too low, so the new NodePool was created with too few nodes to handle the number of pods and relied on autoscaling to add more. Our downtime was even longer as a result.
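
For reference, the kind of change described above looks roughly like this in a Pulumi TypeScript program. This is a minimal sketch using @pulumi/gcp; the resource name, cluster name, location, and autoscaling bounds are placeholders, and only the machine types come from this report:

```typescript
import * as gcp from "@pulumi/gcp";

// Changing nodeConfig.machineType (e.g. n1-standard-1 -> n1-standard-4) forces
// the NodePool to be replaced: Pulumi creates the replacement pool first, then
// asks GKE to delete the old one, which cordons and drains its nodes.
const pool = new gcp.container.NodePool("primary-pool", {
    cluster: "my-cluster",    // placeholder: an existing GKE cluster
    location: "us-central1",  // placeholder location
    nodeConfig: {
        machineType: "n1-standard-4", // was n1-standard-1
    },
    autoscaling: {
        minNodeCount: 1, // a floor this low leaves the new pool undersized at first
        maxNodeCount: 10,
    },
});
```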

@pgavlin
Member

pgavlin commented Nov 20, 2019

Interesting. Do you know if it is possible to safely recreate the node pool manually? What steps would you take to do so, e.g. using the GCP console or CLI?

If this can be fixed, it's going to require changes in the upstream TF provider.

@mnlumi mnlumi added the kind/enhancement Improvements or new features label Jul 26, 2023
@mnlumi

mnlumi commented Jul 26, 2023

@lblackstone Do you believe this behavior is still true or can this be closed?

@lblackstone
Member

I don't know. Somebody should test to confirm before closing the issue.

@antdking

antdking commented Sep 8, 2023

I can't confirm that the diagnosis in the description is accurate, but yes: replacing a node pool doesn't correctly wait for workloads to transfer, whether it's done from Pulumi or via the Console.

@rshade rshade self-assigned this Dec 6, 2024
@rshade
Contributor

rshade commented Dec 10, 2024

Thank you for raising this issue. After reviewing Google's official documentation and information provided by the upstream provider, it looks like this is the expected behavior.
When a GKE node pool is deleted, Google Cloud does drain the nodes before they are deleted. According to the official GKE documentation:
"When you delete a node pool, GKE drains all the nodes in the node pool, deleting and rescheduling all Pods."

This behavior ensures that workloads are gracefully rescheduled onto other available nodes before node deletion. This behavior is controlled by GKE and is applied regardless of whether you're managing the infrastructure via Google Cloud Console, gcloud CLI, or Pulumi.

Recommended Next Steps:

  1. Adjust the pod disruption budget and set minAvailable to a lower number, then run pulumi up (see the sketch after this list).
  2. Set the configuration for the new NodePool, then run pulumi up.
  3. Wait for all the pods to transfer.
  4. Adjust the pod disruption budget back to the original settings, then run pulumi up.
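
A minimal sketch of these steps using @pulumi/kubernetes follows; the namespace, selector labels, and minAvailable values are placeholders, not taken from this issue:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Step 1: temporarily relax the PodDisruptionBudget so the GKE drain can make
// progress, then run `pulumi up`. Namespace and selector labels are placeholders.
const istioPdb = new k8s.policy.v1.PodDisruptionBudget("istiod-pdb", {
    metadata: { namespace: "istio-system" },
    spec: {
        minAvailable: 1, // temporarily lowered from the usual value
        selector: { matchLabels: { app: "istiod" } },
    },
});

// Step 2: change the NodePool configuration (e.g. nodeConfig.machineType) and
// run `pulumi up` again.
// Step 3: wait for the pods to reschedule onto the new pool.
// Step 4: restore minAvailable to its original value and run `pulumi up` once more.
```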

We will now close this issue as it appears to be working as intended. If you believe you are experiencing behavior that deviates from Google's expected node draining process, we recommend reviewing PDBs, cluster capacity, and preStop hooks. Feel free to reopen this issue if there is anything more to address.

Thank you for your time and for being part of the Pulumi community!

Best regards,
The Pulumi Team

@rshade rshade closed this as completed Dec 10, 2024
@rshade rshade added the resolution/by-design This issue won't be fixed because the functionality is working as designed label Dec 10, 2024