Changing NodePool configuration should do a rolling replacement #255
Comments
Interesting. Do you know if it is possible to safely recreate the node pool manually? What steps would you take to do so, e.g. using the GCP console/CLI? If this can be fixed, it's going to require changes in the upstream TF provider.
@lblackstone Do you believe this behavior is still true or can this be closed?
I don't know. Somebody should test to confirm before closing the issue.
I can't confirm that the diagnosis in the description is accurate, but yes: replacing a node pool doesn't correctly wait for workloads to transfer, whether it's done from Pulumi or via the Console.
Thank you for raising this issue. After reviewing Google's official documentation and information provided by the upstream provider, it looks like this is the expected behavior: GKE controls the node drain and gracefully reschedules workloads onto other available nodes before node deletion, and this applies regardless of whether you're managing the infrastructure via the Google Cloud Console, the gcloud CLI, or Pulumi.
Recommended next steps: if you believe you are experiencing behavior that deviates from Google's expected node draining process, we recommend reviewing your PodDisruptionBudgets (PDBs), cluster capacity, and preStop hooks.
We will now close this issue as it appears to be working as intended. Feel free to reopen it if there is anything more to address. Thank you for your time and for being part of the Pulumi community! Best regards,
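To illustrate the PDB suggestion above, here is a minimal sketch using the `@pulumi/kubernetes` TypeScript SDK. The `istio-system` namespace and the `app: istiod` label selector are assumptions about the affected workload, not values from this issue; adjust them to match your actual istio installation.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical PodDisruptionBudget for the istio control plane.
// minAvailable: 1 asks Kubernetes to keep at least one matching pod running
// during voluntary disruptions such as a node drain.
const istiodPdb = new k8s.policy.v1.PodDisruptionBudget("istiod-pdb", {
    metadata: { namespace: "istio-system" },
    spec: {
        minAvailable: 1,
        selector: { matchLabels: { app: "istiod" } },
    },
});
```

Note that a PDB only protects against drains that respect eviction; if the node pool deletion bypasses PDBs, as described in the issue below, the budget will not prevent the outage on its own.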
Currently, when changing a `NodePool` configuration in a way that requires the `NodePool` to be recreated, such as changing the machine type from `n1-standard-1` to `n1-standard-4` for example, the `NodePool` is recreated in a dangerous manner.

What currently happens is that Pulumi creates the new `NodePool` and then tells Google to remove the old one. However, Google appears to handle this by bypassing PDBs: all nodes in the pool are cordoned off and then force drained (ignoring PDBs). This is extremely problematic because it removes the insurance that some pods stay running. For example, when we made this change, our istio installation went down, which meant we were entirely down until istio was rescheduled and started.

In our case, we also realized that we had our `minNodeCount` set too small, so when the new `NodePool` was created, its nodes could not handle the number of pods and it relied on autoscaling to add more nodes. As a result, our downtime was even longer.