🌱 Foreground deletion for machine deployments #10791
Conversation
Hi @Dhairya-Arora01. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
/area machinedeployment
Will do the same for topology, if you confirm this is okay.
/ok-to-test
@Dhairya-Arora01: The following tests failed; say `/retest` to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
}

// else delete owned machinesets.
log.Info("MachineDeployment still has owned MachineSets, deleting them first")
```
In the Machine controller we log how many descendants are still in flight; could we do something similar here?
Yes, please!
I think here we should also list the MachineSets that still exist by name (see cluster_controller.go "Cluster still has descendants ..." log line)
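A minimal sketch of collecting the still-existing MachineSet names for such a log line (the helper name and shape are assumptions, not code from the PR; sorting keeps the log output stable):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// machineSetNames is a hypothetical helper: it returns the names of the
// still-existing MachineSets as a sorted, comma-separated string, similar
// to the "Cluster still has descendants ..." log line in cluster_controller.go.
func machineSetNames(names []string) string {
	sorted := make([]string, len(names))
	copy(sorted, names)
	sort.Strings(sorted)
	return strings.Join(sorted, ", ")
}

func main() {
	// In the controller this would feed into something like:
	// log.Info("MachineDeployment still has owned MachineSets, deleting them first",
	//     "MachineSets", machineSetNames(names))
	fmt.Println(machineSetNames([]string{"md-1-ms-b", "md-1-ms-a"}))
}
```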
@vincepri Did you mean in the Cluster controller?
```go
	// Attempt to adopt machine if it meets previous conditions and it has no controller references.
	if metav1.GetControllerOf(machine) == nil {
		if err := r.adoptOrphan(ctx, machineSet, machine); err != nil {
			log.Error(err, "Failed to adopt Machine")
			r.recorder.Eventf(machineSet, corev1.EventTypeWarning, "FailedAdopt", "Failed to adopt Machine %q: %v", machine.Name, err)
			continue
		}
		log.Info("Adopted Machine")
		r.recorder.Eventf(machineSet, corev1.EventTypeNormal, "SuccessfulAdopt", "Adopted Machine %q", machine.Name)
	}

	filteredMachines = append(filteredMachines, machine)
}
```
Should we do this outside? Otherwise we're not retrieving and adopting the machines at the same time
cc @sbueringer
@vincepri Do you mean "Otherwise we are retrieving and adopting the machines at the same time"?
I'm not sure that adopting here as well is a bad thing (we now also do the same in reconcileDelete in the MD controller).
Basically it allows us to also fix up ownerRef chains during reconcileDelete, which maybe helps to cover some edge cases?
Yes, I meant we're doing both; the trigger for this comment was that getMachinesForMachineSet doesn't mention we might also adopt them.
@vincepri So what should we do? Adopt here as well or not?
I think adopting here as well might be better
Should we rename the function (also in the MD controller) to getAndAdoptMachinesForMachineSet?
@chrischdi Sounds good
(in both controllers IIRC)
```go
@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
	return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
	log := ctrl.LoggerFrom(ctx)
	msList, err := r.getMachineSetsForDeployment(ctx, md)
```
We might use this chance to explore using controller-runtime's built-in `DeleteAllOf` https://github.com/kubernetes-sigs/controller-runtime/blob/162a113134deee49b2c93abd9e35211dfe7783e6/pkg/client/interfaces.go#L79-L80 — thoughts?
Same for ms/machines
That might require a bigger refactor, so it's fine if we want to reconsider in a follow-up.
I'm not sure there is an option we can use that covers exactly the MachineSets we want to delete. I think currently we delete all MachineSets that:
- match the MD selector (except if the selector is empty, but we can just skip deletion entirely in that case)
- have no deletionTimestamp (although all MachineSets probably have the deletionTimestamp after the first reconcileDelete)

But then we also don't really have that many MachineSets per MD. Maybe it's better to precisely delete exactly the MachineSets we want (with corresponding log lines) vs. using a DeleteAllOf, where only the server knows which MachineSets are deleted?
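To illustrate the trade-off being discussed, here is a self-contained sketch — with a stand-in struct instead of the real clusterv1.MachineSet and hand-rolled label matching instead of the controller-runtime client, so everything here is illustrative: list-then-delete lets the controller log each deleted MachineSet by name, whereas with DeleteAllOf the selector is evaluated server-side and the controller never sees which objects matched.

```go
package main

import "fmt"

// machineSet is a stand-in for clusterv1.MachineSet; only the fields
// needed for this sketch are modeled.
type machineSet struct {
	name   string
	labels map[string]string
}

// matches reports whether ms carries all labels in selector, roughly how
// a label selector would be evaluated server-side.
func matches(ms machineSet, selector map[string]string) bool {
	for k, v := range selector {
		if ms.labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	sets := []machineSet{
		{name: "md-1-abc", labels: map[string]string{"md": "md-1"}},
		{name: "md-2-def", labels: map[string]string{"md": "md-2"}},
	}
	selector := map[string]string{"md": "md-1"}

	// List-then-delete: the controller sees each matching MachineSet
	// and can log its name before issuing the delete.
	for _, ms := range sets {
		if matches(ms, selector) {
			fmt.Printf("deleting MachineSet %s\n", ms.name)
		}
	}
	// With DeleteAllOf, the same selector would be sent to the API server
	// in a single call, and only the server would know which MachineSets
	// were actually deleted.
}
```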
What's the status of this PR?
(Will get around to reviewing it soon, I'm aware I got mentioned in a few places :))
```go
// MachineDeploymentFinalizer is the finalizer used by the MachineDeployment controller to
// cleanup the MachineDeployment descendant MachineSets when a MachineDeployment is being deleted.
MachineDeploymentFinalizer = "machinedeployment.cluster.x-k8s.io"
```
We should start using finalizer names that follow the conventions (see #10914 for context).
@JoelSpeed WDYT, would `cluster.x-k8s.io/machinedeployment` be the correct finalizer name here?
I believe that would be appropriate, looking at the docs and validation.
I don't think we ever agreed on a pattern in the linked issue, but if we are adding this now we should make sure to cement a pattern.
Yup, agree. Now would be the time to figure out the correct pattern. `apiGroup/<lower-case kind>` for the finalizer of our "main" controller reconciling that object seems like something that would work (note: the controller name of our "main" controller is also the lower-case kind).
(I'm saying "main" controller because in a few cases we have two controllers, and the additional controllers are named like this: "topology/machineset", "topology/machinedeployment", "topology/cluster".)
@fabriziopandini @vincepri @chrischdi @enxebre opinions?
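A tiny sketch of the proposed `apiGroup/<lower-case kind>` naming pattern. The helper function is purely illustrative (the real change would just be a string constant); only the resulting value matches the name proposed above.

```go
package main

import (
	"fmt"
	"strings"
)

// finalizerFor is a hypothetical helper demonstrating the proposed
// apiGroup/<lower-case kind> pattern for finalizer names.
func finalizerFor(apiGroup, kind string) string {
	return apiGroup + "/" + strings.ToLower(kind)
}

func main() {
	// The proposed replacement for "machinedeployment.cluster.x-k8s.io":
	fmt.Println(finalizerFor("cluster.x-k8s.io", "MachineDeployment"))
}
```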
the pattern suggested above makes sense to me
@vincepri WDYT (given your recent discussion in Slack: https://kubernetes.slack.com/archives/C0EG7JC6T/p1724853254310249?thread_ts=1721923881.095799&cid=C0EG7JC6T)
```go
	}
}

return ctrl.Result{RequeueAfter: deleteRequeueAfter}, nil
```
I think with these watches we don't need a requeue?

```go
Owns(&clusterv1.MachineSet{}).
// Watches enqueues MachineDeployment for corresponding MachineSet resources, if no managed controller reference (owner) exists.
Watches(
	&clusterv1.MachineSet{},
	handler.EnqueueRequestsFromMapFunc(r.MachineSetToDeployments),
).
```

(the first one covers the case where an ownerRef is already set, the second one everything else)
```go
@@ -73,6 +75,10 @@ var (
	stateConfirmationInterval = 100 * time.Millisecond
)

// deleteRequeueAfter is how long to wait before checking again to see if the MachineDeployment
// still has owned MachineSets.
const deleteRequeueAfter = 5 * time.Second
```
Same here: we are watching all Machines, so we should not need a requeue.
```go
@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
	return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
```
Let's please add unit tests for reconcileDelete (same for the MS controller), and also fix the currently failing e2e & unit tests.
```go
	return ctrl.Result{}, err
}

// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
```
Suggested change:

```diff
-// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
+// If all the descendant machinesets are deleted, then remove the MachineSet's finalizer.
```
```go
	return ctrl.Result{}, nil
}

log.Info("MachineSet still has owned Machines, deleting them first")
```
Let's do something similar as https://github.com/kubernetes-sigs/cluster-api/pull/10791/files#r1651944699 here.
I would just say: if we have more than 10 Machines, let's cut the list at 10 and append `, ...`
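A minimal sketch of the suggested truncation. The cutoff of 10 and the `, ...` suffix come from the comment above; the helper name is an assumption for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateNames joins up to max names and appends ", ..." when the list is
// longer, keeping the log line bounded no matter how many Machines remain.
func truncateNames(names []string, max int) string {
	if len(names) <= max {
		return strings.Join(names, ", ")
	}
	return strings.Join(names[:max], ", ") + ", ..."
}

func main() {
	// With a cutoff of 2 the third name is elided.
	fmt.Println(truncateNames([]string{"m1", "m2", "m3"}, 2))
}
```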
@Dhairya-Arora01 Do you have time to address the findings above?
@sbueringer next week for sure!
@Dhairya-Arora01 Please let me know if you don't find the time to work on this. Happy to take over if necessary. We have a few things that we want to implement on top of this PR pretty soon.
Sorry @sbueringer ... you can take this then
Thanks @Dhairya-Arora01 for kicking this off. Closing here in favour of:
/close
@chrischdi: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Note: I tried to address all comments at #11174 |
What this PR does / why we need it: This PR enables foreground deletion of MachineDeployments.
Which issue(s) this PR fixes (optional, in `fixes #<issue_number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #10710