
🌱 Foreground deletion for machine deployments #10791

Conversation

@Dhairya-Arora01 (Contributor) commented Jun 24, 2024

What this PR does / why we need it: This PR enables foreground deletion of MachineDeployments.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #10710

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), do-not-merge/needs-area (PR is missing an area label), and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) labels on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

Hi @Dhairya-Arora01. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) label on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Dhairya-Arora01 (Contributor, Author) commented

/area machinedeployment

@k8s-ci-robot added the area/machinedeployment (Issues or PRs related to machinedeployments) label and removed the do-not-merge/needs-area (PR is missing an area label) label on Jun 24, 2024
@Dhairya-Arora01 (Contributor, Author) commented

I will do the same for topology, if you confirm this approach is okay.

@killianmuldoon (Contributor) left a comment

/ok-to-test

@k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) label and removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

@Dhairya-Arora01: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • pull-cluster-api-test-main (commit 0dc1cb0, required): /test pull-cluster-api-test-main
  • pull-cluster-api-e2e-blocking-main (commit 0dc1cb0, required): /test pull-cluster-api-e2e-blocking-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

}

// else delete owned machinesets.
log.Info("MachineDeployment still has owned MachineSets, deleting them first")
Member

In the Machine controller we log how many descendants are still in flight, could we do something similar here?

@sbueringer (Member) commented Aug 27, 2024

Yes, please!

I think here we should also list the MachineSets that still exist by name (see cluster_controller.go "Cluster still has descendants ..." log line)
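
For illustration, a minimal sketch of the kind of log line being discussed, modelled on the "Cluster still has descendants ..." message in cluster_controller.go. It assumes an msList slice of the MachineSets that still exist plus the log and strings identifiers from the surrounding controller; it is not code from this PR.

	// Sketch only: log how many MachineSets still exist and list them by name.
	names := make([]string, 0, len(msList))
	for _, ms := range msList {
		names = append(names, ms.Name)
	}
	log.Info("MachineDeployment still has owned MachineSets, deleting them first",
		"count", len(msList), "MachineSets", strings.Join(names, ", "))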

Member

@vincepri Did you mean in the Cluster controller?

Comment on lines +387 to +399
// Attempt to adopt machine if it meets previous conditions and it has no controller references.
if metav1.GetControllerOf(machine) == nil {
if err := r.adoptOrphan(ctx, machineSet, machine); err != nil {
log.Error(err, "Failed to adopt Machine")
r.recorder.Eventf(machineSet, corev1.EventTypeWarning, "FailedAdopt", "Failed to adopt Machine %q: %v", machine.Name, err)
continue
}
log.Info("Adopted Machine")
r.recorder.Eventf(machineSet, corev1.EventTypeNormal, "SuccessfulAdopt", "Adopted Machine %q", machine.Name)
}

filteredMachines = append(filteredMachines, machine)
}
Member

Should we do this outside? Otherwise we're not retrieving and adopting the machines at the same time

cc @sbueringer

Member

@vincepri Do you mean "Otherwise we are retrieving and adopting the machines at the same time"?

I'm not sure if adopting here as well is a bad thing (we also now do the same in reconcileDelete in the MD controller).

Basically it allows us to also fixup ownerRef chains during reconcileDelete, which maybe helps to cover some edge cases?

Member

Yes I meant we're doing both, the trigger for this comment was that getMachinesForMachineSet doesn't mention we might also adopt them

Member

@vincepri So what should we do? Adopt here as well or not?

I think adopting here as well might be better

Member

Should we rename the function (also in the md controller) to getAndAdoptMachinesForMachineSet?

Member

@chrischdi Sounds good

Member

(in both controllers IIRC)

@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
log := ctrl.LoggerFrom(ctx)
msList, err := r.getMachineSetsForDeployment(ctx, md)
Member

We might use this chance to explore using controller-runtime's built-in DeleteAllOf (https://github.com/kubernetes-sigs/controller-runtime/blob/162a113134deee49b2c93abd9e35211dfe7783e6/pkg/client/interfaces.go#L79-L80), thoughts?

Member

Same for ms/machines

Member

That might require a bigger refactor, so it's fine if we want to reconsider it in a follow-up.

Member

I'm not sure there is an option we can use that covers exactly the MachineSets we want to delete.

I think currently we delete all MachineSets that:

  • match the MD selector (except if the selector is empty, but we can just skip deletion entirely in that case)
  • have no deletionTimestamp (although all MachineSets probably have a deletionTimestamp after the first reconcileDelete)

But then we also don't really have that many MachineSets per MD.

Maybe it's better to precisely delete exactly the MachineSets we want (with corresponding log lines) than to use DeleteAllOf, where only the server knows which MachineSets were deleted?
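
For illustration, a minimal sketch of the DeleteAllOf alternative raised in this thread, as it could appear inside reconcileDelete. It assumes md.Spec.Selector.MatchLabels is non-empty (with an empty selector, deletion would have to be skipped entirely, as noted above), a Client field on the Reconciler, and the client/errors packages already used by the controller; it trades the explicit per-MachineSet log lines for a single server-side call.

	// Sketch only: delete every MachineSet in the MachineDeployment's namespace that
	// matches its label selector in one server-side call.
	if err := r.Client.DeleteAllOf(ctx, &clusterv1.MachineSet{},
		client.InNamespace(md.Namespace),
		client.MatchingLabels(md.Spec.Selector.MatchLabels),
	); err != nil {
		return ctrl.Result{}, errors.Wrap(err, "failed to delete MachineSets")
	}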

@vincepri (Member) commented

What's the status of this PR?

@sbueringer (Member) commented

(Will get around to reviewing it soon, I'm aware I got mentioned in a few places :))


// MachineDeploymentFinalizer is the finalizer used by the MachineDeployment controller to
// cleanup the MachineDeployment descendant MachineSets when a MachineDeployment is being deleted.
MachineDeploymentFinalizer = "machinedeployment.cluster.x-k8s.io"
Member

We should start using finalizer names that follow the conventions (see #10914 for context)

@JoelSpeed WDYT would cluster.x-k8s.io/machinedeployment be the correct finalizer name here?

Contributor

From looking at the docs and validation, I believe that would be appropriate.

I don't think we ever agreed on a pattern in the linked issue, but if we are adding this now, we should make sure to cement a pattern.

@sbueringer (Member) commented Aug 27, 2024

Yup, agree. Now would be the time to figure out the correct pattern. apiGroup/<lower-case kind> for the finalizer of our "main" controller reconciling that object seems like something that would work (note: the controller name of our "main" controller is also the lower-case kind).

(I'm saying "main" controller because in a few cases we have two controllers and the additional controllers are named like this: "topology/machineset", "topology/machinedeployment", "topology/cluster")

@fabriziopandini @vincepri @chrischdi @enxebre opinions?

Member

the pattern suggested above makes sense to me
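
For illustration, this is what the proposed apiGroup/<lower-case kind> pattern would look like for this constant; the value matches the cluster.x-k8s.io/machinedeployment name floated above and is part of the discussion, not something merged in this PR.

	// MachineDeploymentFinalizer, following the proposed apiGroup/<lower-case kind>
	// convention (sketch of the pattern under discussion).
	MachineDeploymentFinalizer = "cluster.x-k8s.io/machinedeployment"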



}
}

return ctrl.Result{RequeueAfter: deleteRequeueAfter}, nil
Member

I think with these watches we don't need a requeue?

		Owns(&clusterv1.MachineSet{}).
		// Watches enqueues MachineDeployment for corresponding MachineSet resources, if no managed controller reference (owner) exists.
		Watches(
			&clusterv1.MachineSet{},
			handler.EnqueueRequestsFromMapFunc(r.MachineSetToDeployments),
		).

(the first one covers the case where an ownerRef is already set, the second one everything else)
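
For illustration, a minimal sketch of what dropping the requeue would look like, relying on the Owns/Watches configuration quoted above to re-trigger reconciliation when the owned MachineSets change; this describes the suggestion, not what the PR currently does.

	// Sketch only: with the MachineSet watches in place, deletion progress re-triggers
	// reconciliation, so reconcileDelete can return without a RequeueAfter.
	return ctrl.Result{}, nil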

@@ -73,6 +75,10 @@ var (
stateConfirmationInterval = 100 * time.Millisecond
)

// deleteRequeueAfter is how long to wait before checking again to see if the MachineDeployment
// still has owned MachineSets.
const deleteRequeueAfter = 5 * time.Second
Member

Same here, we are watching all Machines, so we should not need a requeue

@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
@sbueringer (Member) commented Aug 27, 2024

Let's please add unit tests for reconcileDelete (same for the MS controller)

+ also fix the currently failing e2e & unit tests

return ctrl.Result{}, err
}

// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
Member

Suggested change
// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
// If all the descendant machinesets are deleted, then remove the MachineSet's finalizer.


return ctrl.Result{}, nil
}

log.Info("MachineSet still has owned Machines, deleting them first")
Member

Let's do something similar to https://github.com/kubernetes-sigs/cluster-api/pull/10791/files#r1651944699 here.

I would just say: if we have more than 10 Machines, let's cut the list at 10 and add ", ...".
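
For illustration, a minimal sketch of the truncation being suggested, assuming a machines slice of the owned Machines plus the log and strings identifiers from the surrounding MachineSet controller; it is not code from this PR.

	// Sketch only: list at most 10 owned Machine names, and mark the cut with ", ...".
	names := make([]string, 0, len(machines))
	for _, m := range machines {
		if len(names) == 10 {
			names = append(names, "...")
			break
		}
		names = append(names, m.Name)
	}
	log.Info("MachineSet still has owned Machines, deleting them first",
		"count", len(machines), "Machines", strings.Join(names, ", "))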

@sbueringer (Member) commented

@Dhairya-Arora01 Do you have time to address the findings above?

@Dhairya-Arora01 (Contributor, Author) commented

@Dhairya-Arora01 Do you have time to address the findings above?

@sbueringer next week for sure!

@sbueringer (Member) commented

@Dhairya-Arora01 Please let me know if you don't find the time to work on this. Happy to take over if necessary.

We have a few things that we want to implement on top of this PR pretty soon.

@Dhairya-Arora01 (Contributor, Author) commented

Sorry @sbueringer ... you can take this then

@chrischdi (Member) commented

Thanks @Dhairya-Arora01 for kicking this off. Closing here in favour of:

/close

@k8s-ci-robot (Contributor) commented

@chrischdi: Closed this PR.

In response to this:

Thanks @Dhairya-Arora01 for kicking this off. Closing here in favour of:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi (Member) commented

Note: I tried to address all comments at #11174

Labels

  • area/machinedeployment (Issues or PRs related to machinedeployments)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)
Development

Successfully merging this pull request may close these issues.

Consider implementing "forced" MD foreground deletion
8 participants