
🌱 Foreground deletion for machine deployments #10791

Conversation

@Dhairya-Arora01 (Contributor) commented Jun 24, 2024

What this PR does / why we need it: This PR enables foreground deletion of MachineDeployments.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #10710

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), do-not-merge/needs-area (PR is missing an area label), and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) labels on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

Hi @Dhairya-Arora01. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) label on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Dhairya-Arora01 (Contributor, Author) commented

/area machinedeployment

@k8s-ci-robot added the area/machinedeployment (Issues or PRs related to machinedeployments) label and removed the do-not-merge/needs-area (PR is missing an area label) label on Jun 24, 2024
@Dhairya-Arora01 (Contributor, Author) commented

I will do the same for topology, if you confirm this approach is okay.

@killianmuldoon (Contributor) left a comment

/ok-to-test

@k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) label and removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label on Jun 24, 2024
@k8s-ci-robot (Contributor) commented

@Dhairya-Arora01: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • pull-cluster-api-test-main (commit 0dc1cb0, required): /test pull-cluster-api-test-main
  • pull-cluster-api-e2e-blocking-main (commit 0dc1cb0, required): /test pull-cluster-api-e2e-blocking-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

}

// else delete owned machinesets.
log.Info("MachineDeployment still has owned MachineSets, deleting them first")
Member

In the Machine controller we log how many descendants are still in flight, could we do something similar here?

@sbueringer (Member) commented Aug 27, 2024

Yes, please!

I think here we should also list the MachineSets that still exist by name (see cluster_controller.go "Cluster still has descendants ..." log line)
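
For illustration, a minimal sketch of the kind of log line being discussed, modelled on the "Cluster still has descendants ..." message in cluster_controller.go. It assumes an msList slice of the MachineSets that still exist plus the log and strings identifiers from the surrounding controller; it is not code from this PR.

	// Sketch only: log how many MachineSets still exist and list them by name.
	names := make([]string, 0, len(msList))
	for _, ms := range msList {
		names = append(names, ms.Name)
	}
	log.Info("MachineDeployment still has owned MachineSets, deleting them first",
		"count", len(msList), "MachineSets", strings.Join(names, ", "))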

Member

@vincepri Did you mean in the Cluster controller?

Comment on lines +387 to +399
// Attempt to adopt machine if it meets previous conditions and it has no controller references.
if metav1.GetControllerOf(machine) == nil {
if err := r.adoptOrphan(ctx, machineSet, machine); err != nil {
log.Error(err, "Failed to adopt Machine")
r.recorder.Eventf(machineSet, corev1.EventTypeWarning, "FailedAdopt", "Failed to adopt Machine %q: %v", machine.Name, err)
continue
}
log.Info("Adopted Machine")
r.recorder.Eventf(machineSet, corev1.EventTypeNormal, "SuccessfulAdopt", "Adopted Machine %q", machine.Name)
}

filteredMachines = append(filteredMachines, machine)
}
Member

Should we do this outside? Otherwise we're not retrieving and adopting the machines at the same time

cc @sbueringer

Member

@vincepri Do you mean "Otherwise we are retrieving and adopting the machines at the same time"?

I'm not sure if adopting here as well is a bad thing (we also now do the same in reconcileDelete in the MD controller).

Basically it allows us to also fixup ownerRef chains during reconcileDelete, which maybe helps to cover some edge cases?

Member

Yes I meant we're doing both, the trigger for this comment was that getMachinesForMachineSet doesn't mention we might also adopt them

Member

@vincepri So what should we do? Adopt here as well or not?

I think adopting here as well might be better

Member

Should we rename the function (also in the md controller) to getAndAdoptMachinesForMachineSet?

Member

@chrischdi Sounds good

Member

(in both controllers IIRC)

@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
log := ctrl.LoggerFrom(ctx)
msList, err := r.getMachineSetsForDeployment(ctx, md)
Member

We might use this chance to explore using controller-runtime's built-in DeleteAllOf (https://github.com/kubernetes-sigs/controller-runtime/blob/162a113134deee49b2c93abd9e35211dfe7783e6/pkg/client/interfaces.go#L79-L80), thoughts?

Member

Same for ms/machines

Member

That might require a bigger refactor, so it's fine if we want to reconsider it in a follow-up.

Member

I'm not sure there is an option we can use that covers exactly the MachineSets we want to delete.

I think currently we delete all MachineSets that:

  • match the MD selector (except if the selector is empty, but we can just skip deletion entirely in that case)
  • have no deletionTimestamp (although all MachineSets probably have a deletionTimestamp after the first reconcileDelete)

But then we also don't really have that many MachineSets per MD.

Maybe it's better to precisely delete exactly the MachineSets we want (with corresponding log lines) than to use DeleteAllOf, where only the server knows which MachineSets were deleted?
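
For illustration, a minimal sketch of the DeleteAllOf alternative raised in this thread, as it could appear inside reconcileDelete. It assumes md.Spec.Selector.MatchLabels is non-empty (with an empty selector, deletion would have to be skipped entirely, as noted above), a Client field on the Reconciler, and the client/errors packages already used by the controller; it trades the explicit per-MachineSet log lines for a single server-side call.

	// Sketch only: delete every MachineSet in the MachineDeployment's namespace that
	// matches its label selector in one server-side call.
	if err := r.Client.DeleteAllOf(ctx, &clusterv1.MachineSet{},
		client.InNamespace(md.Namespace),
		client.MatchingLabels(md.Spec.Selector.MatchLabels),
	); err != nil {
		return ctrl.Result{}, errors.Wrap(err, "failed to delete MachineSets")
	}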

@vincepri (Member) commented

What's the status of this PR?

@sbueringer (Member) commented

(Will get around to reviewing it soon, I'm aware I got mentioned in a few places :))


// MachineDeploymentFinalizer is the finalizer used by the MachineDeployment controller to
// cleanup the MachineDeployment descendant MachineSets when a MachineDeployment is being deleted.
MachineDeploymentFinalizer = "machinedeployment.cluster.x-k8s.io"
Member

We should start using finalizer names that follow the conventions (see #10914 for context)

@JoelSpeed WDYT would cluster.x-k8s.io/machinedeployment be the correct finalizer name here?

Contributor

From looking at the docs and validation, I believe that would be appropriate.

I don't think we ever agreed on a pattern in the linked issue, but if we are adding this now, we should make sure to cement a pattern.

@sbueringer (Member) commented Aug 27, 2024

Yup, agree. Now would be the time to figure out the correct pattern. apiGroup/<lower-case kind> for the finalizer of our "main" controller reconciling that object seems like something that would work (note: the controller name of our "main" controller is also the lower-case kind).

(I'm saying "main" controller because in a few cases we have two controllers and the additional controllers are named like this: "topology/machineset", "topology/machinedeployment", "topology/cluster")

@fabriziopandini @vincepri @chrischdi @enxebre opinions?

Member

the pattern suggested above makes sense to me
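
For illustration, this is what the proposed apiGroup/<lower-case kind> pattern would look like for this constant; the value matches the cluster.x-k8s.io/machinedeployment name floated above and is part of the discussion, not something merged in this PR.

	// MachineDeploymentFinalizer, following the proposed apiGroup/<lower-case kind>
	// convention (sketch of the pattern under discussion).
	MachineDeploymentFinalizer = "cluster.x-k8s.io/machinedeployment"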



}
}

return ctrl.Result{RequeueAfter: deleteRequeueAfter}, nil
Member

I think with these watches we don't need a requeue?

		Owns(&clusterv1.MachineSet{}).
		// Watches enqueues MachineDeployment for corresponding MachineSet resources, if no managed controller reference (owner) exists.
		Watches(
			&clusterv1.MachineSet{},
			handler.EnqueueRequestsFromMapFunc(r.MachineSetToDeployments),
		).

(the first one covers the case where an ownerRef is already set, the second one everything else)
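
For illustration, a minimal sketch of what dropping the requeue would look like, relying on the Owns/Watches configuration quoted above to re-trigger reconciliation when the owned MachineSets change; this describes the suggestion, not what the PR currently does.

	// Sketch only: with the MachineSet watches in place, deletion progress re-triggers
	// reconciliation, so reconcileDelete can return without a RequeueAfter.
	return ctrl.Result{}, nil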

@@ -73,6 +75,10 @@ var (
stateConfirmationInterval = 100 * time.Millisecond
)

// deleteRequeueAfter is how long to wait before checking again to see if the MachineDeployment
// still has owned MachineSets.
const deleteRequeueAfter = 5 * time.Second
Member

Same here, we are watching all Machines, so we should not need a requeue

@@ -282,6 +295,33 @@ func (r *Reconciler) reconcile(ctx context.Context, cluster *clusterv1.Cluster,
return errors.Errorf("unexpected deployment strategy type: %s", md.Spec.Strategy.Type)
}

func (r *Reconciler) reconcileDelete(ctx context.Context, md *clusterv1.MachineDeployment) (reconcile.Result, error) {
@sbueringer (Member) commented Aug 27, 2024

Let's please add unit tests for reconcileDelete (same for the MS controller)

+ also fix the currently failing e2e & unit tests

return ctrl.Result{}, err
}

// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
Member

Suggested change
// If all the descendant machinesets are deleted, then remove the machinedeployment's finalizer.
// If all the descendant machinesets are deleted, then remove the MachineSet's finalizer.


return ctrl.Result{}, nil
}

log.Info("MachineSet still has owned Machines, deleting them first")
Member

Let's do something similar to https://github.com/kubernetes-sigs/cluster-api/pull/10791/files#r1651944699 here.

I would just say: if we have more than 10 Machines, let's cut the list at 10 and add ", ...".
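
For illustration, a minimal sketch of the truncation being suggested, assuming a machines slice of the owned Machines plus the log and strings identifiers from the surrounding MachineSet controller; it is not code from this PR.

	// Sketch only: list at most 10 owned Machine names, and mark the cut with ", ...".
	names := make([]string, 0, len(machines))
	for _, m := range machines {
		if len(names) == 10 {
			names = append(names, "...")
			break
		}
		names = append(names, m.Name)
	}
	log.Info("MachineSet still has owned Machines, deleting them first",
		"count", len(machines), "Machines", strings.Join(names, ", "))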

@sbueringer (Member) commented

@Dhairya-Arora01 Do you have time to address the findings above?

@Dhairya-Arora01 (Contributor, Author) commented

@Dhairya-Arora01 Do you have time to address the findings above?

@sbueringer next week for sure!

@sbueringer (Member) commented

@Dhairya-Arora01 Please let me know if you don't find the time to work on this. Happy to take over if necessary.

We have a few things that we want to implement on top of this PR pretty soon.

@Dhairya-Arora01 (Contributor, Author) commented

Sorry @sbueringer ... you can take this then

@chrischdi (Member) commented

Thanks @Dhairya-Arora01 for kicking this off. Closing here in favour of:

/close

@k8s-ci-robot (Contributor) commented

@chrischdi: Closed this PR.

In response to this:

Thanks @Dhairya-Arora01 for kicking this off. Closing here in favour of:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi (Member) commented

Note: I tried to address all comments at #11174

Labels

  • area/machinedeployment (Issues or PRs related to machinedeployments)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)
Development

Successfully merging this pull request may close these issues.

Consider implementing "forced" MD foreground deletion
8 participants