
✨ machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition #11166

Open · wants to merge 7 commits into base: main
Conversation

chrischdi
Member

What this PR does / why we need it:

  • Introduces a struct at Machine.status.deletion
  • The struct contains two timestamps which get populated during machine deletion:
    • nodeDrainStartTime
    • nodeVolumeDetachStartTime

Both are used to track when the respective phase started, and to check whether the corresponding timeout has been reached, instead of relying on the clusterv1.Condition's lastTransitionTime.
This gets rid of the issue that the lastTransitionTime gets updated whenever the condition changes, and it prepares for the changes planned for v1beta2, where these conditions won't have a lastTransitionTime anymore.
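
A minimal sketch of the shape this introduces, assuming the field names from the description above and a timeout modeled as *metav1.Duration like the existing NodeDrainTimeout; the types are simplified and the check is illustrative rather than the actual controller code:

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineStatusDeletion records when the individual deletion phases started.
// Field names follow the PR description; the exact naming evolved during review.
type MachineStatusDeletion struct {
	// nodeDrainStartTime is set when draining of the node is started.
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// nodeVolumeDetachStartTime is set when waiting for volume detachment is started.
	// +optional
	NodeVolumeDetachStartTime *metav1.Time `json:"nodeVolumeDetachStartTime,omitempty"`
}

// drainTimeoutExceeded compares against the recorded start time instead of a
// condition's lastTransitionTime.
func drainTimeoutExceeded(deletion *MachineStatusDeletion, nodeDrainTimeout *metav1.Duration) bool {
	if nodeDrainTimeout == nil || nodeDrainTimeout.Duration == 0 {
		return false // no timeout configured: wait indefinitely
	}
	if deletion == nil || deletion.NodeDrainStartTime == nil {
		return false // draining has not started yet
	}
	return time.Since(deletion.NodeDrainStartTime.Time) > nodeDrainTimeout.Duration
}
```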

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11126

/area machine

@chrischdi chrischdi added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 10, 2024
@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 10, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from chrischdi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 10, 2024
@chrischdi
Member Author

/test help

@k8s-ci-robot
Contributor

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Member Author

/test pull-cluster-api-e2e-main

@chrischdi chrischdi changed the title 🌱 machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition 🌱 [WIP] machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition Sep 10, 2024
@chrischdi
Member Author

/test pull-cluster-api-e2e-main

@chrischdi
Member Author

/retest

@chrischdi
Member Author

/retest

different flake

@chrischdi chrischdi changed the title 🌱 [WIP] machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition 🌱 machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition Sep 18, 2024
@chrischdi
Member Author

/assign sbueringer fabriziopandini

@chrischdi
Member Author

/test help

@k8s-ci-robot
Contributor

@chrischdi: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-31-1-32-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chrischdi
Member Author

/test pull-cluster-api-e2e-main

@fabriziopandini (Member) left a comment

Nice, this also qualifies as a backport candidate IMO.
Also changed the PR type to give it some more visibility.

/lgtm
cc @vincepri @JoelSpeed for the API changes

api/v1beta1/machine_types.go (resolved review thread)
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 5d3b081c89d004ac999478da2bc5bf7fc9f47730

@fabriziopandini fabriziopandini changed the title 🌱 machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition ✨ machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition Sep 18, 2024
@JoelSpeed (Contributor) left a comment

I'm slightly confused by the premise of this PR given my understanding of @fabriziopandini's recent update to status.

A condition should be a perfectly suitable way to represent the drain state, but you mention there have been issues, do we have any of those written up that you can link to?

Both are used to track when the respective phase started, and to check whether the corresponding timeout has been reached, instead of relying on the clusterv1.Condition's lastTransitionTime.

LastTransitionTime should only change when the status flips from true to false or vice versa; does that cause an issue here? Has it been behaving that way?

This gets rid of the issue that the lastTransitionTime gets updated whenever the condition changes, and it prepares for the changes planned for v1beta2, where these conditions won't have a lastTransitionTime anymore.

It sounds like yes: any change to the condition has resulted in the transition time being updated, which is incorrect condition behaviour.

Also, lastTransitionTime is not going anywhere, see the docs for metav1.Condition here. The current implementation is compatible with the future set out in Fabrizio's plan.
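
For reference, the upstream metav1.Condition type referenced here (abridged, validation markers omitted) still carries lastTransitionTime, and helpers such as meta.SetStatusCondition only update it when Status actually flips:

```go
// Abridged from k8s.io/apimachinery/pkg/apis/meta/v1; validation markers omitted.
type Condition struct {
	Type               string          `json:"type"`
	Status             ConditionStatus `json:"status"` // True, False, or Unknown
	ObservedGeneration int64           `json:"observedGeneration,omitempty"`
	LastTransitionTime Time            `json:"lastTransitionTime"`
	Reason             string          `json:"reason"`
	Message            string          `json:"message"`
}
```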

@@ -230,10 +230,22 @@ type MachineStatus struct {
// Conditions defines current service state of the Machine.
// +optional
Conditions Conditions `json:"conditions,omitempty"`

// Deletion is the deletion state of the Machine.
Contributor

Where did we land with getting the godocs to start with the serialised versions?

Perhaps a little more context might be useful?

Suggested change
// Deletion is the deletion state of the Machine.
// deletion contains information relating to removal of the Machine.
// Only present when the Machine has a deletionTimestamp and is being removed from the cluster.

@sbueringer (Member) · Sep 23, 2024

Where did we land with getting the godocs to start with the serialised versions?

First time I'm hearing about this :). godoc conventions suggest it should be upper case, right? (and that's what we do today everywhere in core CAPI)

Contributor

I can't see it explicitly called out in the API conventions from upstream, but, take a look at the godoc across the core types (and I know the upstream API reviewers enforce this from experience), https://github.com/kubernetes/api/blob/master/core/v1/types.go, all of the godocs on fields start with the json tag version of the field name, so that when docs are generated, they represent the serialised versions of the strings
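
A hypothetical illustration of that convention (the field and doc text below are examples, not part of this PR's diff): the godoc line starts with the serialised name from the json tag, so generated API docs show the name users actually see in YAML.

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type exampleStatus struct {
	// nodeDrainStartTime is the time when the drain of the node started.
	// The godoc starts with "nodeDrainStartTime" (the json tag) rather than
	// the Go field name "NodeDrainStartTime".
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`
}
```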

@sbueringer (Member) · Sep 23, 2024

Makes sense. Thank you!

@chrischdi Can you please change it here and also open a help-wanted issue so we change it everywhere?

api/v1beta1/machine_types.go (outdated, resolved review thread)
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
// NodeDrainStartTime is the time when the drain of the node started.
NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`
Contributor

Do we also need a finish time? Same for the other field?

Member

I personally don't think that the drain or detach finish time is important info to have in the API (users & SREs mostly care about what is going on now and eventually why it is stuck; they rarely care about what already happened, and for that the logs are a more exhaustive source).
But no strong opinion.

Contributor

If I know that these have started, how do I know that they have finished if I don't have some field to tell me? 🤔 I realise eventually the machine is going away, but, what if it gets stuck terminating the instance, will that show up somewhere and will I know that drain and volume detach are done?

Member Author

how do I know that they have finished if I don't have some field to tell me? 🤔

From the controller's perspective we (at least currently) do not care:

  • either the controller tries to drain again, which should be a no-op (happy path)
  • or the drain is skipped because the timeout is reached (see the sketch below).

From the user's perspective: the information about where the deletion currently stands should be part of the Deleting condition, I'd say.
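
A rough sketch of that controller-side decision, assuming the start-time field introduced by this PR; drainNode is a placeholder and the helper shape is hypothetical, not the actual machine controller code:

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reconcileDrainSketch shows the two outcomes described above: either the drain is
// (re)issued, which is a no-op once the node is already drained, or it is skipped
// because the configured timeout has elapsed since nodeDrainStartTime.
func reconcileDrainSketch(drainStart *metav1.Time, timeout *metav1.Duration, drainNode func() error) error {
	if timeout != nil && timeout.Duration > 0 &&
		drainStart != nil && time.Since(drainStart.Time) > timeout.Duration {
		// NodeDrainTimeout exceeded: give up on draining and let the deletion continue.
		return nil
	}
	// Happy path: (re)issue the drain; repeating it on an already drained node is a no-op.
	return drainNode()
}
```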

Member

I think in general the new Deleting condition should make clear at which phase of the deletion workflow we are (including making clear which parts are already completed)

}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
// NodeDrainStartTime is the time when the drain of the node started.
Contributor

I feel like we can probably add some more context to this, what does it mean when it's not present for example, what does it mean if it has elapsed for some period and the Machine is still here, what hints can we give to end users?

Member Author

Tried to add some more context on both keys.

api/v1beta1/machine_types.go (outdated, resolved review thread)
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
Contributor

Should we include information here related to the drain, such as the configuration for the timeout?

Since the drain fields are optional in the spec, it would be good perhaps to show the configured values here so that you can correlate between start time and expected end time just by looking at the status?

Member

This is an interesting idea; I'm not really sure how we can represent that we are waiting forever in a clear way (without just showing timeout 0).

Contributor

A value of -1 potentially? But you're right, timeout 0 is awkward 🤔

Member Author

I think the fact that it is still waiting should then be part of the Deleting condition message? 🤔 (or, as long as that one is not around, the DrainSucceeded condition message).

Member

Yep, having this in the conditions seems the easiest way to address this

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2024
@chrischdi
Member Author

First of all, thanks for the great review!

I'm slightly confused by the premise of this PR given my understanding of @fabriziopandini's recent update to status.

A condition should be a perfectly suitable way to represent the drain state, but you mention there have been issues, do we have any of those written up that you can link to?

Both are used to track when the respective phase started, and to check whether the corresponding timeout has been reached, instead of relying on the clusterv1.Condition's lastTransitionTime.

LastTransitionTime should only change when the status flips from true to false or vice versa; does that cause an issue here? Has it been behaving that way?

This gets rid of the issue that the lastTransitionTime gets updated whenever the condition changes, and it prepares for the changes planned for v1beta2, where these conditions won't have a lastTransitionTime anymore.

It sounds like yes: any change to the condition has resulted in the transition time being updated, which is incorrect condition behaviour.

Yes, that's the current behavior in the utils, but it will be fixed with the v1beta2 conditions.

Also, lastTransitionTime is not going anywhere, see the docs for metav1.Condition here. The current implementation is compatible with the future set out in Fabrizio's plan.

That's true, but from the proposal: the target conditions for the final state of the Machines do not contain the VolumeDetachSucceeded or DrainingSucceeded conditions we currently use to determine when we started draining / waiting for detach.

The information about the deletion status they used to show should instead be moved to / made available via the Deleted condition.
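
For illustration, deletion progress could be surfaced on a single Deleting condition, as mentioned earlier in the thread, rather than via dedicated DrainingSucceeded/VolumeDetachSucceeded conditions; the condition type, reason, and message below are hypothetical and not taken from the proposal:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setDeletingCondition sets a hypothetical Deleting condition whose message carries
// the current deletion phase; meta.SetStatusCondition keeps lastTransitionTime stable
// unless the Status value actually changes.
func setDeletingCondition(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Deleting",
		Status:             metav1.ConditionTrue,
		ObservedGeneration: generation,
		Reason:             "DrainingNode",
		Message:            "Draining the node (started at .status.deletion.nodeDrainStartTime)",
	})
}
```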

type MachineStatusDeletion struct {
// NodeDrainStartTime is the time when the drain of the node started.
// Only present when the Machine has a deletionTimestamp, is being removed from the cluster
// and draining the node had been started.
Contributor

Why would drain not have been started when there's a deletion timestamp?

Member Author

Waiting for pre-drain hooks is one reason.

Member Author

Note: the best pointer to overview the deletion process may be this: https://main.cluster-api.sigs.k8s.io/tasks/automated-machine-management/machine_deletions

Contributor

I was thinking it might be useful to point users somewhere, or at least give them a hint as to why node draining might not have started; can we include a line?

Member Author

Another one may be that the node is excluded from the drain entirely (there is an annotation for it).

I shortened it to:

	// NodeDrainStartTime is the time when the drain of the node started and is used to determine
	// if the NodeDrainTimeout is exceeded.
	// Only present when the Machine has a deletionTimestamp and draining the node had been started.

Happy to get other suggestions though :-)

One way would be to say:

	// NodeDrainStartTime is the time when the drain of the node started and is used to determine
	// if the NodeDrainTimeout is exceeded.
	// Only present when the Machine has a deletionTimestamp and draining the node had been started.
    // NodeDrainStartTime may not be set because the deletion is blocked by a pre-drain hook or draining is skipped for the machine.

But doing the same for WaitForNodeVolumeDetachStartTime would get very verbose 🤔 :

	// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
	// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
	// Detaching volumes from nodes is usually done by CSI implementations and the current state
	// is observed from the node's `.Status.VolumesAttached` field.
	// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
    // WaitForNodeVolumeDetachStartTime may not be set because the deletion is blocked by a pre-drain hook, stuck in drain or waiting for volume detachment is skipped for the machine.
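
Put together, the two drafts above would make the fields look roughly like this (a consolidation of the godoc quoted in this thread, not the final diff; the `+optional` markers are assumed):

```go
// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// NodeDrainStartTime is the time when the drain of the node started and is used to determine
	// if the NodeDrainTimeout is exceeded.
	// Only present when the Machine has a deletionTimestamp and draining the node had been started.
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
	// and is used to determine if the NodeVolumeDetachTimeout is exceeded.
	// Detaching volumes from nodes is usually done by CSI implementations and the current state
	// is observed from the node's `.Status.VolumesAttached` field.
	// Only present when the Machine has a deletionTimestamp and waiting for volume detachments had been started.
	// +optional
	WaitForNodeVolumeDetachStartTime *metav1.Time `json:"waitForNodeVolumeDetachStartTime,omitempty"`
}
```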

Further resolved review threads on api/v1beta1/machine_types.go and internal/controllers/machine/machine_controller.go.
@sbueringer (Member) left a comment

Last nit from my side, otherwise lgtm

internal/controllers/machine/machine_controller.go (outdated, resolved review thread)
@sbueringer
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 25, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 2d7929ac4d700f1cf20b236e8e815c0509393408

@sbueringer
Member

@JoelSpeed @fabriziopandini Fine to merge from your side as well?

@fabriziopandini (Member) left a comment

Looks good to me!
/lgtm

// +optional
NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
Member

Suggested change
// waitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started
// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started

Labels
area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve calculation of NodeDrainTimeout & NodeVolumeDetachTimeout exceeded
6 participants