✨ machine: Introduce Deletion status field and add timestamps for drain and volumeDetach instead of using the condition #11166

Open · wants to merge 7 commits into main

Changes from 3 commits
api/v1beta1/machine_types.go (12 additions, 0 deletions)

@@ -230,10 +230,22 @@ type MachineStatus struct {
// Conditions defines current service state of the Machine.
// +optional
Conditions Conditions `json:"conditions,omitempty"`

// Deletion is the deletion state of the Machine.

Contributor:

Where did we land with getting the godocs to start with the serialised versions?

Perhaps a little more context might be useful?

Suggested change:
- // Deletion is the deletion state of the Machine.
+ // deletion contains information relating to removal of the Machine.
+ // Only present when the Machine has a deletionTimestamp and is being removed from the cluster.

Member @sbueringer (Sep 23, 2024):

> Where did we land with getting the godocs to start with the serialised versions?

First time I'm hearing about this :). godoc conventions suggest it should be upper case, right? (And that's what we do today everywhere in core CAPI.)

Contributor:

I can't see it explicitly called out in the upstream API conventions, but take a look at the godoc across the core types (https://github.com/kubernetes/api/blob/master/core/v1/types.go); I know from experience that the upstream API reviewers enforce this. All of the godocs on fields start with the json-tag version of the field name, so that generated docs show the serialised form of each field.
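
For illustration, a minimal sketch of that convention applied to the type added in this PR; the godoc wording here is illustrative, not the final text:

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {
	// nodeDrainStartTime is the time when the drain of the node started.
	// The godoc starts with the serialised (json tag) form of the field
	// name, matching the convention in the k8s.io/api core types.
	// +optional
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`
}
```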

Member @sbueringer (Sep 23, 2024):

Makes sense. Thank you!

@chrischdi Can you please change it here and also open a help-wanted issue so we change it everywhere?

// +optional
Deletion MachineStatusDeletion `json:"deletion,omitempty"`
}

// ANCHOR_END: MachineStatus

// MachineStatusDeletion is the deletion state of the Machine.
type MachineStatusDeletion struct {

Contributor:

Should we include information here related to the drain, such as the configuration for the timeout?

Since the drain fields are optional in the spec, perhaps it would be good to show the configured values here, so that you can correlate start time with expected end time just by looking at the status?
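
For illustration, one hypothetical shape of this suggestion (these echoed-configuration fields are invented here and are not part of this PR):

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineDeletionStatusSketch shows how the status could echo the effective
// drain configuration next to the start time, so the status alone lets you
// correlate the start time with the expected deadline. Sketch only.
type MachineDeletionStatusSketch struct {
	// nodeDrainStartTime is the time when the drain of the node started.
	NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

	// nodeDrainTimeout would echo spec.nodeDrainTimeout as observed when
	// deletion started; nil (wait forever) is the awkward case debated below.
	NodeDrainTimeout *metav1.Duration `json:"nodeDrainTimeout,omitempty"`
}
```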

Member:

This is an interesting idea; I'm not really sure how we can represent that we are waiting forever in a clear way (rather than showing timeout 0).

Contributor:

A value of -1 potentially? But you're right, timeout 0 is awkward 🤔

Member Author:

I think the fact that it is still waiting should then be part of the Deleting condition message? 🤔 (Or, as long as that one is not around, the DrainSucceeded condition message.)

Member:

Yep, having this in the conditions seems the easiest way to address this.

// NodeDrainStartTime is the time when the drain of the node started.

Contributor:

I feel like we can probably add some more context to this: what does it mean when it's not present, for example? What does it mean if it has elapsed for some period and the Machine is still here? What hints can we give to end users?

Member Author:

Tried to add some more context on both keys.

NodeDrainStartTime *metav1.Time `json:"nodeDrainStartTime,omitempty"`

Contributor:

Do we also need a finish time? Same for the other field?

Member:

I personally don't think that the drain or detach finish time is important info to have in the API (users & SREs mostly care about what is going on now, and eventually why it is stuck; they rarely care about what happened in the past, and for that the logs are more exhaustive). But no strong opinion.

Contributor:

If I know that these have started, how do I know that they have finished if I don't have some field to tell me? 🤔 I realise the machine is eventually going away, but what if it gets stuck terminating the instance? Will that show up somewhere, and will I know that drain and volume detach are done?

Member Author:

> how do I know that they have finished if I don't have some field to tell me? 🤔

From the controller's perspective we (at least currently) do not care:

  • either the controller tries to drain again, which should be a no-op (happy path)
  • or the drain is skipped because the timeout is reached.

From the user's perspective: the information about where the deletion currently stands should be part of the Deleting condition, I'd say.
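
A condensed sketch of that flow, with the drain call abstracted away (helper shapes are hypothetical, simplified from the controller changes in this PR):

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// reconcileDrain records the start timestamp exactly once, then either skips
// the drain when the configured timeout has elapsed or retries it; re-draining
// an already-drained node is a no-op (the happy path above).
func reconcileDrain(m *clusterv1.Machine, drainNode func() error) error {
	if m.Status.Deletion.NodeDrainStartTime == nil {
		m.Status.Deletion.NodeDrainStartTime = ptr.To(metav1.Now())
	}
	if drainTimeoutExceeded(m) {
		return nil // timeout reached: skip the drain and let deletion proceed
	}
	return drainNode()
}

func drainTimeoutExceeded(m *clusterv1.Machine) bool {
	// nil or <=0 means no timeout was configured: wait forever.
	if m.Spec.NodeDrainTimeout == nil || m.Spec.NodeDrainTimeout.Duration <= 0 {
		return false
	}
	if m.Status.Deletion.NodeDrainStartTime == nil {
		return false
	}
	return time.Since(m.Status.Deletion.NodeDrainStartTime.Time) >= m.Spec.NodeDrainTimeout.Duration
}
```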

Member:

I think in general the new Deleting condition should make clear at which phase of the deletion workflow we are (including making clear which parts are already completed).
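
For example, a hypothetical phase-bearing condition along those lines (type, reason, and message wording are invented here, not defined by this PR):

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// deletingCondition illustrates how a single Deleting condition could expose
// the current phase of the deletion workflow and the parts already completed.
var deletingCondition = metav1.Condition{
	Type:    "Deleting",
	Status:  metav1.ConditionTrue,
	Reason:  "WaitingForVolumeDetach",
	Message: "Node drain completed; waiting for volumes to be detached since 2024-09-23T10:01:30Z",
}
```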

// WaitForNodeVolumeDetachStartTime is the time when waiting for volume detachment started.
WaitForNodeVolumeDetachStartTime *metav1.Time `json:"nodeVolumeDetachStartTime,omitempty"`
}

// SetTypedPhase sets the Phase field to the string representation of MachinePhase.
func (m *MachineStatus) SetTypedPhase(p MachinePhase) {
m.Phase = string(p)
api/v1beta1/zz_generated.deepcopy.go (24 additions, 0 deletions; generated file, not rendered)

api/v1beta1/zz_generated.openapi.go (36 additions, 1 deletion; generated file, not rendered)

config/crd/bases/cluster.x-k8s.io_machines.yaml (14 additions, 0 deletions; generated file, not rendered)

internal/apis/core/v1alpha3/conversion.go (1 addition, 0 deletions)

@@ -101,6 +101,7 @@ func (src *Machine) ConvertTo(dstRaw conversion.Hub) error {
dst.Spec.NodeVolumeDetachTimeout = restored.Spec.NodeVolumeDetachTimeout
dst.Status.NodeInfo = restored.Status.NodeInfo
dst.Status.CertificatesExpiryDate = restored.Status.CertificatesExpiryDate
dst.Status.Deletion = restored.Status.Deletion
return nil
}

internal/apis/core/v1alpha3/zz_generated.conversion.go (1 addition, 0 deletions; generated file, not rendered)

internal/apis/core/v1alpha4/conversion.go (1 addition, 0 deletions)

@@ -190,6 +190,7 @@ func (src *Machine) ConvertTo(dstRaw conversion.Hub) error {
dst.Spec.NodeDeletionTimeout = restored.Spec.NodeDeletionTimeout
dst.Status.CertificatesExpiryDate = restored.Status.CertificatesExpiryDate
dst.Spec.NodeVolumeDetachTimeout = restored.Spec.NodeVolumeDetachTimeout
dst.Status.Deletion = restored.Status.Deletion
return nil
}

internal/apis/core/v1alpha4/zz_generated.conversion.go (1 addition, 0 deletions; generated file, not rendered)

internal/controllers/machine/machine_controller.go (17 additions, 10 deletions)

@@ -31,6 +31,7 @@ import (
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/client-go/tools/record"
"k8s.io/klog/v2"
"k8s.io/utils/ptr"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/builder"
"sigs.k8s.io/controller-runtime/pkg/client"
@@ -415,6 +416,10 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Clu
conditions.MarkFalse(m, clusterv1.VolumeDetachSucceededCondition, clusterv1.WaitingForVolumeDetachReason, clusterv1.ConditionSeverityInfo, "Waiting for node volumes to be detached")
}

if m.Status.Deletion.WaitForNodeVolumeDetachStartTime == nil {
m.Status.Deletion.WaitForNodeVolumeDetachStartTime = ptr.To(metav1.Now())
}

if ok, err := r.shouldWaitForNodeVolumes(ctx, cluster, m.Status.NodeRef.Name); ok || err != nil {
if err != nil {
r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedWaitForVolumeDetach", "error waiting for node volumes detaching, Machine's node %q: %v", m.Status.NodeRef.Name, err)
@@ -527,34 +532,32 @@ func (r *Reconciler) nodeDrainTimeoutExceeded(machine *clusterv1.Machine) bool {
return false
}

- // if the draining succeeded condition does not exist
- if conditions.Get(machine, clusterv1.DrainingSucceededCondition) == nil {
+ // if the NodeDrainStartTime does not exist
+ if machine.Status.Deletion.NodeDrainStartTime == nil {
return false
}

now := time.Now()
- firstTimeDrain := conditions.GetLastTransitionTime(machine, clusterv1.DrainingSucceededCondition)
- diff := now.Sub(firstTimeDrain.Time)
+ diff := now.Sub(machine.Status.Deletion.NodeDrainStartTime.Time)
return diff.Seconds() >= machine.Spec.NodeDrainTimeout.Seconds()
}

// nodeVolumeDetachTimeoutExceeded returns False if either NodeVolumeDetachTimeout is set to nil or <=0 OR
- // VolumeDetachSucceededCondition is not set on the Machine. Otherwise returns true if the timeout is expired
- // since the last transition time of VolumeDetachSucceededCondition.
+ // WaitForNodeVolumeDetachStartTime is not set on the Machine. Otherwise returns true if the timeout is expired
+ // since the WaitForNodeVolumeDetachStartTime.
func (r *Reconciler) nodeVolumeDetachTimeoutExceeded(machine *clusterv1.Machine) bool {
// if the NodeVolumeDetachTimeout type is not set by user
if machine.Spec.NodeVolumeDetachTimeout == nil || machine.Spec.NodeVolumeDetachTimeout.Seconds() <= 0 {
return false
}

- // if the volume detaching succeeded condition does not exist
- if conditions.Get(machine, clusterv1.VolumeDetachSucceededCondition) == nil {
+ // if the NodeVolumeDetachStartTime does not exist
+ if machine.Status.Deletion.WaitForNodeVolumeDetachStartTime == nil {
return false
}

now := time.Now()
- firstTimeDetach := conditions.GetLastTransitionTime(machine, clusterv1.VolumeDetachSucceededCondition)
- diff := now.Sub(firstTimeDetach.Time)
+ diff := now.Sub(machine.Status.Deletion.WaitForNodeVolumeDetachStartTime.Time)
return diff.Seconds() >= machine.Spec.NodeVolumeDetachTimeout.Seconds()
}

@@ -697,6 +700,10 @@ func (r *Reconciler) drainNode(ctx context.Context, cluster *clusterv1.Cluster,
log.V(3).Info("Node is unreachable, draining will use 1s GracePeriodSeconds and will ignore all Pods that have a deletionTimestamp > 1s old")
}

if machine.Status.Deletion.NodeDrainStartTime == nil {
machine.Status.Deletion.NodeDrainStartTime = ptr.To(metav1.Now())
}

if err := drainer.CordonNode(ctx, node); err != nil {
// Machine will be re-reconciled after a cordon failure.
return ctrl.Result{}, errors.Wrapf(err, "failed to cordon Node %s", node.Name)
internal/controllers/machine/machine_controller_test.go (12 additions, 36 deletions)

@@ -1386,12 +1386,8 @@ func TestIsNodeDrainedAllowed(t *testing.T) {
},

Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.DrainingSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 70)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ NodeDrainStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 70)).UTC()},
},
},
},
@@ -1412,12 +1408,8 @@ func TestIsNodeDrainedAllowed(t *testing.T) {
NodeDrainTimeout: &metav1.Duration{Duration: time.Second * 60},
},
Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.DrainingSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 30)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ NodeDrainStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 30)).UTC()},
},
},
},
@@ -1437,12 +1429,8 @@ func TestIsNodeDrainedAllowed(t *testing.T) {
Bootstrap: clusterv1.Bootstrap{DataSecretName: ptr.To("data")},
},
Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.DrainingSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 1000)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ NodeDrainStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 1000)).UTC()},
},
},
},
@@ -1896,12 +1884,8 @@ func TestIsNodeVolumeDetachingAllowed(t *testing.T) {
},

Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.VolumeDetachSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 60)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ WaitForNodeVolumeDetachStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 60)).UTC()},
},
},
},
@@ -1922,12 +1906,8 @@ func TestIsNodeVolumeDetachingAllowed(t *testing.T) {
NodeVolumeDetachTimeout: &metav1.Duration{Duration: time.Second * 60},
},
Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.VolumeDetachSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 30)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ WaitForNodeVolumeDetachStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 30)).UTC()},
},
},
},
@@ -1947,12 +1927,8 @@ func TestIsNodeVolumeDetachingAllowed(t *testing.T) {
Bootstrap: clusterv1.Bootstrap{DataSecretName: ptr.To("data")},
},
Status: clusterv1.MachineStatus{
- Conditions: clusterv1.Conditions{
- {
- Type: clusterv1.VolumeDetachSucceededCondition,
- Status: corev1.ConditionFalse,
- LastTransitionTime: metav1.Time{Time: time.Now().Add(-(time.Second * 1000)).UTC()},
- },
+ Deletion: clusterv1.MachineStatusDeletion{
+ WaitForNodeVolumeDetachStartTime: &metav1.Time{Time: time.Now().Add(-(time.Second * 1000)).UTC()},
},
},
},