Add Status Conditions for FAR CR #69

Merged
merged 3 commits into from
Aug 1, 2023

Member

@razo7 razo7 commented Jul 27, 2023

Adding three status conditions (Processing, FenceAgentActionSucceeded, and Succeeded) to help FAR in two ways:

  • Convey the status of processing the CR, whether the fence agent action succeeded (only once), and whether the whole remediation succeeded (the node was tainted with the FAR taint, the fence agent action succeeded, and the workloads have been deleted).
  • Limit which sections of the reconcile run based on the conditions, e.g., reboot and resource deletion will finish successfully only once.

Moreover, each status condition includes a reason and a message based on the ProcessingChangeReason that changed the condition value.
The PR also updates the reconcile structure:

  • Fetch FAR CR, validate CR name, check NHC timeout annotation, and add finalizer. ProcessingChangeReason = RemediationStarted
  • Try to add FAR taint, build FA command and execute it (until it succeeds). ProcessingChangeReason = FenceAgentSucceeded
  • Try to delete workloads. ProcessingChangeReason = RemediationFinished
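
The three stages above, and the conditions they drive, can be sketched roughly as follows. This is a self-contained illustration, not the PR's code: `Condition` and `ConditionStatus` are local stand-ins for `metav1.Condition` and `metav1.ConditionStatus`, `setCondition` and `updateConditions` are hypothetical helpers, and the status values and messages picked for each `ProcessingChangeReason` are assumptions made for the example.

```go
package main

// ConditionStatus is a local stand-in for metav1.ConditionStatus.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  ConditionStatus
	Reason  string
	Message string
}

// ProcessingChangeReason names the event that changed the condition values.
type ProcessingChangeReason string

const (
	RemediationStarted  ProcessingChangeReason = "RemediationStarted"
	FenceAgentSucceeded ProcessingChangeReason = "FenceAgentSucceeded"
	RemediationFinished ProcessingChangeReason = "RemediationFinished"
)

// setCondition (hypothetical) updates the condition with the same type,
// or appends it when no such condition exists yet.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// updateConditions (hypothetical) maps one reason to all three conditions.
// The statuses and messages per reason are assumed for illustration only.
func updateConditions(conds []Condition, reason ProcessingChangeReason) []Condition {
	processing, faSucceeded, succeeded := ConditionUnknown, ConditionUnknown, ConditionUnknown
	message := ""
	switch reason {
	case RemediationStarted:
		processing = ConditionTrue
		message = "CR was found, its name matches a node, and a finalizer was set"
	case FenceAgentSucceeded:
		processing, faSucceeded = ConditionTrue, ConditionTrue
		message = "FAR taint was added and the fence agent action succeeded"
	case RemediationFinished:
		processing, faSucceeded, succeeded = ConditionFalse, ConditionTrue, ConditionTrue
		message = "workloads were deleted"
	}
	for _, c := range []Condition{
		{Type: "Processing", Status: processing},
		{Type: "FenceAgentActionSucceeded", Status: faSucceeded},
		{Type: "Succeeded", Status: succeeded},
	} {
		c.Reason, c.Message = string(reason), message
		conds = setCondition(conds, c)
	}
	return conds
}
```

Calling `updateConditions(nil, RemediationStarted)` yields the three conditions in a fixed order; later calls update them in place rather than appending duplicates.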

ECOPROJECT-1411
ECOPROJECT-1484

@openshift-ci
Contributor

openshift-ci bot commented Jul 27, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Jul 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@razo7 razo7 changed the title from "Add Remediation Phase for FAR CR" to "[WIP] Add Remediation Phase for FAR CR" Jul 27, 2023
Member

@slintes slintes left a comment

Please use conditions

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated. Newer API types should use conditions instead. Phase was essentially a state-machine enumeration field, that contradicted system-design principles and hampered evolution, since adding new enum values breaks backward compatibility. Rather than encouraging clients to infer implicit properties from phases, we prefer to explicitly expose the individual conditions that clients need to monitor. Conditions also have the benefit that it is possible to create some conditions with uniform meaning across all resource types, while still exposing others that are unique to specific resource types. See #7856 for more details and discussion.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
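
To make the quoted guidance concrete, here is a minimal sketch of a client that reads one condition by type instead of switching on a single phase value. `Condition` is a local stand-in for `metav1.Condition`, and `findStatus` and `remediationDone` are hypothetical helpers; in real code `meta.FindStatusCondition` and `meta.IsStatusConditionTrue` from k8s.io/apimachinery cover the lookup.

```go
package main

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// findStatus returns the status of the condition with the given type.
// A missing condition is reported as "Unknown".
func findStatus(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return "Unknown"
}

// remediationDone looks only at the condition it cares about, so condition
// types added later do not change its result.
func remediationDone(conds []Condition) bool {
	return findStatus(conds, "Succeeded") == "True"
}
```

With a phase enum, a new enum value can break clients that switch over the known values; here an extra condition type is simply ignored by callers that do not ask for it.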

@razo7 razo7 changed the title from "[WIP] Add Remediation Phase for FAR CR" to "[WIP] Add Status Conditions for FAR CR" Jul 30, 2023
@razo7
Member Author

razo7 commented Jul 30, 2023

/test 4.13-openshift-e2e

@razo7
Member Author

razo7 commented Jul 30, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

@razo7 razo7 changed the title from "[WIP] Add Status Conditions for FAR CR" to "Add Status Conditions for FAR CR" Jul 31, 2023
@mshitrit
Member

/test 4.12-openshift-e2e

@razo7
Member Author

razo7 commented Jul 31, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

api/v1alpha1/fenceagentsremediation_types.go (outdated)

const (
// RemediationStarted - CR was found, its name matches a node, and a finalizer was set
RemediationStarted ProcessingChangeReason = "remediationStarted"
Member

reasons should be CamelCase


// Represents the observations of a FenceAgentsRemediation's current state.
// Known .status.conditions.type are: "Processing", "FenceAgentActionSucceeded", and "Succeeded".
// +patchMergeKey=type
Member

patch* doesn't work on CRDs, remove please

r.Log.Error(err, "Fence Agent response wasn't a success message", "CR's Name", req.Name)
return emptyResult, err

if meta.IsStatusConditionPresentAndEqual(far.Status.Conditions, commonConditions.ProcessingType, metav1.ConditionTrue) &&
Member

  • IMHO the check for the processing condition should be done earlier
  • a better check for the FASucceeded condition would be != True. Then you can set the condition to False in case execution failed.
  • I think the code which prepares command execution can be moved from above inside this if block? (getting the pod, building params...)
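
The suggestions above could be sketched roughly like this: check the Processing condition first, run the fence agent only while FenceAgentActionSucceeded is not True, and record False when execution fails. The types and helpers (`Condition`, `statusOf`, `setStatus`, `runFenceAgentOnce`) are hypothetical stand-ins written for this example and are not the PR's code.

```go
package main

import "errors"

// errFenceAgentFailed is an example error for a failed fence agent run.
var errFenceAgentFailed = errors.New("fence agent response wasn't a success message")

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// statusOf returns the status for a condition type, or "" when it is not set.
func statusOf(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return ""
}

// setStatus updates the condition of the given type or appends a new one.
func setStatus(conds []Condition, condType, status string) []Condition {
	for i := range conds {
		if conds[i].Type == condType {
			conds[i].Status = status
			return conds
		}
	}
	return append(conds, Condition{Type: condType, Status: status})
}

// runFenceAgentOnce runs execute only while the CR is being processed and the
// fence agent action has not succeeded yet.
func runFenceAgentOnce(conds []Condition, execute func() error) ([]Condition, error) {
	if statusOf(conds, "Processing") != "True" ||
		statusOf(conds, "FenceAgentActionSucceeded") == "True" {
		return conds, nil // nothing to do on this reconcile
	}
	// Preparing the command (getting the pod, building params) would go here,
	// so it only runs when the command is actually executed.
	if err := execute(); err != nil {
		return setStatus(conds, "FenceAgentActionSucceeded", "False"), err
	}
	return setStatus(conds, "FenceAgentActionSucceeded", "True"), nil
}
```

Because the guard is `!= True` rather than `== False`, a failed run (status False) and a never-attempted run (condition absent) are both retried, while a successful run is never repeated.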

Member Author

a better check for the FASucceeded condition would be != True.

👍🏻

I think the code which prepares command execution can be moved from above inside this if block? (getting the pod, building params..

I thought of that, and I went with the current implementation in order to limit the code sections which are affected by the status conditions. Having said that do you still see a greater value of adding the suggested code under the if block?

Member Author

Then you can set the condition to False in case execution failed.

ATM there is no ProcessingChangeReason for this use case. But I might add something for that

Member

Having said that do you still see a greater value of adding the suggested code under the if block?

yes, why execute all that code when it's not used 🤷🏼‍♂️

ATM there is no ProcessingChangeReason

tbh, I dislike this "one reason for updating all conditions" pattern anyway, and this is why...

}
r.Log.Info("FenceAgentsRemediation CR has completed to remediate the node", "Node Name", req.Name)

return ctrl.Result{Requeue: true}, nil
Member

do we need to requeue here? 🤔

Member Author

Not really. Can be discarded

// +listType=map
// +listMapKey=type
// +operator-sdk:csv:customresourcedefinitions:type=status,displayName="conditions",xDescriptors="urn:alm:descriptor:io.kubernetes.conditions"
- Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"`
+ Conditions []metav1.Condition `json:"conditions,omitempty" protobuf:"bytes,1,rep,name=conditions"`
Member

missed in the first review: why the protobuf tag?

Member Author

I followed the conditions example from https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties, and it seems like the example is already adding this protobuf anyway in the CSV description. So I will delete it from fenceagentsremediation_types.go to align with how we set conditions in other operators (without the protobuf tag), e.g. NHC's conditions.

Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"`

Member

That doc is primarily targeting core k8s types. So in general it's good to follow its recommendations, but not everything applies to CRDs as well 🙂
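
As a small illustration of why the extra tag is inert for JSON handling, the sketch below uses only the Go standard library to read the struct tags and to marshal the field. `Status` and `Condition` are local stand-ins for the FAR status and `metav1.Condition`, and `tagOf` and `toJSON` are hypothetical helpers written for this example.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// Status carries the json and protobuf tags from the line under review.
type Status struct {
	Conditions []Condition `json:"conditions,omitempty" protobuf:"bytes,1,rep,name=conditions"`
}

// tagOf returns the value stored under one struct-tag key on Status.Conditions.
func tagOf(key string) string {
	field, _ := reflect.TypeOf(Status{}).FieldByName("Conditions")
	return field.Tag.Get(key)
}

// toJSON serializes a Status; encoding/json reads only the json tag key.
func toJSON(s Status) string {
	out, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(out)
}
```

Struct tags are plain metadata: `encoding/json` consults the `json` key and ignores the rest, so dropping the `protobuf` key does not change the JSON form of the field.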

@razo7
Member Author

razo7 commented Jul 31, 2023

/test 4.13-openshift-e2e

r.Log.Error(err, "Invalid sharedParameters/nodeParameters from CR - edit/recreate the CR", "CR's Name", req.Name)
return emptyResult, nil
}
// Add FAR (medik8s) remediation taint
Member

should the taint be in this block as well? 🤔

@razo7
Member Author

razo7 commented Jul 31, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

@razo7
Member Author

razo7 commented Jul 31, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e


var (
processingConditionStatus, fenceAgentActionSucceededConditionStatus, succeededConditionStatus metav1.ConditionStatus
conditinMessage string
Member

typo: conditionMessage

const (
// FARFinalizer is a finalizer for a FenceAgentsRemediation CR deletion
FARFinalizer string = "fence-agents-remediation.medik8s.io/far-finalizer"
// Taints
FARNoExecuteTaintKey = "medik8s.io/fence-agents-remediation"
// FenceAgentActionSucceededType is the condition type used to signal whether the Fence Agent action was succeeded successfully or not
FenceAgentActionSucceededType = "FenceAgentActionSucceeded"
// error status messages
Member

these errors are used for tests only, correct? They shouldn't be in the api package then.

Member Author

I have moved them to errors.go file

log.Info("Pod exist", "pod", podName)
}
}

// verifyStatusCondition checks whether the status condition is set with the expected value
func verifyStatusCondition(testFAR *v1alpha1.FenceAgentsRemediation, conditionType, expectedResult string, conditionStatus metav1.ConditionStatus) {
Member

I won't block on this, but using this "expectedResult" string is strange, to say the least. The type and status params should be enough for the tests, no? Either they match or not.

Member Author

Yes, but I added the expectedResult so the Eventually can "catch" an expected error, e.g., in unit-test when the conditions haven't been set.
I have changed the name to verifyExpectedStatusConditionError as it better captures the essence of this function.

Member

you could make conditionStatus a pointer and use nil, if you want to test that a condition isn't set 🙂
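
A rough sketch of that pointer-based idea, under the assumption that the helper returns a plain bool: `Condition` and `ConditionStatus` are local stand-ins for the `metav1` types, and `conditionMatches` is a hypothetical name, not the helper from the PR.

```go
package main

// ConditionStatus is a local stand-in for metav1.ConditionStatus.
type ConditionStatus string

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status ConditionStatus
}

// conditionMatches reports whether the condition of the given type meets the
// expectation. A nil expected status means "the condition must not be set".
func conditionMatches(conds []Condition, condType string, expected *ConditionStatus) bool {
	for _, c := range conds {
		if c.Type == condType {
			// The condition exists: it matches only a non-nil, equal expectation.
			return expected != nil && c.Status == *expected
		}
	}
	// The condition is absent: that matches only a nil expectation.
	return expected == nil
}
```

With a bool result, a test could wrap the call in a function passed to Gomega's `Eventually` and assert `BeTrue()`, so no separate "expected result" string is needed to tell the set and unset cases apart.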

Member Author

But then the test would fail, since eventually will expect utils.ConditionSetAndMatchSuccess while verifyStatusCondition/verifyExpectedStatusConditionError would return utils.ConditionSetButNoMatchError.

@razo7
Member Author

razo7 commented Aug 1, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

@razo7
Member Author

razo7 commented Aug 1, 2023

/test 4.13-openshift-e2e

Processing, FenceAgentActionSucceeded, and Succeeded status conditions have been added to verify FAR remediation status
Processing, FenceAgentActionSucceeded, and Succeeded status conditions keep the FA execution and workload deletion sections from running on every reconcile call. This helps us avoid a second (or any further) remediation
Unit tests and e2e tests have been added to verify the expected behaviour by checking whether the status conditions have been set and what their values are
@razo7
Member Author

razo7 commented Aug 1, 2023

/test 4.13-openshift-e2e
/test 4.14-openshift-e2e

@razo7
Member Author

razo7 commented Aug 1, 2023

/retest

@razo7 razo7 marked this pull request as ready for review August 1, 2023 14:07
@slintes
Member

slintes commented Aug 1, 2023

oh, CI failed to add the label

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 1, 2023
@openshift-merge-robot openshift-merge-robot merged commit d6df2be into medik8s:main Aug 1, 2023
1 check passed
4 participants