update the vr condition #691

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

yati1998
Contributor

@yati1998 yati1998 commented Oct 15, 2024

This commit updates the Volume Replication (VR) conditions to
include a descriptive message for every operation.

Example of the VR status:

status:
  conditions:
  - lastTransitionTime: "2024-11-07T06:48:05Z"
    message: failed to demote volume
    observedGeneration: 3
    reason: FailedToDemote
    status: "False"
    type: Completed
  - lastTransitionTime: "2024-11-07T06:48:05Z"
    message: 'volume failed to demote: rpc error: code = NotFound desc = volume 0001-0009-rook-ceph-0000000000000001-337782ea-660f-4717-991d-7f2a8ac7a8ea
      not found: Failed as image not found (internal RBD image not found: rbd: ret=-2,
      No such file or directory)'
    observedGeneration: 3
    reason: Error
    status: "True"
    type: Degraded
  - lastTransitionTime: "2024-11-07T06:45:46Z"
    message: volume is not resyncing
    observedGeneration: 3
    reason: NotResyncing
    status: "False"
    type: Resyncing
  lastCompletionTime: "2024-11-07T06:45:46Z"
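
For readers consuming a status like the one above, here is a minimal Go sketch of how a client might pull the failure text out of the Degraded condition. The `Condition` struct is a local stand-in for `metav1.Condition`, and `findCondition` and `degradedMessage` are hypothetical helpers, not code from this PR; a real controller would more likely use `meta.FindStatusCondition` from `k8s.io/apimachinery`.

```go
package main

import "fmt"

// Condition is a local stand-in for metav1.Condition, reduced to the
// fields used here so the sketch compiles without Kubernetes packages.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// findCondition returns the first condition with the given type, or nil.
func findCondition(conditions []Condition, condType string) *Condition {
	for i := range conditions {
		if conditions[i].Type == condType {
			return &conditions[i]
		}
	}
	return nil
}

// degradedMessage returns the message of the Degraded condition when its
// status is "True", and an empty string otherwise.
func degradedMessage(conditions []Condition) string {
	c := findCondition(conditions, "Degraded")
	if c == nil || c.Status != "True" {
		return ""
	}
	return c.Message
}

func main() {
	// Values shortened from the example status in the PR description.
	conditions := []Condition{
		{Type: "Completed", Status: "False", Reason: "FailedToDemote", Message: "failed to demote volume"},
		{Type: "Degraded", Status: "True", Reason: "Error", Message: "volume failed to demote: rpc error: code = NotFound"},
		{Type: "Resyncing", Status: "False", Reason: "NotResyncing", Message: "volume is not resyncing"},
	}
	fmt.Println(degradedMessage(conditions))
}
```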

@nixpanic
Collaborator

It would be good to have an example in the PR description. Ideally do not abbreviate things like "vr" in commit subjects.

@yati1998 yati1998 force-pushed the vr branch 2 times, most recently from cc55d78 to 4e4dac3 Compare October 16, 2024 04:03
@yati1998 yati1998 requested a review from nixpanic October 16, 2024 04:03
nixpanic
nixpanic previously approved these changes Oct 16, 2024
Collaborator

@nixpanic nixpanic left a comment

Thanks!

Type: v1alpha1.ConditionDegraded,
Reason: v1alpha1.Error,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionTrue,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "failed to resync",
Member

can we define a const/type for all these messages?

Contributor Author

We already have those defined; they are printed as the reason in the conditions.
These messages should be more descriptive in my opinion, hence waiting for Shyam's response.

Collaborator

@ShyamsundarR please have a look and share your opinion here.

@yati1998
Contributor Author

@ShyamsundarR I have updated the PR to include specific errors for failed conditions. Please take a look.

@nirs nirs left a comment

All the messages are very short and not descriptive. The VR .Status.Message looks like good error text (maybe too verbose sometimes).

Do we assume that users will use .Status.Message for reporting errors? How do we know which condition is related to .Status.Message?

I think we need to include .Status.Message in the relevant .Conditions[N].Message instead of the short and non descriptive messages.

Ramen looks only at the VR conditions, and it will be very easy to use if we can use the condition message in the ramen condition. I don't see how we can use the global .Status.Message.
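
A minimal Go sketch of the idea in this comment: put the driver's error text into the message of the condition it belongs to, behind a short fixed summary. `conditionMessage` is a hypothetical helper written for illustration and is not taken from the PR; the "summary: error" layout is an assumption.

```go
package main

import "fmt"

// conditionMessage joins a short, fixed summary with the error text
// reported by the driver, so the condition itself explains the failure.
// Hypothetical helper for illustration only.
func conditionMessage(summary, errorMessage string) string {
	if errorMessage == "" {
		// Nothing to add: keep the summary on its own.
		return summary
	}
	return fmt.Sprintf("%s: %s", summary, errorMessage)
}

func main() {
	driverError := "rpc error: code = FailedPrecondition desc = parent image is not enabled for mirroring"
	fmt.Println(conditionMessage("failed to meet prerequisite", driverError))
}
```

With this shape a consumer such as Ramen can copy a single condition message and does not need to guess which condition the global .Status.Message refers to.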

Type: v1alpha1.ConditionCompleted,
Reason: v1alpha1.Promoted,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionTrue,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "Volume is healthy",

This message does not add enough info to understand the meaning of the condition.

Contributor Author

Since this is a successful condition, we don't have any additional information to print; the VR is in a healthy state. Can you please share some examples of what other information you want printed here?

Type: v1alpha1.ConditionDegraded,
Reason: v1alpha1.Healthy,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionFalse,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "Volume is not resyncing",

Same here, it is just the text version of NotResyncing or Resyncing=false. The text should explain the status.

Contributor Author

Same explanation as above: in this case resync is not triggered, hence the volume is not resyncing. Not sure what the detailed status would be for this?

@@ -46,26 +49,30 @@ func setPromotedCondition(conditions *[]metav1.Condition, observedGeneration int
}

// sets conditions when volume promotion was failed.
-func setFailedPromotionCondition(conditions *[]metav1.Condition, observedGeneration int64) {
+func setFailedPromotionCondition(conditions *[]metav1.Condition, observedGeneration int64, err string) {

err is not a good name for an error message; it is the common name for objects implementing the error interface. Maybe use errorMessage?

Contributor Author

yes can be, will update

Type: v1alpha1.ConditionDegraded,
Reason: v1alpha1.Error,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionTrue,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "Volume is not resyncing",

Why are we repeating the same value multiple times? If we need to repeat the same message, they should be constants.
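
A short Go sketch of what defining the repeated messages once could look like. The constant names, the local `Condition` type, and `failedPromotionConditions` are hypothetical and only illustrate the suggestion; the message wording is borrowed from the status examples in this thread.

```go
package main

import "fmt"

// Message constants, defined once and reused wherever a condition is set.
// The names are hypothetical.
const (
	messageFailedPromotion = "failed to promote volume"
	messageNotResyncing    = "volume is not resyncing"
)

// Condition is a local stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// failedPromotionConditions builds the three conditions for a failed
// promotion, reusing the constants instead of repeating string literals.
func failedPromotionConditions(errorMessage string) []Condition {
	return []Condition{
		{Type: "Completed", Status: "False", Reason: "FailedToPromote", Message: messageFailedPromotion},
		{Type: "Degraded", Status: "True", Reason: "Error", Message: fmt.Sprintf("%s: %s", messageFailedPromotion, errorMessage)},
		{Type: "Resyncing", Status: "False", Reason: "NotResyncing", Message: messageNotResyncing},
	}
}

func main() {
	for _, c := range failedPromotionConditions("rpc error: code = NotFound") {
		fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```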

Type: v1alpha1.ConditionCompleted,
Reason: v1alpha1.FailedToPromote,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionFalse,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "error detected while promotion",

It is strange that we set the standard message for this condition, but custom message for the previous condition. Based on the function name this should affect only the PromotedCondition.

Contributor Author

No. Here the volume is also not healthy, as some error has been detected, and the reason for this state says that an error has been detected. What we can do here is print the error that caused the failure, but status.Message already has that information, hence I preferred not to duplicate it. If you feel it is important, we can add the same error here as well.

Type: v1alpha1.ConditionCompleted,
Reason: v1alpha1.FailedToPromote,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionFalse,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "error detected while promotion",

This condition is not related to promotion. Why do we need to talk about promotion here?

Contributor Author

this condition is reached while we were trying to promote and hence the error was detected right?

Type: v1alpha1.ConditionDegraded,
Reason: v1alpha1.Error,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionTrue,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "Volume is not resyncing",

This message is capitalised, while the others are not. I'm not sure what the wanted format is, but all messages must be consistent.

Contributor Author

ohh missed this, will update
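
The consistency issue raised here is easy to catch mechanically. Below is a small Go sketch of such a check, assuming the chosen convention is that messages start with a lowercase letter; `startsLowercase` and `inconsistentMessages` are hypothetical helpers, not part of the PR.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// startsLowercase reports whether msg begins with a lowercase letter.
// An empty message returns false.
func startsLowercase(msg string) bool {
	r, size := utf8.DecodeRuneInString(msg)
	return size > 0 && unicode.IsLower(r)
}

// inconsistentMessages returns the messages that break the assumed
// convention of starting with a lowercase letter.
func inconsistentMessages(messages []string) []string {
	var bad []string
	for _, m := range messages {
		if !startsLowercase(m) {
			bad = append(bad, m)
		}
	}
	return bad
}

func main() {
	messages := []string{"failed to resync", "Volume is not resyncing", "volume is healthy"}
	fmt.Println(inconsistentMessages(messages))
}
```

A check like this could run in a unit test over the message constants, so a capitalised message fails the build instead of reaching review.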

Type: v1alpha1.ConditionResyncing,
Reason: v1alpha1.NotResyncing,
ObservedGeneration: observedGeneration,
Status: metav1.ConditionFalse,
})
setStatusCondition(conditions, &metav1.Condition{
Message: "failed to meet prerequisite",

Good summary of the error but we really want the reason that will help the admin debug this problem. What is the actual issue?

Contributor Author

that will be in status.message

@@ -124,7 +124,7 @@ func (r *VolumeReplicationReconciler) Reconcile(ctx context.Context, req ctrl.Re
err = validatePrefixedParameters(vrcObj.Spec.Parameters)
if err != nil {
logger.Error(err, "failed to validate parameters of volumeReplicationClass", "VRCName", instance.Spec.VolumeReplicationClass)
-setFailureCondition(instance)
+setFailureCondition(instance, "failed to validate parameters of volumeReplicationClass")

Starts to look helpful, but which parameter was invalid?
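
One way to answer "which parameter" is to have the validation error name the offending key and append it to the condition message. The Go sketch below is illustrative only: `validateParameters`, `reservedPrefix`, and `allowedKeys` are hypothetical and do not reproduce the real `validatePrefixedParameters` or its rules.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// reservedPrefix and allowedKeys are assumptions made for this sketch;
// they are not the values used by kubernetes-csi-addons.
const reservedPrefix = "example.io/"

var allowedKeys = map[string]bool{
	reservedPrefix + "secret-name":      true,
	reservedPrefix + "secret-namespace": true,
}

// validateParameters rejects keys that use the reserved prefix but are not
// in the allowed set, and names the offending key in the error.
func validateParameters(params map[string]string) error {
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, so the same key is reported each time
	for _, k := range keys {
		if strings.HasPrefix(k, reservedPrefix) && !allowedKeys[k] {
			return fmt.Errorf("unknown parameter %q", k)
		}
	}
	return nil
}

func main() {
	err := validateParameters(map[string]string{"example.io/typo": "x"})
	if err != nil {
		// The condition message can then carry the specific cause.
		fmt.Println("failed to validate parameters of volumeReplicationClass: " + err.Error())
	}
}
```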

@@ -160,7 +160,7 @@ func (r *VolumeReplicationReconciler) Reconcile(ctx context.Context, req ctrl.Re
pvc, pv, pvErr = r.getPVCDataSource(logger, nameSpacedName)
if pvErr != nil {
logger.Error(pvErr, "failed to get PVC", "PVCName", instance.Spec.DataSource.Name)
-setFailureCondition(instance)
+setFailureCondition(instance, "failed to get PVC")

failed to find the PVC? "Get" is a technical term, the function used to read an object from the api server, the human term is finding or looking up.

But it is not clear if the issue is a missing PVC (ErrNotExist) or an error accessing the API server (maybe a temporary error?). When we fix the code to handle both cases we can provide a more useful error message.
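
A Go sketch of the distinction described here, using sentinel errors as stand-ins. `errNotFound`, `errUnavailable`, and `pvcFailureMessage` are hypothetical; controller code would typically test the API error with `apierrors.IsNotFound` from `k8s.io/apimachinery` instead of a local sentinel.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for API server responses in this sketch.
var (
	errNotFound    = errors.New("not found")
	errUnavailable = errors.New("connection refused")
)

// pvcFailureMessage picks wording that tells the admin whether the PVC is
// missing or whether the lookup itself failed.
func pvcFailureMessage(pvcName string, err error) string {
	if errors.Is(err, errNotFound) {
		return fmt.Sprintf("PVC %q does not exist", pvcName)
	}
	return fmt.Sprintf("failed to look up PVC %q: %v", pvcName, err)
}

func main() {
	// errors.Is also matches wrapped errors.
	wrapped := fmt.Errorf("get pvc: %w", errNotFound)
	fmt.Println(pvcFailureMessage("restored-pvc", wrapped))
	fmt.Println(pvcFailureMessage("restored-pvc", errUnavailable))
}
```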

@mergify mergify bot added the api Change to the API, requires extra care label Oct 30, 2024
@yati1998
Contributor Author

The following comments have been addressed in the above PR:

  1. The error message from Ceph-CSI is included in the VR conditions so that it can be picked up by the VRG
  2. Error messages have been made constants

The VR after this change looks like below:

Status:
  Conditions:
    Last Transition Time:  2024-10-15T04:21:29Z
    Message:               failed to demote
    Observed Generation:   3
    Reason:                FailedToDemote
    Status:                False
    Type:                  Completed
    Last Transition Time:  2024-10-15T04:21:29Z
    Message:               Volume is degraded
    Observed Generation:   3
    Reason:                volume 0001-0009-rook-ceph-0000000000000002-07d77e2c-5689-4493-b583-cf06c92e1db1 not found: Failed as image not found (internal RBD image not found)
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2024-10-15T04:19:17Z
    Message:               volume is not resyncing
    Observed Generation:   3
    Reason:                NotResyncing
    Status:                False
    Type:                  Resyncing
  Last Completion Time:    2024-10-15T04:19:18Z
  Message:                 volume 0001-0009-rook-ceph-0000000000000002-07d77e2c-5689-4493-b583-cf06c92e1db1 not found: Failed as image not found (internal RBD image not found)
  Observed Generation:     3
  State:                   Primary
Events:                    <none>

cc @nirs @Madhu-1 @nixpanic @ShyamsundarR please add your reviews to the changes

@yati1998
Contributor Author

yati1998 commented Nov 4, 2024

> Looks like we create useful messages now, but it is not clear to me when and why messageFromDriver applies to which condition.
>
> Can you share an example VR status in all the error cases? It can be simulated by faking failures, or by reproducing failure conditions on a real system (e.g. trying to replicate a cloned pvc with the wrong dr policy).
>
> If you provide an image, I can test this in the ramen testing env.

sure, let me create an image and share with you

@nirs

nirs commented Nov 5, 2024

I installed the image provided by @yati1998 using this:
RamenDR/ramen#1639

It passed ramen e2e job:
https://github.com/RamenDR/ramen/actions/runs/11683242163/job/32532029150

I'm testing the message for validation issue now.

@nirs

nirs commented Nov 5, 2024

It does not work. I deployed a deployment using a pvc restored from a snapshot, and tried to protect it. This is expected to fail since the dr policy is not enabled for flattening.
https://github.com/nirs/ocm-ramen-samples/tree/flatten/workloads/flatten/regional-dr

I get empty messages for all conditions instead of the expected error message.

% kubectl get vr -A --context dr1 -o yaml
apiVersion: v1
items:
- apiVersion: replication.storage.openshift.io/v1alpha1
  kind: VolumeReplication
  metadata:
    creationTimestamp: "2024-11-05T13:08:15Z"
    finalizers:
    - replication.storage.openshift.io
    generation: 1
    labels:
      ramendr.openshift.io/owner-name: flatten-drpc
      ramendr.openshift.io/owner-namespace-name: ramen-ops
    name: restored-pvc
    namespace: flatten
    resourceVersion: "27554"
    uid: 555ff4ca-dc7e-4076-a4c5-323609fb733e
  spec:
    autoResync: false
    dataSource:
      apiGroup: ""
      kind: PersistentVolumeClaim
      name: restored-pvc
    replicationHandle: ""
    replicationState: primary
    volumeReplicationClass: vrc-sample
  status:
    conditions:
    - lastTransitionTime: "2024-11-05T13:08:15Z"
      message: ""
      observedGeneration: 1
      reason: FailedToPromote
      status: "False"
      type: Completed
    - lastTransitionTime: "2024-11-05T13:08:15Z"
      message: ""
      observedGeneration: 1
      reason: Error
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-11-05T13:08:15Z"
      message: ""
      observedGeneration: 1
      reason: NotResyncing
      status: "False"
      type: Resyncing
    - lastTransitionTime: "2024-11-05T13:08:15Z"
      message: ""
      observedGeneration: 1
      reason: PrerequisiteNotMet
      status: "False"
      type: Validated
    message: 'system is not in a state required for the operation''s execution: failed
      to enable mirroring on image "replicapool/csi-vol-f4737b6e-eeff-4137-8248-301cf37a3368":
      parent image "replicapool/csi-snap-e7c91292-a272-4278-9ee9-6be7a4c8bfe0" is
      not enabled for mirroring'
    observedGeneration: 1
    state: Unknown
kind: List
metadata:
  resourceVersion: ""

@ShyamsundarR

The change to include errors from the RPCs in the condition message is good; it moves the message closer to the condition that faced the error.

This still does not catch certain errors, as @yati1998 pointed out. For example, MirrorEnable returns -22 with no description, so we miss the context that the error was actually something like this:

$ rbd mirror image enable ocs-storagecluster-cephblockpool/csi-vol-7c561e23-151b-4f8d-a219-41abd252d713 snapshot 
2024-10-25T13:56:25.349+0000 7f3006be3c00 -1 librbd::api::Mirror: image_enable: image has a parent, snapshot based mirroring is not supported

This is being addressed in ceph/ceph-csi#4941. We may need to handle other corner cases as they arise to report better failures/errors. Otherwise the move to report known errors and failures in messages with this PR looks good.

(not adding a +1 as I am not looking at the specifics of the change overall)

@yati1998
Contributor Author

yati1998 commented Nov 5, 2024

> The change to include errors from the RPCs in the condition message is good; it moves the message closer to the condition that faced the error.
>
> This still does not catch certain errors, as @yati1998 pointed out. For example, MirrorEnable returns -22 with no description, so we miss the context that the error was actually something like this:
>
> $ rbd mirror image enable ocs-storagecluster-cephblockpool/csi-vol-7c561e23-151b-4f8d-a219-41abd252d713 snapshot
> 2024-10-25T13:56:25.349+0000 7f3006be3c00 -1 librbd::api::Mirror: image_enable: image has a parent, snapshot based mirroring is not supported
>
> This is being addressed in ceph/ceph-csi#4941. We may need to handle other corner cases as they arise to report better failures/errors. Otherwise the move to report known errors and failures in messages with this PR looks good.
>
> (not adding a +1 as I am not looking at the specifics of the change overall)

Hey @ShyamsundarR, this error was updated further. I think you are using an older version of cephcsi, which is why only the error code is returned. https://github.com/ceph/ceph-csi/pull/4678/files#diff-d3713dcbb075d877c69230700478ec60c5fe029d742dae0f72499d9f6833617f

@ShyamsundarR

> Hey @ShyamsundarR, this error was updated further. I think you are using an older version of cephcsi, which is why only the error code is returned. https://github.com/ceph/ceph-csi/pull/4678/files#diff-d3713dcbb075d877c69230700478ec60c5fe029d742dae0f72499d9f6833617f

Quite possible, the builds in use were downstream and possibly ones in release-4.16. Overall, the new PR to improve the error message from Ceph-CSI is helpful and relates to this PR, hence cross posting it here.

@nirs

nirs commented Nov 5, 2024

> It does not work. I deployed a deployment using a pvc restored from a snapshot, and tried to protect it. This is expected to fail since the dr policy is not enabled for flattening. https://github.com/nirs/ocm-ramen-samples/tree/flatten/workloads/flatten/regional-dr
>
> I get empty messages for all conditions instead of the expected error message.

The test was wrong, we used the manifests from 0.10.0. Testing again with the manifests from this branch and a new version of the image:
https://github.com/RamenDR/ramen/compare/c6135d3dcbf198f302c1bc1f98ebd6a4bb2e8e2e..d65c6eb9ed478c3affd8b18adb061094c078798b

@nirs

nirs commented Nov 5, 2024

Works now with fixed ramen change:
RamenDR/ramen@c496fe9

% kubectl get vr -A --context dr1 -o yaml              
apiVersion: v1
items:
- apiVersion: replication.storage.openshift.io/v1alpha1
  kind: VolumeReplication
  metadata:
    creationTimestamp: "2024-11-05T13:42:41Z"
    finalizers:
    - replication.storage.openshift.io
    generation: 1
    labels:
      ramendr.openshift.io/owner-name: flatten-drpc
      ramendr.openshift.io/owner-namespace-name: ramen-ops
    name: restored-pvc
    namespace: flatten
    resourceVersion: "40696"
    uid: 9bf65b35-b3da-42dc-a9c0-fbb6a3927b1b
  spec:
    autoResync: false
    dataSource:
      apiGroup: ""
      kind: PersistentVolumeClaim
      name: restored-pvc
    replicationHandle: ""
    replicationState: primary
    volumeReplicationClass: vrc-sample
  status:
    conditions:
    - lastTransitionTime: "2024-11-05T13:42:41Z"
      message: failed to promote volume
      observedGeneration: 1
      reason: FailedToPromote
      status: "False"
      type: Completed
    - lastTransitionTime: "2024-11-05T13:42:41Z"
      message: failed to enable volume replication
      observedGeneration: 1
      reason: Error
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-11-05T13:42:41Z"
      message: volume is not resyncing
      observedGeneration: 1
      reason: NotResyncing
      status: "False"
      type: Resyncing
    - lastTransitionTime: "2024-11-05T13:42:41Z"
      message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
        desc = system is not in a state required for the operation''s execution: failed
        to enable mirroring on image "replicapool/csi-vol-f4737b6e-eeff-4137-8248-301cf37a3368":
        parent image "replicapool/csi-snap-e7c91292-a272-4278-9ee9-6be7a4c8bfe0" is
        not enabled for mirroring'
      observedGeneration: 1
      reason: PrerequisiteNotMet
      status: "False"
      type: Validated
    message: 'system is not in a state required for the operation''s execution: failed
      to enable mirroring on image "replicapool/csi-vol-f4737b6e-eeff-4137-8248-301cf37a3368":
      parent image "replicapool/csi-snap-e7c91292-a272-4278-9ee9-6be7a4c8bfe0" is
      not enabled for mirroring'
    observedGeneration: 1
    state: Unknown
kind: List
metadata:
  resourceVersion: ""

The duplicate .Status.message can be eliminated, but maybe someone is depending on this?

@yati1998
Contributor Author

yati1998 commented Nov 6, 2024

> The duplicate .Status.message can be eliminated, but maybe someone is depending on this?

Yes we can't eliminate that.

@yati1998
Contributor Author

yati1998 commented Nov 6, 2024

Since @nirs has also tested the change and is satisfied with the current modifications, I would request a final review on this.
I have addressed all the comments above.

cc @nixpanic @nirs @ShyamsundarR @Madhu-1

nirs added a commit to nirs/ramen that referenced this pull request Nov 6, 2024
Use a development build of csi-addons adding .Message to the all
conditions:
csi-addons/kubernetes-csi-addons#691

This is only needed for testing, I'll remove this before merging.

Signed-off-by: Nir Soffer <[email protected]>
nirs added a commit to nirs/ramen that referenced this pull request Nov 6, 2024
Based on the messages added in
csi-addons/kubernetes-csi-addons#691. We want to
propagate the error messages to the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
this commit updates the volumereplication conditions to
include descriptive message for every
operations

Signed-off-by: yati1998 <[email protected]>
@yati1998
Contributor Author

yati1998 commented Nov 7, 2024

@nirs @nixpanic @ShyamsundarR can you please give your final review here?

@nirs nirs left a comment

Looks awesome, thanks!

Collaborator

@nixpanic nixpanic left a comment

Thanks!

@mergify mergify bot merged commit 8b610a1 into csi-addons:main Nov 8, 2024
15 checks passed
nirs added a commit to nirs/ramen that referenced this pull request Nov 10, 2024
Based on the messages added in
csi-addons/kubernetes-csi-addons#691. We want to
propagate the error messages to the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
BenamarMk pushed a commit to RamenDR/ramen that referenced this pull request Nov 11, 2024
Based on the messages added in
csi-addons/kubernetes-csi-addons#691. We want to
propagate the error messages to the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
ShyamsundarR pushed a commit to ShyamsundarR/ramen that referenced this pull request Nov 16, 2024
Based on the messages added in
csi-addons/kubernetes-csi-addons#691. We want to
propagate the error messages to the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
Labels
api Change to the API, requires extra care
9 participants