Fix for Generic error for persistent task on starting replication #1003

Draft
wants to merge 12 commits into base: main

Conversation

nisgoel-amazon
Contributor

Description

This change fixes the generic error returned when replication fails during the restore phase.
It also returns the acknowledgement as 200 when replication fails in the restore phase and stores the failure reason in the metadata store, where it can be viewed via the replication status API (see the sketch below).
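A minimal sketch of the intended behavior, not the plugin's actual code: on a restore failure the task moves to a FAILED state and keeps the reason so the status API can report it later, instead of the caller seeing a generic persistent-task timeout. The types and names below are illustrative assumptions.

// Illustrative sketch only; the types and names here are assumptions, not the plugin's API.
enum class ReplicationState { RESTORING, FAILED }

data class ReplicationStatus(
    var state: ReplicationState,
    var failureReason: String? = null
)

fun onRestoreFailure(status: ReplicationStatus, e: Exception) {
    status.state = ReplicationState.FAILED
    // Persisting the reason lets the replication status API surface it later.
    status.failureReason = e.message ?: "Unable to initiate restore call"
}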

Issues Resolved

#707

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -828,7 +828,11 @@ open class IndexReplicationTask(id: Long, type: String, action: String, descript
}
} catch(e: Exception) {
val err = "Unable to initiate restore call for $followerIndexName from $leaderAlias:${leaderIndex.name}"
val aliasErrMsg = "cannot rename index [${leaderIndex.name}] into [$followerIndexName] because of conflict with an alias with the same name"
Collaborator

Relying on the exception message isn't robust, since any simple change to the string will break the logic.

Have we exhausted other ways to detect this (exception type, error code, etc.)? If so, we may want to check on a substring instead of the complete sentence.

Contributor Author

This is just to show a specific error message in the case of an alias conflict. The main concern is to handle the failed state during the restore phase. Even if the string changes in the future, we still get the failed state with a message about "unable to initiate restore", and the more specific error will appear in the logs.
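A sketch of the flow being described, reusing err and aliasErrMsg from the diff above; the final reason selection and the logging call are illustrative, not the exact committed code.

// Prefer the specific alias-conflict message when it applies, otherwise fall back
// to the generic "unable to initiate restore" error; the full exception still goes
// to the logs. (Illustrative; `log` is assumed to be the task's logger.)
val reason = if (e.message?.contains(aliasErrMsg) == true) aliasErrMsg else err
log.error("$reason; cause: ${e.message}", e)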

Comment on lines 130 to 131
(!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING)
(!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING) ||
(!replicateIndexReq. waitForRestore && replicationState == ReplicationState. FAILED)
Collaborator

Does handling for FAILED exist?

(Also a nit - there is a space after the .)

Does this need any update to the test case?

Contributor Author

Ohh yes, thanks for pointing out the space after the dot.
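For reference, the condition without the stray spaces would read as follows (a sketch of the quoted lines above, not necessarily the final committed form):

// Acknowledge the start request early when the caller did not ask to wait for
// restore and the task is either still restoring or has already failed.
(!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING) ||
    (!replicateIndexReq.waitForRestore && replicationState == ReplicationState.FAILED)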

@@ -828,7 +828,11 @@ open class IndexReplicationTask(id: Long, type: String, action: String, descript
}
} catch(e: Exception) {
val err = "Unable to initiate restore call for $followerIndexName from $leaderAlias:${leaderIndex.name}"
val aliasErrMsg = "cannot rename index [${leaderIndex.name}] into [$followerIndexName] because of conflict with an alias with the same name"
Member

This scenario is one of the cases in which replication fails to start and a proper message is not thrown. We need to make it generic enough to cover all cases.

Member

+1

Contributor Author
nisgoel-amazon Jun 20, 2023

Yes. For the other cases, if something fails in the restore state, we will get this message in the index replication status API: "Unable to initiate restore call for followerIndexName from leaderAlias:{leaderIndex.name}".

Initially, when the failed state was not handled properly, we got the generic error "Timed out when waiting for persistent task after 1m", which might have happened because of a failure in the restore state or an error in any other state.
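With the failed state handled, the stored reason becomes visible through the replication status API rather than only in the logs, e.g. GET _plugins/_replication/<follower-index>/_status; the exact response fields depend on the plugin version, but the persisted failure reason is what that API is expected to surface instead of the generic timeout.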

Contributor Author

Updated the failed state message so it carries whatever exception message comes through.

codecov bot commented Jun 21, 2023

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 74.64%. Comparing base (d404742) to head (6dcbd7d).
Report is 2 commits behind head on main.

❗ Current head 6dcbd7d differs from pull request most recent head faa1612. Consider uploading reports for the commit faa1612 to get more accurate results

Files Patch % Lines
...TransportReplicateIndexClusterManagerNodeAction.kt 50.00% 0 Missing and 1 partial ⚠️
...rch/replication/task/index/IndexReplicationTask.kt 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1003      +/-   ##
============================================
- Coverage     76.15%   74.64%   -1.52%     
+ Complexity     1042     1026      -16     
============================================
  Files           141      141              
  Lines          4785     4784       -1     
  Branches        527      527              
============================================
- Hits           3644     3571      -73     
- Misses          789      863      +74     
+ Partials        352      350       -2     

☔ View full report in Codecov by Sentry.
