
Fix for Generic error for persistent task on starting replication #1003

Draft · wants to merge 12 commits into base: main
@@ -127,7 +127,8 @@ class TransportReplicateIndexClusterManagerNodeAction @Inject constructor(transp
persistentTasksService.waitForTaskCondition(task.id, replicateIndexReq.timeout()) { t ->
val replicationState = (t.state as IndexReplicationState?)?.state
replicationState == ReplicationState.FOLLOWING ||
-    (!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING)
+    (!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING) ||
+    (!replicateIndexReq. waitForRestore && replicationState == ReplicationState. FAILED)
Collaborator:
Does handling for FAILED exist?

(Also a nit - there is a space after the .)

Does this need any update to the test case?

Contributor Author:
Ohh yes, thanks for pointing out the space after the dot.
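For reference, a minimal sketch of how the added condition might read once the stray spaces are removed (all identifiers are taken from the diff above; this is a sketch, not the final committed code):

```kotlin
// Sketch only: the wait condition with the whitespace nit fixed.
// Treating FAILED as a terminal condition lets the wait return early and
// surface the task's failure instead of hitting the generic timeout.
persistentTasksService.waitForTaskCondition(task.id, replicateIndexReq.timeout()) { t ->
    val replicationState = (t.state as IndexReplicationState?)?.state
    replicationState == ReplicationState.FOLLOWING ||
        (!replicateIndexReq.waitForRestore && replicationState == ReplicationState.RESTORING) ||
        (!replicateIndexReq.waitForRestore && replicationState == ReplicationState.FAILED)
}
```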

}

listener.onResponse(AcknowledgedResponse(true))
@@ -828,7 +828,11 @@ open class IndexReplicationTask(id: Long, type: String, action: String, descript
}
} catch(e: Exception) {
val err = "Unable to initiate restore call for $followerIndexName from $leaderAlias:${leaderIndex.name}"
+val aliasErrMsg = "cannot rename index [${leaderIndex.name}] into [$followerIndexName] because of conflict with an alias with the same name"
Collaborator:
Relying on the exception message isn't robust, as any simple change in the string will break the logic.

Have we exhausted other ways to detect this (exception type, error code, etc.)? If so, we may want to check on a substring instead of the complete sentence.
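For illustration, a hedged sketch of the substring approach suggested here; the shorter fragment chosen below ("conflict with an alias") is an assumption for readability, not a string confirmed to be stable in OpenSearch:

```kotlin
// Sketch only: match a narrower fragment of the message so minor upstream
// rewording is less likely to break the detection. The fragment below is an
// assumption, not a verified stable part of the OpenSearch error text.
val aliasConflictFragment = "conflict with an alias"
if (e.message?.contains(aliasConflictFragment) == true) {
    return FailedState(Collections.emptyMap(), aliasErrMsg)
}
return FailedState(Collections.emptyMap(), err)
```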

Contributor Author:
This is just to show a specific error message in the case of an alias conflict. The main concern is to handle the failed state during restore. Even if the string changes in the future, we still get the failed state with a message related to "unable to initiate restore", and the more specific error will appear in the logs.

Member:
This scenario is one of the cases in which replication fails to start and a proper message is not thrown. We need to make this generic to cover all cases.

Member:
+1

Contributor Author (@nisgoel-amazon, Jun 20, 2023):
Yes, for other cases, if something fails in the restore state, we will get this message in the index replication status API: "Unable to initiate restore call for followerIndexName from leaderAlias:{leaderIndex.name}".

Initially, when the failed state was not handled properly, we got the generic error "Timed out when waiting for persistent task after 1m", which might have happened because of a failure in the restore state or an error in any other state.

Contributor Author:
Updated the failed state message to carry whatever exception message comes through.
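A minimal sketch of what that might look like, assuming the follow-up commit simply forwards the exception message into FailedState (the actual follow-up diff is not shown in this hunk):

```kotlin
// Sketch only: fall back to the generic "unable to initiate restore" text
// when the exception carries no message of its own.
log.error(err, e)
return FailedState(Collections.emptyMap(), e.message ?: err)
```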

log.error(err, e)
+if (e.message!!.contains(aliasErrMsg)) {
+    return FailedState (Collections.emptyMap(), aliasErrMsg)
+}
return FailedState(Collections.emptyMap(), err)
}
cso.waitForNextChange("remote restore start") { inProgressRestore(it) != null }