Fix bug the engine might choose a replica with a smaller head size to be the source of truth for auto-salvage #1114
LGTM. Good catch.
Update: I am also modifying the salvageRevisionCounterDisabledReplicas logic in this PR. The old and new logic are described in the commit message.
The only downside I can think of here is that repeated attempts to salvage will always choose the same replica. If that replica is broken somehow, we can never make a different choice. I think this is not a compelling enough reason to change the logic.
Agree with Eric's notes. Looks good.
Looks like the GitHub Action is taking a long time waiting for an available runner. Do we need to change the runner from oracle-aarch64-4cpu-16gb to longhorn-infra-arm64-runners, as we did in longhorn-instance-manager (https://github.com/longhorn/longhorn-instance-manager/pull/500/files)? By the way, may I ask where (on which providers) oracle-aarch64-4cpu-16gb and longhorn-infra-arm64-runners are hosted? @derekbit @FrankYang0529
Change from map to slice looks good, too.
the source of truth for auto-salvage longhorn-8659 Signed-off-by: Phan Le <[email protected]>
The old logic:
1. Filter the replica candidates to keep only replicas that were modified within 5 seconds of the most recently modified replica.
2. Then filter to keep only replicas whose head size equals the biggest one.
3. Then pick a random replica from the set.

The new logic:
1. Filter the replica candidates to keep only replicas that were modified within 5 seconds of the most recently modified replica.
2. Then filter to keep only replicas whose head size equals the biggest one.
3. Then pick the most recently modified replica from the set.

longhorn-8659 longhorn-8563 Signed-off-by: Phan Le <[email protected]>
ping @shuo-wu Can I get an approval to merge this PR? Thank you.
LGTM.
LGTM
Then we need to record the broken replica and prevent it from being picked again in the retries.
If a replica is broken beyond repair, would it even be able to reach the running state in the first place? In addition, if the best replica here is still broken, I would prefer manual intervention by the user instead. Wdyt?
@mergify backport v1.6.x v1.5.x
✅ Backports have been created
More details are in the issue description of longhorn/longhorn#8659