feat: support a different timeout for the last replica #1176
Conversation
Force-pushed from 7d49593 to 5e0858b
UPDATE: Keeping this comment for historical purposes, but it is no longer relevant. The approach in this PR is now much different. I had to enable pprof on the engine to understand why this wasn't working the way I wanted up until at least b8e8d7a. The reason is this:
In a test case in which I blocked communication with all replicas simultaneously:
NOTE: Go's locks are write-biased. Once a writer attempts to acquire the lock, no new readers can acquire it until after the writer has.
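The write bias noted above can be demonstrated with a small sketch (the helper below is illustrative, not code from this PR): a writer blocked behind an existing reader still prevents any new reader from acquiring the read lock until the writer has had its turn.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoWriteBias returns the order in which lock events occur. It shows
// that once a writer is waiting on a sync.RWMutex, a new reader cannot
// acquire the read lock until the pending writer has acquired and
// released it.
func demoWriteBias() []string {
	var mu sync.RWMutex
	events := make(chan string, 3)

	mu.RLock() // an existing reader holds the read lock

	go func() {
		mu.Lock() // blocks behind the reader, but registers a pending writer
		events <- "writer acquired"
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer begin waiting

	go func() {
		mu.RLock() // blocked: a new reader may not jump ahead of the writer
		events <- "new reader acquired"
		mu.RUnlock()
	}()
	time.Sleep(100 * time.Millisecond)

	events <- "first reader releasing"
	mu.RUnlock()

	order := make([]string, 0, 3)
	for i := 0; i < 3; i++ {
		order = append(order, <-events)
	}
	return order
}

func main() {
	fmt.Println(demoWriteBias())
}
```

The practical consequence for the engine is that long-held read locks plus a single blocked writer can stall every subsequent reader.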
Force-pushed from 828f7aa to 315faca
LGTM.
Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
…outShort Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
Force-pushed from f0077e3 to 0e9215b
I ended up making the long timeout double the short one and not adjusting the potential range of the short one. This makes the timeouts 8s / 16s out-of-the-box, with the ability to increase them to 30s / 60s with the caveat from longhorn/longhorn#8711 (comment).
I'll buy that.
LGTM. Only two comments.
LGTM
Thank you for the good additional refactoring of the timeOfLastActivity logic. It is easier to understand than the previous logic.
@mergify backport v1.7.x v1.6.x
✅ Backports have been created
Which issue(s) this PR fixes:
longhorn/longhorn#8711
What this PR does / why we need it:
The implementation in this PR makes it possible to configure the engine so that there are two engineReplicaTimeouts (short and long). The backends lightly coordinate via a new SharedTimeouts struct to ensure that most of them can time out in the normal way after engineReplicaTimeoutShort, but exactly one of them must wait engineReplicaTimeoutLong to do the same.
Note that this PR does NOT actually configure a different engineReplicaTimeoutLong. My plan is to do that in a followup after this one is approved and we decide exactly how we want to expose the new capability.
Special notes for your reviewer:
Additional documentation or context
Per #1176 (comment), I experimented with a different approach in https://github.com/ejweber/longhorn-engine/tree/8711-last-replica-timeout-previous-attempt. That one didn't work well due to lock contention between I/O operations, replica error handling, and the new logic.