feat: support a different timeout for the last replica #1176
Conversation
Force-pushed from 7d49593 to 5e0858b
UPDATE: Keeping this comment for historical purposes, but it is no longer relevant. The approach in this PR is now much different. I had to enable pprof on the engine to understand why this wasn't working the way I wanted up until at least b8e8d7a. The reason is this:
In a test case in which I blocked communication with all replicas simultaneously:
NOTE: Go's locks are write-biased. Once a writer attempts to acquire the lock, no new readers can acquire it until after the writer has.
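The write bias noted above can be demonstrated with a small sketch (the helper below is illustrative, not code from this PR): a writer blocked behind an existing reader still prevents any new reader from acquiring the read lock until the writer has had its turn.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoWriteBias returns the order in which lock events occur. It shows
// that once a writer is waiting on a sync.RWMutex, a new reader cannot
// acquire the read lock until the pending writer has acquired and
// released it.
func demoWriteBias() []string {
	var mu sync.RWMutex
	events := make(chan string, 3)

	mu.RLock() // an existing reader holds the read lock

	go func() {
		mu.Lock() // blocks behind the reader, but registers a pending writer
		events <- "writer acquired"
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer begin waiting

	go func() {
		mu.RLock() // blocked: a new reader may not jump ahead of the writer
		events <- "new reader acquired"
		mu.RUnlock()
	}()
	time.Sleep(100 * time.Millisecond)

	events <- "first reader releasing"
	mu.RUnlock()

	order := make([]string, 0, 3)
	for i := 0; i < 3; i++ {
		order = append(order, <-events)
	}
	return order
}

func main() {
	fmt.Println(demoWriteBias())
}
```

The practical consequence for the engine is that long-held read locks plus a single blocked writer can stall every subsequent reader.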
Force-pushed from 828f7aa to 315faca
LGTM.
Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
…outShort Longhorn 8711 Signed-off-by: Eric Weber <[email protected]>
Force-pushed from f0077e3 to 0e9215b
I ended up making the long timeout double the short one and not adjusting the potential range of the short one. This makes the timeouts 8s / 16s out-of-the-box, with the ability to increase them to 30s / 60s with the caveat from longhorn/longhorn#8711 (comment).
I'll buy that.
LGTM. Only two comments.
LGTM
Thank you for the good additional refactoring of the timeOfLastActivity logic. It is easier to understand than the previous logic.
@mergify backport v1.7.x v1.6.x
✅ Backports have been created
Which issue(s) this PR fixes:
longhorn/longhorn#8711
What this PR does / why we need it:
The implementation in this PR makes it possible to configure the engine so that there are two engineReplicaTimeouts (short and long). The backends lightly coordinate via a new SharedTimeouts struct to ensure that most of them can time out in the normal way after engineReplicaTimeoutShort, but exactly one of them must wait engineReplicaTimeoutLong to do the same.
Note that this PR does NOT actually configure a different engineReplicaTimeoutLong. My plan is to do that in a followup after this one is approved and we decide exactly how we want to expose the new capability.
Special notes for your reviewer:
Additional documentation or context
Per #1176 (comment), I experimented with a different approach in https://github.com/ejweber/longhorn-engine/tree/8711-last-replica-timeout-previous-attempt. That one didn't work well due to lock contention between I/O operations, replica error handling, and the new logic.