-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Echo method to send inactivity probe #368
base: main
Are you sure you want to change the base?
Use Echo method to send inactivity probe #368
Conversation
/hold We may have to make send echo to be a time bound call as it currently holding rpc mutex's RLock for longer period as it takes almost 12 mins to return |
b50fa7f
to
8cbc0ed
Compare
Using client's Echo method directly to fix the issue. Please take a look now. |
client/client.go
Outdated
o.Disconnect() | ||
return | ||
} | ||
lastEcho = "" | ||
case <-time.After(o.options.inactivityTimeout): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to re-use the timer instead of using a different one on each for loop, otherwise we can end up with a lot of concurrent running timers if we are constantly seeing traffic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the timer eligible for garbage collection when select receives signal from trafficSeen channel ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is but if we spawn them at a greater rate than they are garbage collected (which requires the timer to complete first) then we have a problem. Seems too much of a risk to me when we could just re-use the same timer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok @jcaamano , added a code now to reuse the same timer.
client/client.go
Outdated
// Can't use o.Echo() because it blocks; we need the Call object direct from o.rpcClient.Go() | ||
call := o.sendEcho(args, &reply) | ||
if call == nil { | ||
ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this going to block ongoing transacts because we are not consuming the trafficSeen channel in the meantime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It blocks ongoing transactions. Now calling Echo in a separate goroutine.
b865ddb
to
49be92a
Compare
// Otherwise stuff thisEcho into the echoReplied channel | ||
echoReplied <- thisEcho | ||
} | ||
ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two nuances:
- Do we make any checks for
o.options.timeout
to be smaller thano.options.inactivityTimeout
? Or should we expect multiple instances of this goroutine to be running simultaneously? - We might see traffic while the echo is in progress and still disconnect if the echo itself fails.
I don't think any of these are necessarily a problem but can be disconcerting if we end up seeing logs and troubleshooting scenarios where any of that is happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Do we make any checks for
o.options.timeout
to be smaller thano.options.inactivityTimeout
? Or should we expect multiple instances of this goroutine to be running simultaneously?- We might see traffic while the echo is in progress and still disconnect if the echo itself fails.
The first concern is fixed with commit e0bed3f.
For the 2nd one, Do you think echo fails whereas transaction would succeed. Does that really happen (both goes via same tcp channel) ? Do you want to have some retries before going for a disconnect ? If this is not really a concern for now, we can skip doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am just happy if we check that o.options.timeout
is smaller than o.options.inactivityTimeout
. I would drop everything else from e0bed3f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
464e26b
to
a1d471d
Compare
This commit uses client's Echo method to send inactivity probe as it can be made as a time bound call so that rpc read lock can not be held indefinitely. This would make disconnect happens immediately when current ovsdb leader is accidentally gone away (this happens in a very rare scenario in which case sendEcho method returns with unexpected EOF error after 12 mins only) and reconnects with new ovsdb leader. Signed-off-by: Periyasamy Palanisamy <[email protected]>
This bump up go into 1.19.6 and fixes lint errors generated in CI after upgrading go version. Signed-off-by: Periyasamy Palanisamy <[email protected]>
a1d471d
to
cf04722
Compare
e719ea9
to
d569373
Compare
if reflect.TypeOf(o.conn).String() == "*net.TCPConn" { | ||
tcpConn := o.conn.(*net.TCPConn) | ||
if o.options.timeout > 0 { | ||
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to SetDeadline
for each call? If so, there requires a mutex.
BTW: How does this affect rpcClient.CallWithContext
? Does rpcClient
discard the timeout context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, The SetDeadline
looks to be needed for every read/write on tcp channel. This method is always invoked after o.rpcMutex is acquired. so we're safe here.
correct, The rpcClient.CallWithContext
method is not fully timebound call, so fixing it here based on feedback from a PR raised there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a related issue in grpc-go
maybe you can take a look.
issue: grpc/grpc#15889
pr: https://github.com/grpc/grpc-go/pull/2307/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @halfcrazy , yes, it's the same issue that we are trying to fix in libovsdb client as well.
added a commit 3867774 to set up tcp user timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?
This sets up tcp connection deadline for write and read requests to fail after a deadline has been exceeded. Signed-off-by: Periyasamy Palanisamy <[email protected]>
d569373
to
304a102
Compare
This commit sets TCP_USER_TIMEOUT socket option for tcp connection so that channel write doesn't block indefinitely on network disconnect. Signed-off-by: Periyasamy Palanisamy <[email protected]>
304a102
to
3867774
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pperiyasamy Isn't the TCP user timeout an alternative to setDeadline, so we wouldn't need both? If not, are you able to reproduce a problem where one of the approaches is not enough and the other one is needed? I would stick with the TCP user timeout otherwise, I think.
// a rpc request to the server. | ||
if tcpConn, ok := o.conn.(*net.TCPConn); ok { | ||
if o.options.timeout > 0 { | ||
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why * 3
? Wouldn't it make sense to use inactivityTimeout
for this?
if reflect.TypeOf(o.conn).String() == "*net.TCPConn" { | ||
tcpConn := o.conn.(*net.TCPConn) | ||
if o.options.timeout > 0 { | ||
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?
// channel write doesn't block indefinitely on network disconnect. | ||
var userTimeout time.Duration | ||
if o.options.timeout > 0 { | ||
userTimeout = o.options.timeout * 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, why * 3? Wouldn't it make sense to use inactivityTimeout for this as well?
if o.options.timeout > 0 { | ||
userTimeout = o.options.timeout * 3 | ||
} else { | ||
userTimeout = defaultTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that to start with we shouldn't have a default, and just set it to inactivityTimeout
when that is set.
Pull Request Test Coverage Report for Build 6325970834Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
This commit moves sendEcho call into goroutine to avoid chances for it gets blocked when nbdb leader is accidently disconnected, Otherwise there is no way to disconnect existing client connection and make it to connect with new leader.