Use Echo method to send inactivity probe #368

pperiyasamy · 2023-09-19T11:57:27Z

This commit moves sendEcho call into goroutine to avoid chances for it gets blocked when nbdb leader is accidently disconnected, Otherwise there is no way to disconnect existing client connection and make it to connect with new leader.

pperiyasamy · 2023-09-19T11:58:21Z

/cc @jcaamano @dcbw

pperiyasamy · 2023-09-25T11:15:24Z

/hold

We may have to make send echo to be a time bound call as it currently holding rpc mutex's RLock for longer period as it takes almost 12 mins to return unexpected EOF error when nbdb leader is accidentally gone away.
During this period reconnect is not possible because Disconnect() seems waiting to acquire rpc mutex's Lock because of above mentioned reason.

pperiyasamy · 2023-09-25T15:49:32Z

We may have to make send echo to be a time bound call as it currently holding rpc mutex's RLock for longer period as it takes almost 12 mins to return unexpected EOF error when nbdb leader is accidentally gone away.
During this period reconnect is not possible because Disconnect() seems waiting to acquire rpc mutex's Lock because of above mentioned reason.

Using client's Echo method directly to fix the issue. Please take a look now.

jcaamano · 2023-09-25T16:09:27Z

client/client.go

-				o.Disconnect()
-				return
-			}
-			lastEcho = ""
 		case <-time.After(o.options.inactivityTimeout):


I think it is better to re-use the timer instead of using a different one on each for loop, otherwise we can end up with a lot of concurrent running timers if we are constantly seeing traffic

Isn't the timer eligible for garbage collection when select receives signal from trafficSeen channel ?

It is but if we spawn them at a greater rate than they are garbage collected (which requires the timer to complete first) then we have a problem. Seems too much of a risk to me when we could just re-use the same timer.

Ok @jcaamano , added a code now to reuse the same timer.

jcaamano · 2023-09-25T16:13:40Z

client/client.go

-			// Can't use o.Echo() because it blocks; we need the Call object direct from o.rpcClient.Go()
-			call := o.sendEcho(args, &reply)
-			if call == nil {
+			ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout)


isn't this going to block ongoing transacts because we are not consuming the trafficSeen channel in the meantime?

Yes, It blocks ongoing transactions. Now calling Echo in a separate goroutine.

jcaamano · 2023-09-26T09:19:52Z

client/client.go

-						// Otherwise stuff thisEcho into the echoReplied channel
-						echoReplied <- thisEcho
-					}
+				ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout)


two nuances:

Do we make any checks for o.options.timeoutto be smaller than o.options.inactivityTimeout? Or should we expect multiple instances of this goroutine to be running simultaneously?

We might see traffic while the echo is in progress and still disconnect if the echo itself fails.

I don't think any of these are necessarily a problem but can be disconcerting if we end up seeing logs and troubleshooting scenarios where any of that is happening.

Do we make any checks for o.options.timeoutto be smaller than o.options.inactivityTimeout? Or should we expect multiple instances of this goroutine to be running simultaneously?

We might see traffic while the echo is in progress and still disconnect if the echo itself fails.

The first concern is fixed with commit e0bed3f.

For the 2nd one, Do you think echo fails whereas transaction would succeed. Does that really happen (both goes via same tcp channel) ? Do you want to have some retries before going for a disconnect ? If this is not really a concern for now, we can skip doing it.

I think I am just happy if we check that o.options.timeout is smaller than o.options.inactivityTimeout. I would drop everything else from e0bed3f

Ok @jcaamano, done with commit c287d43 itself.
Hope you're still fine with cf04722 for go bump.

This commit uses client's Echo method to send inactivity probe as it can be made as a time bound call so that rpc read lock can not be held indefinitely. This would make disconnect happens immediately when current ovsdb leader is accidentally gone away (this happens in a very rare scenario in which case sendEcho method returns with unexpected EOF error after 12 mins only) and reconnects with new ovsdb leader. Signed-off-by: Periyasamy Palanisamy <[email protected]>

This bump up go into 1.19.6 and fixes lint errors generated in CI after upgrading go version. Signed-off-by: Periyasamy Palanisamy <[email protected]>

halfcrazy · 2023-11-08T06:41:46Z

client/client.go

+	if reflect.TypeOf(o.conn).String() == "*net.TCPConn" {
+		tcpConn := o.conn.(*net.TCPConn)
+		if o.options.timeout > 0 {
+			err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))


Do we need to SetDeadline for each call? If so, there requires a mutex.

BTW: How does this affect rpcClient.CallWithContext? Does rpcClient discard the timeout context?

yes, The SetDeadline looks to be needed for every read/write on tcp channel. This method is always invoked after o.rpcMutex is acquired. so we're safe here.

correct, The rpcClient.CallWithContext method is not fully timebound call, so fixing it here based on feedback from a PR raised there.

I found a related issue in grpc-go maybe you can take a look.

issue: grpc/grpc#15889
pr: https://github.com/grpc/grpc-go/pull/2307/files

Thanks @halfcrazy , yes, it's the same issue that we are trying to fix in libovsdb client as well.
added a commit 3867774 to set up tcp user timeout.

@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?

This sets up tcp connection deadline for write and read requests to fail after a deadline has been exceeded. Signed-off-by: Periyasamy Palanisamy <[email protected]>

This commit sets TCP_USER_TIMEOUT socket option for tcp connection so that channel write doesn't block indefinitely on network disconnect. Signed-off-by: Periyasamy Palanisamy <[email protected]>

jcaamano

@pperiyasamy Isn't the TCP user timeout an alternative to setDeadline, so we wouldn't need both? If not, are you able to reproduce a problem where one of the approaches is not enough and the other one is needed? I would stick with the TCP user timeout otherwise, I think.

jcaamano · 2023-11-20T14:06:04Z

client/client.go

+	// a rpc request to the server.
+	if tcpConn, ok := o.conn.(*net.TCPConn); ok {
+		if o.options.timeout > 0 {
+			err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))


Why * 3? Wouldn't it make sense to use inactivityTimeout for this?

jcaamano · 2023-11-20T14:06:30Z

client/client.go

+	if reflect.TypeOf(o.conn).String() == "*net.TCPConn" {
+		tcpConn := o.conn.(*net.TCPConn)
+		if o.options.timeout > 0 {
+			err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))


@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?

jcaamano · 2023-11-20T14:07:08Z

client/client.go

+	// channel write doesn't block indefinitely on network disconnect.
+	var userTimeout time.Duration
+	if o.options.timeout > 0 {
+		userTimeout = o.options.timeout * 3


same here, why * 3? Wouldn't it make sense to use inactivityTimeout for this as well?

jcaamano · 2023-11-20T14:11:48Z

client/client.go

+	if o.options.timeout > 0 {
+		userTimeout = o.options.timeout * 3
+	} else {
+		userTimeout = defaultTimeout


I think that to start with we shouldn't have a default, and just set it to inactivityTimeout when that is set.

coveralls · 2024-01-16T23:51:39Z

Pull Request Test Coverage Report for Build 6325970834

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

29 of 40 (72.5%) changed or added relevant lines in 4 files are covered.
203 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.09%) to 76.07%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
ovsdb/schema.go	0	1	0.0%
client/options.go	1	3	33.33%
modelgen/generator.go	0	2	0.0%
client/client.go	28	34	82.35%

Files with Coverage Reduction	New Missed Lines	%
client/options.go	4	75.61%
client/client.go	199	74.58%

Totals
Change from base Build 6159565928:	0.09%
Covered Lines:	5280
Relevant Lines:	6941

💛 - Coveralls

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from b50fa7f to 8cbc0ed Compare September 25, 2023 12:55

pperiyasamy changed the title ~~Move sendEcho into goroutine for inactivity probe~~ Use Echo method to send inactivity probe Sep 25, 2023

jcaamano reviewed Sep 25, 2023

View reviewed changes

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from b865ddb to 49be92a Compare September 25, 2023 19:28

jcaamano reviewed Sep 26, 2023

View reviewed changes

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 3 times, most recently from 464e26b to a1d471d Compare September 26, 2023 13:14

pperiyasamy added 2 commits September 27, 2023 14:30

Bump go 1.19.6

cf04722

This bump up go into 1.19.6 and fixes lint errors generated in CI after upgrading go version. Signed-off-by: Periyasamy Palanisamy <[email protected]>

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from a1d471d to cf04722 Compare September 27, 2023 12:33

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from e719ea9 to d569373 Compare November 8, 2023 06:15

halfcrazy reviewed Nov 8, 2023

View reviewed changes

Set deadline for tcp connection

06c1f43

This sets up tcp connection deadline for write and read requests to fail after a deadline has been exceeded. Signed-off-by: Periyasamy Palanisamy <[email protected]>

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from d569373 to 304a102 Compare November 9, 2023 07:18

Set user timeout for tcp connection

3867774

This commit sets TCP_USER_TIMEOUT socket option for tcp connection so that channel write doesn't block indefinitely on network disconnect. Signed-off-by: Periyasamy Palanisamy <[email protected]>

pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from 304a102 to 3867774 Compare November 9, 2023 07:28

jcaamano reviewed Nov 20, 2023

View reviewed changes

This was referenced Oct 23, 2024

[Bug] Sometimes DeadLock when set inactivityTimeout > 0 #391

Open

[BUG] sometimes' pod always being status of ContainterCreating when set inactivityTimeout kubeovn/kube-ovn#4560

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Echo method to send inactivity probe #368

Use Echo method to send inactivity probe #368

pperiyasamy commented Sep 19, 2023

pperiyasamy commented Sep 19, 2023

pperiyasamy commented Sep 25, 2023

pperiyasamy commented Sep 25, 2023

jcaamano Sep 25, 2023

pperiyasamy Sep 25, 2023

jcaamano Sep 25, 2023

pperiyasamy Sep 25, 2023

jcaamano Sep 25, 2023

pperiyasamy Sep 25, 2023

jcaamano Sep 26, 2023

pperiyasamy Sep 26, 2023

jcaamano Sep 27, 2023

pperiyasamy Sep 27, 2023 •

edited

Loading

halfcrazy Nov 8, 2023

pperiyasamy Nov 8, 2023

halfcrazy Nov 8, 2023

pperiyasamy Nov 9, 2023

jcaamano Nov 20, 2023

jcaamano left a comment •

edited

Loading

jcaamano Nov 20, 2023

jcaamano Nov 20, 2023

jcaamano Nov 20, 2023 •

edited

Loading

jcaamano Nov 20, 2023

coveralls commented Jan 16, 2024 •

edited

Loading

Use Echo method to send inactivity probe #368

Are you sure you want to change the base?

Use Echo method to send inactivity probe #368

Conversation

pperiyasamy commented Sep 19, 2023

pperiyasamy commented Sep 19, 2023

pperiyasamy commented Sep 25, 2023

pperiyasamy commented Sep 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pperiyasamy Sep 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcaamano left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcaamano Nov 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Jan 16, 2024 • edited Loading

Pull Request Test Coverage Report for Build 6325970834

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

pperiyasamy Sep 27, 2023 •

edited

Loading

jcaamano left a comment •

edited

Loading

jcaamano Nov 20, 2023 •

edited

Loading

coveralls commented Jan 16, 2024 •

edited

Loading