Actually disconnect on timeout #172

Open
Drahflow wants to merge 2 commits into master from 2015-04-21-disconnect-on-timeout
Conversation

Drahflow
Contributor

lymph.core.connection.Connection pings the endpoint regularly
to determine whether it is still live. `__init__` takes an
`unresponsive_disconnect` parameter, which has so far been
ignored. Similarly, the `idle_disconnect` parameter had no effect.

This commit uses these two parameters to do what they advertise,
i.e. close the connection once the respective timeouts have been
reached.

To this end, the greenlet management logic was also changed, so
that the monitoring greenlets now terminate themselves when the
connection `status == CLOSED` instead of being explicitly killed
from `close()`. Without this change, the monitoring greenlet would
inadvertently kill itself, leading to (very non-obvious) dead code and
leaving a half-dead connection object still registered with the server.

(edited as requested by @mouadino)
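
A minimal sketch of the described behaviour (a simplified illustration, not the actual lymph code; the string status values, the `last_message` attribute, and the default numbers are assumptions made for brevity):

```python
import time

import gevent


class Connection(object):
    # Minimal sketch only; the real lymph Connection carries much more state.
    def __init__(self, endpoint, timeout=3.0,
                 idle_disconnect=30.0, unresponsive_disconnect=30.0):
        self.endpoint = endpoint
        self.timeout = timeout
        self.idle_disconnect = idle_disconnect
        self.unresponsive_disconnect = unresponsive_disconnect
        self.status = 'RESPONSIVE'
        self.last_seen = time.monotonic()     # last pong received
        self.last_message = time.monotonic()  # last payload sent or received

    def update_status(self):
        now = time.monotonic()
        if now - self.last_seen >= self.unresponsive_disconnect:
            self.close()  # no pong for too long: actually disconnect
        elif now - self.last_message >= self.idle_disconnect:
            self.close()  # nothing sent or received for too long: disconnect
        elif now - self.last_seen >= self.timeout:
            self.status = 'UNRESPONSIVE'

    def live_check_loop(self):
        # The monitoring greenlet now exits on its own once the connection
        # is closed, instead of being killed from close().
        while self.status != 'CLOSED':
            self.update_status()
            gevent.sleep(1)

    def close(self):
        self.status = 'CLOSED'
```

The point of the loop condition is that `close()` only flips the status; the monitoring greenlet then winds down on its own rather than being killed from the outside.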

@mouadino
Contributor

@Drahflow can you improve your commit description? Sorry, but I can't understand it.

Maybe start with "What is going on?", "What does this lead to?", and "How is it fixed?".

@Drahflow Drahflow force-pushed the 2015-04-21-disconnect-on-timeout branch from cf3e737 to 3584433 on April 21, 2015 10:06
@emulbreh
Contributor

This makes sense to me in general (it's basically what `idle_disconnect` and `unresponsive_disconnect` were meant to be used for initially).

@@ -84,8 +84,14 @@ def live_check_loop(self):
     def update_status(self):
         if self.last_seen:
             now = time.monotonic()
-            if now - self.last_seen >= self.timeout:
+            if now - self.last_seen >= self.unresponsive_disconnect:
This timeout should only apply once the connection status is UNRESPONSIVE.

@emulbreh
Contributor

We should track the time since the last status change for disconnect timeouts.
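
Taken together with the inline review note above (apply the timeout only once the status is UNRESPONSIVE), the suggestion might look roughly like this, building on the sketch in the pull request description; `set_status` and `status_changed_at` are hypothetical names, not the actual lymph code, and `status_changed_at` would need to be initialised in `__init__`:

```python
import time


class Connection(object):
    # Hypothetical fragment extending the sketch above.
    def set_status(self, status):
        if status != self.status:
            self.status = status
            # Remember when the status last changed.
            self.status_changed_at = time.monotonic()

    def update_status(self):
        now = time.monotonic()
        if self.last_seen and now - self.last_seen >= self.timeout:
            self.set_status('UNRESPONSIVE')
        # The disconnect timeout only starts counting once the connection
        # has actually been marked UNRESPONSIVE.
        if (self.status == 'UNRESPONSIVE'
                and now - self.status_changed_at >= self.unresponsive_disconnect):
            self.close()
```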

@Drahflow
Contributor Author

Do we win anything by tracking the status age, except that the effective timeouts become larger by `timeout` / `idle_timeout`? It adds another member variable to the connection objects, which adds some (if not much) complexity.

Philosophically speaking, IMHO it makes more sense to have the `*_disconnect` parameters denote the time since unresponsiveness / idleness began (which is indeed before the status change).

Since you designed this, I'll follow your suggestion unless I hear differently from you.

@Drahflow Drahflow force-pushed the 2015-04-21-disconnect-on-timeout branch from c9391d5 to b4d5b3a on April 24, 2015 11:29
@mouadino
Contributor

@emulbreh @Drahflow Why are we doing this? Removing a connection when it's idle seems very weird to me, i.e. if we don't talk to a given instance for more than 30 seconds we no longer need the connection? Is this just an optimisation?

And the main question: how does this fit with the zookeeper heartbeat? Aren't we already disconnecting from an instance when the latter goes down or its process is unresponsive? Besides, how do we define unresponsiveness inside lymph? I don't think we add anything on top of the zookeeper heartbeat.

Heartbeat aside, the more I think about it the more I dislike this, since it assumes a lot about the network the services run in. E.g. the timeout is 3 seconds, so in case of a latency burst we would mark connections as unresponsive, which is false.

@Drahflow
Contributor Author

Drahflow commented May 4, 2015

I personally don't care much about disconnect on idle; it probably neither hurts nor helps much. I just implemented the logic for the constructor parameters that already exist.

However, I do care a lot about disconnect on unresponsiveness:

  1. If the remote end of the connection is not managed by zookeeper, e.g. it is a `lymph request`, the connection needs to close after a finite time, otherwise we keep accumulating greenlets which try to ping the remote end (see the sketch after this list). This was my original motivation to look into this at all, and it needs to be fixed one way or another. Explicitly handling "remote end closed connection" would work as well (and more quickly), but this PR solves the "endpoint is too slow" case with the same code, which I think is better.
  2. There are a lot of reasons why a TCP/IP connection might become slow, and not all of them are covered by sending zookeeper heartbeats. IMHO the one thing we care about is timely replies to our rpc requests. The connection object is in the best position to notice that such replies are not coming (because it measures delay over the same connection the requests use). If they are not coming, we should stop using the slow instance.
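
The failure mode in point 1, roughly (a hypothetical simplification; `send_ping` stands in for whatever the real pinging code does):

```python
import gevent


def live_check_loop(connection):
    # Without a disconnect timeout, nothing ever closes the connection when
    # the remote end (e.g. a finished `lymph request`) silently goes away,
    # so a greenlet like this keeps pinging a dead endpoint forever,
    # one per vanished peer.
    while True:
        connection.send_ping()  # hypothetical helper
        gevent.sleep(1)
```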

@mouadino
Contributor

mouadino commented May 6, 2015

  1. The case where services are not managed by any service discovery, where the user has to magically know where a service is, defeats the purpose of a service-oriented architecture and is a design problem that has always bothered me in lymph. But if the plan is to keep it, then maybe we should enable the heartbeat for this case only. Then again, if a service is accessed by an ip + port we are mostly talking about a single instance, so why do we need a connection heartbeat at all? FWIW currently this is only used for special services like node.
  2. We should not close a connection because the endpoint is too slow; there is a timeout on the proxy side that raises an error if a request takes longer than that, and that is the only place where a slow connection matters. As for deriving some QoS measure for each instance based only on whether it can respond to a ping, that approach is broken by design; a real solution is to implement the circuit breaker pattern, which we hope to do soon.

What do you mean by "Not all of them are covered by sending zookeeper heartbeats"? Can you elaborate? I don't see how our heartbeat algorithm is in any way more sophisticated than the zookeeper heartbeat.

AFAIK the heartbeat is used to check process liveness, not to detect latency bursts, because the latter will happen from time to time and you don't want to stop and fail all requests just because your network is having a bad moment.

Thoughts?

@Drahflow
Contributor Author

Drahflow commented May 8, 2015

Re 1.: It's the other way around. We have a usual zookeeper-registered instance A. Then I use `lymph request` to connect to A and do some stuff. Afterwards, A's connection object keeps trying to ping my `lymph request` and never stops...

Re 2.: Nothing against the circuit breaker. But the breaker also needs some event to trigger on. Do you propose to use the proxy timeouts for this?

What I meant by "not everything is covered by zookeeper": I think we send regular zookeeper heartbeats from some greenlet or other. However, this effectively only detects whether our entire process or machine is overloaded, not whether we are experiencing unacceptable latency on one connection. (The former implies the latter, but not the other way around.)

Consider a request that takes, say, 100 ms to serve (instead of the expected 1 ms). Now the client code sends 1000 of them down the same connection, followed by a ping. After 1 second, the service has completed 1% of the requests and still sends its zookeeper heartbeat; only after 100 seconds is the service ready to be useful again... Combine this with asynchronous requests on the client and I think the client's next request should go somewhere else, if possible. The same problem arises if something ugly happens at the TCP level, like a network outage only between the client and the service, but not between each of them and zookeeper.
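
To spell out the arithmetic in that scenario (a back-of-the-envelope sketch using only the numbers from the paragraph above):

```python
# Back-of-the-envelope numbers from the scenario above.
service_time = 0.100                     # seconds per request (expected: 0.001)
backlog = 1000                           # requests queued on one connection

drain_time = backlog * service_time      # 100 seconds until the backlog is gone
done_after_1s = 1.0 / service_time       # 10 requests finished after 1 second
fraction_done = done_after_1s / backlog  # 0.01, i.e. 1 %

# The zookeeper heartbeat runs in its own greenlet, so it still goes out
# during those 100 seconds and the instance keeps looking healthy, even
# though a ping queued behind the backlog will not be answered for a while.
print(drain_time, fraction_done)
```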

@mouadino
Contributor

mouadino commented May 8, 2015

  1. Yes, I understood this part: it happens because we have a two-way heartbeat, i.e. client -> server and vice versa, which is something I am still trying to figure out why we do in the first place. IMHO it should be done from the client side only, unless I am missing something.
  2. The circuit breaker will be triggered by more than just timeouts; it can also be triggered by specific errors when resources are exhausted, or when one instance of a service raises a lot of errors compared to the others (e.g. one instance having trouble connecting to its backend database while the other instances don't). That is the good thing about the circuit breaker, and it's why I think the connection layer is the wrong layer to implement this. (A rough sketch of the pattern follows this list.)
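
For reference, a minimal sketch of the circuit breaker pattern mentioned in point 2 (a generic illustration only, nothing that exists in lymph; `max_failures` and `reset_after` are made-up knobs):

```python
import time


class CircuitBreaker(object):
    """Generic sketch: trips after too many failures, whatever their cause
    (timeouts, exhausted resources, backend errors), then rejects calls
    to that instance until a cool-down period has passed."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError('circuit open, not calling the instance')
            # Half-open: give the instance another try.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```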

Yes, kazoo does the heartbeat in another greenlet, and your calculation, if I understand it correctly, misses the fact that requests run concurrently, as do the heartbeats of both kazoo and the lymph connections, since each runs in its own greenlet. Besides that, I couldn't follow your math very well, too many numbers for my taste :).

As for a network outage, I don't think it applies here, since any real setup should make sure that zookeeper and the instances live in the same network; if there is a problem with the communication between an instance and zookeeper, then the instance is already dead from all points of view.
