
[Healthcheck] Protocol dependant checks #61

Open
johnsudaar opened this issue Jul 23, 2019 · 15 comments


@johnsudaar

Health checks should be protocol dependent.

If we're using LinK for a distributed PGSQL setup, we should have a PGSQL health check. A TCP check is not enough, especially if the backend is proxied by another service (haproxy, nginx).
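
For illustration, here is a minimal Go sketch of what such a protocol-aware check could look like. The struct, the DSN, and the driver choice are assumptions for the example, not LinK's actual API:

```go
package healthcheck

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver, an illustrative choice
)

// PostgreSQLCheck is a hypothetical protocol-aware check: instead of a
// bare TCP dial, it performs the PostgreSQL handshake and runs a trivial
// query, so a proxy (haproxy, nginx) that merely accepts TCP connections
// is not enough to be reported healthy.
type PostgreSQLCheck struct {
	DSN     string        // e.g. "postgres://monitor@10.0.0.1:5432/postgres?sslmode=disable"
	Timeout time.Duration // overall budget for one check
}

func (c PostgreSQLCheck) Healthy() error {
	ctx, cancel := context.WithTimeout(context.Background(), c.Timeout)
	defer cancel()

	db, err := sql.Open("postgres", c.DSN)
	if err != nil {
		return err
	}
	defer db.Close()

	// PingContext establishes a real PostgreSQL connection, which a bare
	// TCP accept from a proxy in front would not satisfy.
	if err := db.PingContext(ctx); err != nil {
		return err
	}

	// A trivial query confirms the server actually answers the protocol.
	var one int
	return db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
}
```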

johnsudaar changed the title from [Healthcheck] Protocol dependent checks to [Healthcheck] Protocol dependant checks on Jul 23, 2019
@Soulou
Member

Soulou commented Jul 23, 2019

I disagree here. LinK's role is not to ping the backend, but to check that HAProxy is present. If the backend is down, it means that the IP will completely disappear, no one will use it, and apps will get 'no route to host'.

That's not what we want. We want to reach one HAProxy (the one holding the IP), which will keep connections for up to 60 seconds (default configuration) and forward them to the backend once it's up again.
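
For contrast with the sketch above, the check described here only needs to verify that something (HAProxy) is listening on the IP. A rough sketch of that TCP-level check; the type and its fields are illustrative, not LinK's actual code:

```go
package healthcheck

import (
	"net"
	"time"
)

// TCPCheck only verifies that something is accepting connections on
// Addr, i.e. that HAProxy is present. It deliberately says nothing
// about the backend HAProxy is proxying to.
type TCPCheck struct {
	Addr    string        // e.g. "10.0.0.1:5432"
	Timeout time.Duration
}

func (c TCPCheck) Healthy() error {
	conn, err := net.DialTimeout("tcp", c.Addr, c.Timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}
```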

@johnsudaar
Author

LinK's goal is to manage an IP and fail over if its backend is not able to do its work.

If there's no backend available, the current host is not healthy and should not take the IP.

@Soulou
Member

Soulou commented Jul 23, 2019

But it's not the behavior we want: doing this, you're going to drop more connections and create more unavailability.

@Soulou
Member

Soulou commented Jul 23, 2019

And in our use case, the backend is an HAProxy instance, not what HAProxy is proxying to; that has to be monitored differently.

@johnsudaar
Author

We can add a TimeBeforeFail on the health check if that's what's worrying you (and set it to 70s for PGSQL).
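
A sketch of what such a TimeBeforeFail knob could look like, wrapping an existing check; the Checker interface and the field names are hypothetical, not LinK's actual implementation:

```go
package healthcheck

import "time"

// Checker is a hypothetical interface for any health check.
type Checker interface {
	Healthy() error
}

// GracePeriodCheck only reports a failure once the wrapped check has
// been failing continuously for longer than TimeBeforeFail, so a
// transient backend outage (shorter than, say, 70s for PGSQL) does not
// trigger a failover.
type GracePeriodCheck struct {
	Check          Checker
	TimeBeforeFail time.Duration

	firstFailure time.Time // zero value means "currently passing"
}

func (g *GracePeriodCheck) Healthy() error {
	err := g.Check.Healthy()
	if err == nil {
		g.firstFailure = time.Time{}
		return nil
	}
	if g.firstFailure.IsZero() {
		g.firstFailure = time.Now()
	}
	if time.Since(g.firstFailure) < g.TimeBeforeFail {
		// Still inside the grace period: report healthy for now.
		return nil
	}
	return err
}
```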

@Soulou
Member

Soulou commented Jul 23, 2019

Not only PGSQL, just all of them

@Soulou
Member

Soulou commented Jul 23, 2019

But still, it's not a TimeBeforeFail that we want, actually.

  1. The DB fails.
  2. 50 seconds later, the user restarts their app and new connections arrive.
  3. I want them to reach HAProxy, which will keep them for 60 seconds and retry the backend during this period.

-> So it's not even about 70s; it's just about having something handling the connections.

@johnsudaar
Author

johnsudaar commented Jul 23, 2019

Okay, and if there is a SAND issue on one host and the connection between HAProxy and the nodes is broken on that host, we will never fail over even if we have a perfectly healthy host in the cluster.

@johnsudaar
Author

The timeout-and-crash thing is an edge case, and if the DB is down for more than 60s the user will see errors anyway. What's the point of routing the IP to a non-functioning server? I get the master failover part: if it's down for less than 60s we do not want to fail over. But in your previous example the user will still see errors.

@Soulou
Member

Soulou commented Jul 23, 2019

To me it's a much, much less likely edge case than Linux failing on the VXLAN networking stack. (It's not SAND; SAND just does the setup, nothing else after that.)

I'm not saying the user won't see errors if the outage is long, but if it's a transient 30-second error related to load, it would be almost transparent (except for the delay itself) and the connections would then be forwarded to the backend.

We would route to a fully working IP, which is accepting connections and waiting for a backend to be ready (all proxies would be in that state). I prefer having clients take their chance at getting a connection and waiting 60 seconds over dropping all the packets by default.

@johnsudaar
Author

But after 60s you would drop the packets nevertheless. And if it's a 30s error related to load, we won't fail over anyway, because 30s is less than the 60s configured.

The IP is not fully working if there's no backend behind it.

@johnsudaar
Author

Plus it's not only the VXLAN networking stack. It could also be the networking between the host running HAProxy and the host running the current master PGSQL DB.

@Soulou
Member

Soulou commented Jul 23, 2019

Yes, we would drop the packets after 60 seconds, but they would have been retried during those 60 seconds, increasing the chances that they get through to the backend compared to dropping them at once.

If it's a 180s incident related to load, where 10 connections arrive at 50s and 10 connections are created at 160s, 10 connections would go through; you would drop all 20 of them.

I think if the networking fails between hosts of a virtual infrastructure, that would be the least of our problems...

@EtienneM
Member

EtienneM commented Jul 23, 2019

I think that LinK health checks should only be about HAProxy and not what is behind it, for all the reasons @Soulou said. But @johnsudaar has a point: if there is a network issue between HAProxy and the backend, the LinK IP should go to another HAProxy.

For that, we could add another port on HAProxy with a new service running behind it which is a health check service. The LinK agent queries the HAProxy health check port. That service can, for instance, create a TCP connection to the backend in order to see if there is an issue there.

With this solution, LinK only health checks HAProxy's healthiness, doesn't it?
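
A rough Go sketch of that sidecar idea, assuming a made-up port, path, and backend address; this is not an existing LinK or HAProxy feature:

```go
// A tiny HTTP service running next to HAProxy that the LinK agent could
// query instead of probing the backend itself.
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

const backendAddr = "10.0.1.5:5432" // hypothetical PostgreSQL master

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// The suggestion above: try a TCP connection towards the backend.
		conn, err := net.DialTimeout("tcp", backendAddr, 2*time.Second)
		if err != nil {
			http.Error(w, "backend unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		conn.Close()
		w.WriteHeader(http.StatusOK)
	})

	// The LinK agent would hit this port rather than the backend directly.
	log.Fatal(http.ListenAndServe(":8405", nil))
}
```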

@Soulou
Member

Soulou commented Jul 23, 2019

For that, we could add another port on HAProxy with a new service running behind it which is a health check service. The LinK agent queries the HAProxy health check port. That service can, for instance, create a TCP connection to the backend in order to see if there is an issue there.

Not a TCP connection (it would have the same issue of reaching the backend and producing weird log lines), but an ICMP ping, for instance.
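
A sketch of that ICMP variant for the sidecar's probe, shelling out to the system ping to avoid raw-socket privileges; the host and timeout are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// pingBackend sends a single ICMP echo request to host, so the probe
// never opens a connection to the backend and therefore never shows up
// in its logs. Shelling out to the system `ping` avoids needing raw
// sockets; golang.org/x/net/icmp would be the in-process alternative.
func pingBackend(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "ping", "-c", "1", host).Run()
}

func main() {
	if err := pingBackend("10.0.1.5", 2*time.Second); err != nil { // hypothetical backend host
		fmt.Println("backend unreachable:", err)
		return
	}
	fmt.Println("backend reachable")
}
```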
