
[Healthcheck] Protocol dependant checks #61

Open
johnsudaar opened this issue Jul 23, 2019 · 15 comments


@johnsudaar

Health checks should be protocol dependent.

If we're using LinK for a distributed PGSQL setup, we should have a PGSQL health check. A TCP check is not enough, especially if the backend is proxied by another service (haproxy, nginx).
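
For illustration, here is a minimal Go sketch of what such a protocol-aware check could look like. The struct, the DSN, and the driver choice are assumptions for the example, not LinK's actual API:

```go
package healthcheck

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver, an illustrative choice
)

// PostgreSQLCheck is a hypothetical protocol-aware check: instead of a
// bare TCP dial, it performs the PostgreSQL handshake and runs a trivial
// query, so a proxy (haproxy, nginx) that merely accepts TCP connections
// is not enough to be reported healthy.
type PostgreSQLCheck struct {
	DSN     string        // e.g. "postgres://monitor@10.0.0.1:5432/postgres?sslmode=disable"
	Timeout time.Duration // overall budget for one check
}

func (c PostgreSQLCheck) Healthy() error {
	ctx, cancel := context.WithTimeout(context.Background(), c.Timeout)
	defer cancel()

	db, err := sql.Open("postgres", c.DSN)
	if err != nil {
		return err
	}
	defer db.Close()

	// PingContext establishes a real PostgreSQL connection, which a bare
	// TCP accept from a proxy in front would not satisfy.
	if err := db.PingContext(ctx); err != nil {
		return err
	}

	// A trivial query confirms the server actually answers the protocol.
	var one int
	return db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
}
```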

johnsudaar changed the title from [Healthcheck] Protocol dependent checks to [Healthcheck] Protocol dependant checks on Jul 23, 2019
@Soulou
Member

Soulou commented Jul 23, 2019

I disagree here. LinK's role is not to ping the backend, but to check that HAProxy is present. If the backend is down, it means that the IP will completely disappear, no one will use it, and apps will get 'no route to host'.

That's not what we want. We want to reach one HAProxy (the one holding the IP), which will keep connections for up to 60 seconds (default configuration) and forward them to the backend once it's up again.
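
For contrast with the sketch above, the check described here only needs to verify that something (HAProxy) is listening on the IP. A rough sketch of that TCP-level check; the type and its fields are illustrative, not LinK's actual code:

```go
package healthcheck

import (
	"net"
	"time"
)

// TCPCheck only verifies that something is accepting connections on
// Addr, i.e. that HAProxy is present. It deliberately says nothing
// about the backend HAProxy is proxying to.
type TCPCheck struct {
	Addr    string        // e.g. "10.0.0.1:5432"
	Timeout time.Duration
}

func (c TCPCheck) Healthy() error {
	conn, err := net.DialTimeout("tcp", c.Addr, c.Timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}
```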

@johnsudaar
Author

LinK's goal is to manage an IP and fail over if its backend is not able to do its work.

If there's no backend available, the current host is not healthy and should not take the IP.

@Soulou
Member

Soulou commented Jul 23, 2019

But it's not the behavior we want: doing this, you're going to drop more connections and create more unavailability.

@Soulou
Member

Soulou commented Jul 23, 2019

And in our use case, the backend is an HAProxy instance, not what HAProxy is proxying to; that has to be monitored differently.

@johnsudaar
Author

We can add a TimeBeforeFail on the health check if that's what's worrying you (and set it to 70s for PGSQL).
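
A sketch of what such a TimeBeforeFail knob could look like, wrapping an existing check; the Checker interface and the field names are hypothetical, not LinK's actual implementation:

```go
package healthcheck

import "time"

// Checker is a hypothetical interface for any health check.
type Checker interface {
	Healthy() error
}

// GracePeriodCheck only reports a failure once the wrapped check has
// been failing continuously for longer than TimeBeforeFail, so a
// transient backend outage (shorter than, say, 70s for PGSQL) does not
// trigger a failover.
type GracePeriodCheck struct {
	Check          Checker
	TimeBeforeFail time.Duration

	firstFailure time.Time // zero value means "currently passing"
}

func (g *GracePeriodCheck) Healthy() error {
	err := g.Check.Healthy()
	if err == nil {
		g.firstFailure = time.Time{}
		return nil
	}
	if g.firstFailure.IsZero() {
		g.firstFailure = time.Now()
	}
	if time.Since(g.firstFailure) < g.TimeBeforeFail {
		// Still inside the grace period: report healthy for now.
		return nil
	}
	return err
}
```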

@Soulou
Member

Soulou commented Jul 23, 2019

Not only PGSQL, just all of them

@Soulou
Member

Soulou commented Jul 23, 2019

But still, it's not a TimeBeforeFail that we want, actually.

  1. The DB fails.
  2. 50 seconds later, the user restarts their app and new connections arrive.
  3. I want them to reach HAProxy, which will keep them for 60 seconds and retry the backend during this period.

-> So it's not even about 70s; it's just about having something handling the connections.

@johnsudaar
Author

johnsudaar commented Jul 23, 2019

Okay, and if there is a SAND issue on one host and the connection between HAProxy and the nodes is broken on that host, we will never fail over even if we have a perfectly healthy host in the cluster.

@johnsudaar
Author

The timeout-and-crash thing is an edge case, and if the DB is down for more than 60s the user will see errors anyway. What's the point of routing the IP to a non-functioning server? I get the master failover part: if it's down for less than 60s we do not want to fail over. But in your previous example the user will still see errors.

@Soulou
Member

Soulou commented Jul 23, 2019

To me it's a much, much less likely edge case than Linux failing on the VXLAN networking stack. (It's not SAND; SAND just does the setup, nothing else after that.)

I'm not saying the user won't see errors if the outage is long, but if it's a transient 30-second error related to load, it would be almost transparent (except for the delay itself) and the connections would then be forwarded to the backend.

We would route to a fully working IP, which is accepting connections and waiting for a backend to be ready (all proxies would be in that state). I prefer having clients take their chance at getting a connection and waiting 60 seconds over dropping all the packets by default.

@johnsudaar
Author

But after 60s you would drop the packets nevertheless. And if it's a 30s error related to load, we won't fail over anyway, because 30s is less than the 60s configured.

The IP is not fully working if there's no backend behind it.

@johnsudaar
Author

Plus it's not only the VXLAN networking stack. It could also be the networking between the host running HAProxy and the host running the current master PGSQL DB.

@Soulou
Member

Soulou commented Jul 23, 2019

Yes, we would drop the packets after 60 seconds, but they would have been retried during those 60 seconds, increasing the chances that they get through to the backend compared to dropping them at once.

If it's a 180s incident related to load, where 10 connections arrive at 50s and 10 connections are created at 160s, 10 connections would go through; you would drop all 20 of them.

I think if the networking fails between hosts of a virtual infrastructure, that would be the least of our problems...

@EtienneM
Member

EtienneM commented Jul 23, 2019

I think that LinK health checks should only be about HAProxy and not what is behind it, for all the reasons @Soulou said. But @johnsudaar has a point: if there is a network issue between HAProxy and the backend, the LinK IP should go to another HAProxy.

For that, we could add another port on HAProxy with a new service running behind it which is a health check service. The LinK agent queries the HAProxy health check port. That service can, for instance, create a TCP connection to the backend in order to see if there is an issue there.

With this solution, LinK only health checks HAProxy's healthiness, doesn't it?
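
A rough Go sketch of that sidecar idea, assuming a made-up port, path, and backend address; this is not an existing LinK or HAProxy feature:

```go
// A tiny HTTP service running next to HAProxy that the LinK agent could
// query instead of probing the backend itself.
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

const backendAddr = "10.0.1.5:5432" // hypothetical PostgreSQL master

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// The suggestion above: try a TCP connection towards the backend.
		conn, err := net.DialTimeout("tcp", backendAddr, 2*time.Second)
		if err != nil {
			http.Error(w, "backend unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		conn.Close()
		w.WriteHeader(http.StatusOK)
	})

	// The LinK agent would hit this port rather than the backend directly.
	log.Fatal(http.ListenAndServe(":8405", nil))
}
```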

@Soulou
Member

Soulou commented Jul 23, 2019

For that, we could add another port on HAProxy with a new service running behind it which is a health check service. The LinK agent queries the HAProxy health check port. That service can, for instance, create a TCP connection to the backend in order to see if there is an issue there.

Not a TCP connection (it would have the same issue of reaching the backend and producing weird log lines), but an ICMP ping, for instance.
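
A sketch of that ICMP variant for the sidecar's probe, shelling out to the system ping to avoid raw-socket privileges; the host and timeout are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// pingBackend sends a single ICMP echo request to host, so the probe
// never opens a connection to the backend and therefore never shows up
// in its logs. Shelling out to the system `ping` avoids needing raw
// sockets; golang.org/x/net/icmp would be the in-process alternative.
func pingBackend(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "ping", "-c", "1", host).Run()
}

func main() {
	if err := pingBackend("10.0.1.5", 2*time.Second); err != nil { // hypothetical backend host
		fmt.Println("backend unreachable:", err)
		return
	}
	fmt.Println("backend reachable")
}
```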
