[Healthcheck] Protocol dependent checks #61
I disagree here. LinK's role is not to ping the backend, but to check that HAProxy is present. If the backend is down, the IP will disappear completely, no one will use it, and apps will get 'no route to host'. That's not what we want. We want to reach one HAProxy (the one holding the IP), which will keep the connection for up to 60 seconds (default configuration) and forward it to the backend once it's up again.
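One plausible way to get the "hold the client and retry" behaviour described above is HAProxy's connect timeout and retry directives; a hedged sketch (the directive values are illustrative, not this project's actual configuration):

```
defaults
    mode tcp
    # Hold the client while we try to reach the backend instead of
    # failing the connection immediately; with retries, a client can
    # be held for roughly a minute before being dropped.
    timeout connect 20s
    retries 3
    # If the server comes back (or another becomes available),
    # redispatch the connection attempt there.
    option redispatch
```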
LinK's goal is to manage an IP and fail over if its backend is not able to perform its work. If there's no backend available, the current host is not healthy and should not take the IP.
But that's not the behavior we want: you're going to drop more connections and create more unavailability doing this.
And in our use case, the backend is an HAProxy instance, not what HAProxy is proxying to; that has to be monitored differently.
We can add a TimeBeforeFail on the healthcheck if that's what's worrying you (and set it to 70s for PGSQL).
Not only PGSQL, all of them.
But still, a TimeBeforeFail is not actually what we want.
-> So it's not even about 70s; it's about having something handling the connections.
Okay, but if there is a SAND issue on a host, and on that host the connection between HAProxy and the nodes is broken, we will never fail over, even if we have a perfectly healthy host in the cluster.
The timeout-and-crash thing is an edge case, and if the DB is down for more than 60s the user will see errors anyway. What's the point of routing the IP to a non-functioning server? I get the master failover part: if it's down for less than 60s, we do not want to fail over. But in your previous example the user will still see errors.
To me it's much, much less of an edge case than Linux failing the VXLAN networking stack (it's not SAND; SAND is just doing the setup, nothing else). I'm not saying the user won't see errors if the downtime is long, but if it's a transient 30-second error related to load, it would be almost transparent (except for the delay itself), and then the connections would be transmitted to the backend. We would route to a fully working IP, which is accepting connections and waiting for a backend to be ready (all proxies would be in that state). I prefer having clients take their chance at getting a connection and waiting 60 seconds, rather than dropping all the packets by default.
But after 60s you would drop the packet nevertheless. If it's a 30s error related to load, we won't fail over anyway, because 30s is less than the configured 60s. The IP is not fully working if there's no backend behind it.
Plus, it's not only the VXLAN networking stack. It could also be the networking between the host running HAProxy and the host running the current master PGSQL database.
Yes, we would drop the packet after 60 seconds, but it would have been retried during those 60 seconds, increasing the chance that it gets through to the backend compared to dropping it at once. If it's a 180s incident related to load, and 10 connections arrive at 50s and 10 connections are created at 160s, 10 connections would go through; you would drop all 20 of them. I think that if the networking fails between hosts of a virtual infrastructure, this would be the least of our problems...
I think that LinK health checks should only be about HAProxy and not what is behind it, for all the reasons @Soulou said. But @johnsudaar has a point: if there is a network issue between HAProxy and the backend, the LinK IP should go to another HAProxy. For that, we could expose another port next to HAProxy with a new health-check service running on it. The LinK agent queries that health-check port. The service can, for instance, create a TCP connection to the backend in order to see if there is an issue there. With this solution, LinK only health checks HAProxy's healthiness, doesn't it?
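A minimal sketch of the sidecar idea above (hypothetical, not LinK's actual API or configuration): a small service exposed next to HAProxy that the LinK agent could query over HTTP, answering 200 or 503 depending on a TCP reachability probe toward the backend. The backend address and the port 8080 are made-up placeholders.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical backend address that HAProxy forwards to.
BACKEND = ("10.0.0.1", 5432)

def backend_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to the backend succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 if the backend is reachable, 503 otherwise; the LinK
        # agent would treat anything but 200 as "unhealthy".
        status = 200 if backend_reachable(*BACKEND) else 503
        self.send_response(status)
        self.end_headers()

if __name__ == "__main__":
    # LinK would be pointed at this port instead of HAProxy's own.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```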
Not a TCP connection (it would have the same problem of reaching the backend and producing weird log lines), but an ICMP ping, for instance.
Health checks should be protocol dependent.
If we're using LinK for a distributed PGSQL, we should have a PGSQL health check. A TCP check is not enough (especially if it's proxied by another service (HAProxy, nginx)).
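To illustrate why a TCP check is not enough: a proxy in front will happily accept the connection even when nothing is behind it. A protocol-level check has to speak enough of the PostgreSQL wire protocol to know a real server answered. A hedged sketch, assuming PostgreSQL protocol v3 (the function names and the `healthcheck` user are made up for illustration):

```python
import socket
import struct

def pg_startup_message(user, database):
    """Build a PostgreSQL v3 StartupMessage: int32 length, int32
    protocol version (196608 = 3.0), then NUL-separated key/value
    parameters, ending with an extra NUL byte."""
    params = (b"user\x00" + user.encode() + b"\x00"
              + b"database\x00" + database.encode() + b"\x00\x00")
    body = struct.pack("!i", 196608) + params
    return struct.pack("!i", len(body) + 4) + body

def pg_healthy(host, port, user="healthcheck", timeout=2.0):
    """Return True only if something PGSQL-shaped answered. A real
    server replies with 'R' (authentication request) or 'E' (error);
    a bare proxy that accepted the TCP connection and then dropped
    us yields an empty read or a reset instead."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(pg_startup_message(user, user))
            first = s.recv(1)
            return first in (b"R", b"E")
    except OSError:
        return False
```

Even an `E` (error) response counts as healthy here by design: it proves a PostgreSQL server parsed our startup message, which is exactly what a plain TCP check cannot tell us.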