Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make geoip client work even if redis is broken #389

Merged
merged 1 commit into from
Jul 11, 2023

Conversation

alexey-yarmosh
Copy link
Member

Fixes #388

At the time of the issue redis was in a strange state, it was rejecting all the queries (GET, SET, etc.). Our geo ip logic uses redis to get cached value before making real requests to the 3rd party geo APIs. Since get command was throwing errors the whole geo ip function was throwing and API were unable to get the geo ips of probes and sent unresolvable geoip to them.

Probes were constantly reconnecting when API was down. But they stopped reconnecting after getting unresolvable geoip error from the API, which is valid behaviour from the perspective of the probe.

In that fix we are catching any geo ip redis errors and making request to the 3rd party geo APIs not only if there is no value in the cache but also if the cache itself is buggy/unavailable.

@MartinKolarik
Copy link
Member

But they stopped reconnecting after getting unresolvable geoip error from the API, which is valid behaviour from the perspective of the probe.

We should try to reconnect with some longer delay here as well to handle occasional 3rd party API failures, geo data updates, etc. If the probes don't attempt to reconnect at all, I'd suggest adding a retry after 1 hour or so for all such cases.

@alexey-yarmosh
Copy link
Member Author

Seems like the "stop reconnecting in case of unresolvable geoip" logic is not intentional. Here is the explicit code to stop reconnects in some cases, but unresolvable geoip is not in the fatalConnectErrors list. Real reason why it is not reconnecting is because socketio not reconnecting for errors from middlewares. It may also not reconnect for other similar reason. So we need to add an explicit socket.connect() to fix that:

if (isFatalError) {
  // At that stage socket.connected=false already,
  // but we want to stop reconnections for fatal errors
  socket.disconnect();
} else {
  socket.connect();
}

If we want to persist the current not reconnect logic, we can add unresolvable geoip to fatalConnectErrors list. But such probes are reconnected only once a week, after a restart.

@MartinKolarik
Copy link
Member

The error is changed to failed to collect probe metadata at

let message = 'failed to collect probe metadata';

and that one is on the list, so I think that works as originally intended?

My point still is that it might be better to not drop probes forever, only apply some long-ish timeout.

@alexey-yarmosh
Copy link
Member Author

alexey-yarmosh commented Jul 10, 2023

failed to collect probe metadata is replaced with unresolvable geoip: x.x.x.x here.

Size of delay surely can be set to any value, we just have 7 days by default.

@MartinKolarik
Copy link
Member

I see. Let's fix the reconnecting and reduce the timeout, then.

@alexey-yarmosh
Copy link
Member Author

Do you think it should be reduced for all fatalConnectErrors or just for unresolvable geo?

@MartinKolarik
Copy link
Member

I'd say all. 1 hour should be enough to prevent the load caused by constant reconnecting and, at the same time, reasonably short to fix any temporary failures.

@alexey-yarmosh
Copy link
Member Author

I'd say all. 1 hour should be enough to prevent the load caused by constant reconnecting and, at the same time, reasonably short to fix any temporary failures.

Adding that logic revealed an #390 issue. So it will be implemented along with the fix. But current PR is safe to merge and deploy right now, so lets do that.

@MartinKolarik MartinKolarik merged commit 99882e9 into master Jul 11, 2023
2 checks passed
@MartinKolarik MartinKolarik deleted the geoip-cache branch July 11, 2023 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Probes dont reconnect after API returns 503 errors
2 participants