fix: retry peer starting #13
Conversation
```typescript
this.startPeerJwsRefreshLoop();
this.deps.logger.debug('certs and token are ready.');
}
this.startPeer();
```
I think we need `await` here. Otherwise, we start listening to incoming requests without being able to process them (we don't have the required state: valid certs for TLS and an access token).
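A minimal sketch of the suggestion above. The names (`initCertsAndToken`, `startPeer`) are illustrative stand-ins, not the project's actual API; the point is only the ordering that the `await` enforces:

```typescript
async function start(): Promise<string[]> {
  const events: string[] = [];

  // Stand-in for the real cert/token bootstrap (TLS material + access token).
  const initCertsAndToken = async (): Promise<void> => {
    await new Promise((resolve) => setTimeout(resolve, 10));
    events.push('certs and token are ready');
  };
  const startPeer = (): void => {
    events.push('peer listening');
  };

  // The suggested fix: await the bootstrap before listening, so the server
  // never accepts a request it cannot yet serve.
  await initCertsAndToken();
  startPeer();
  return events;
}

start().then((events) => console.log(events.join(' -> ')));
```

Without the `await`, `startPeer()` would run before the bootstrap completes and the listener would be up with no usable state.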
```typescript
init('A');
init('B');
```
await?
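A sketch of what the "await?" suggestion would look like: wait for both peer initializations instead of firing them off and continuing. `init` here is a stub; the real one presumably loads certs and an access token per peer:

```typescript
type Peer = 'A' | 'B';

// Stub: stands in for loading certs and fetching an access token per peer.
async function init(which: Peer): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return `peer ${which} ready`;
}

async function main(): Promise<string[]> {
  // Await both initializations (rather than fire-and-forget), so neither
  // peer is considered started before it has its required state.
  return Promise.all([init('A'), init('B')]);
}

main().then((results) => console.log(results.join(', ')));
```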
```typescript
this.deps.logger.info('Certs and token are ready.', { peer: which });

} catch (error) {
  if (this.retryAgents) this[`timeout${which}`] = setTimeout(() => init(which), 60000);
```
To be honest, I don't like this approach. Let me try to explain why.

In case of a failure in the `init` function, you suggest starting ISPA in any case, right? So here I see 3 problems:
- For at least close to a minute, all incoming requests will fail, because we don't have the proper internal state (no certs or access token). What is the reason to start the server if it's "not ready"?
- Things might get worse if the issue in `init` is permanent (e.g. a misconfiguration): the server will be stuck in the retry loop forever, all incoming requests will fail, but we won't see it. It's a zombie process - it seems to be working, but can't do anything.
- Let's imagine we made some changes and unintentionally introduced an issue which causes `init` to fail. In the current approach, k8s will try to spin up a new container, see it fail, and won't delete the previous version. But in your approach we may easily introduce a "regression" of the service.

Also we need to keep in mind that for the proxy flow to work, we need both proxy servers (A and B) up and running with proper internal state.

To sum up my point: we shouldn't accept incoming requests unless both proxy servers have the required internal state (certs and access token). And of course, we should have monitoring and alerting on all pod statuses; a CrashLoopBackOff should trigger a notification, and we have to react accordingly.
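The fail-fast position above could be sketched like this (names are illustrative, not the project's code): if either peer's `init` fails, the process exits instead of retrying silently, so the orchestrator restarts it and the failure is visible as CrashLoopBackOff rather than a zombie pod:

```typescript
type Peer = 'A' | 'B';

// Stand-in for loading certs and fetching an access token for one peer.
// The PEER_*_BROKEN env vars are a demo knob to simulate a permanent failure.
async function init(which: Peer): Promise<void> {
  if (process.env[`PEER_${which}_BROKEN`]) {
    throw new Error(`init for peer ${which} failed`);
  }
}

async function start(): Promise<string> {
  // Any init failure rejects here: no listener is ever opened in a broken state.
  await Promise.all([init('A'), init('B')]);
  return 'both peers ready, accepting requests';
}

start()
  .then((msg) => console.log(msg))
  .catch((err) => {
    console.error('fatal: required state is missing', err);
    process.exit(1); // let k8s restart the pod and surface CrashLoopBackOff
  });
```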
The problem is with Argo CD, which will stop any subsequent deployments if the proxy is crashing. This means we cannot fully deploy the regional hub until all buffer schemes are working. Also, if some buffer scheme is down, this may create problems in the regional hub. Not starting the proxy will also result in an error, but a generic one, like "service not available" or similar. It may be better to start the proxy and respond with a better error, one that includes details about what exactly is missing in the proxy. We can put a limit on the retries to avoid the potential zombie case.
"we shouldn't accept incoming requests unless both proxy servers have the required internal state" - isn't it better to accept them and return a proper error, instead of crash looping, where it is not very clear what the problem is?
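A sketch of the alternative proposed here: start listening immediately, but answer with a descriptive 503 until both peers have their certs and token. The readiness flags and response shape are assumptions for illustration, not the project's actual API:

```typescript
import * as http from 'http';
import { AddressInfo } from 'net';

// Hypothetical readiness flags, flipped to true once a peer's certs and
// access token are loaded.
const ready: Record<'A' | 'B', boolean> = { A: false, B: false };

const server = http.createServer((req, res) => {
  const missing = (['A', 'B'] as const).filter((p) => !ready[p]);
  if (missing.length > 0) {
    // Descriptive 503 instead of a generic "service not available".
    res.writeHead(503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'proxy not ready', missingPeers: missing }));
    return;
  }
  res.writeHead(200);
  res.end('ok');
});

// Demo: one request against the not-yet-ready server, then shut down.
let observedStatus = 0;
server.listen(0, () => {
  const { port } = server.address() as AddressInfo;
  http.get(`http://127.0.0.1:${port}/`, (res) => {
    observedStatus = res.statusCode ?? 0;
    console.log('status before readiness:', observedStatus);
    res.resume();
    res.on('end', () => server.close());
  });
});
```

The caller immediately learns *which* peer is missing, which is the visibility argument being made above.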
I shared my point. If the proxy doesn't have the proper state (certs or accessToken), it's clear it won't be able to serve any incoming requests. If so, the whole proxy flow with regional and buffer schemes is not working. I don't get why we need to allow a "broken" proxy to handle incoming requests and produce tons of error logs. Instead, we have to see problems with the proxy immediately (CrashLoopBackOff means the service has huge problems, please react ASAP), and not wait a minute (or more) and rely on the assumption that the issue will resolve "on its own".
Sorry, but it seems to me we're trying to solve one small issue (with ArgoCD) by introducing a more critical one (check my 3 points in the initial comment).
Implemented in #17.
This will not crash the server if a peer cannot be initialized, but will instead retry each peer independently every minute.
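A minimal sketch of that behaviour (not the actual #17 code): each peer's initialization failure schedules its own retry timer, so one peer's trouble never crashes the process or blocks the other peer. The transient-failure simulation and short interval are demo assumptions; the PR retries every minute:

```typescript
type Peer = 'A' | 'B';

const attempts: Record<Peer, number> = { A: 0, B: 0 };

// Stand-in for obtaining certs and an access token for one peer;
// simulates a transient failure on peer A's first attempt.
async function initPeer(which: Peer): Promise<void> {
  attempts[which] += 1;
  if (which === 'A' && attempts.A === 1) throw new Error('transient failure');
}

const timers: Partial<Record<Peer, ReturnType<typeof setTimeout>>> = {};

function startWithRetry(which: Peer, retryMs: number): void {
  initPeer(which)
    .then(() => console.log(`peer ${which} initialized (attempt ${attempts[which]})`))
    .catch(() => {
      // A failed peer schedules only its own retry: it never blocks the
      // other peer and never crashes the process.
      timers[which] = setTimeout(() => startWithRetry(which, retryMs), retryMs);
    });
}

// Short interval keeps the demo quick; the PR description says one minute.
startWithRetry('A', 50);
startWithRetry('B', 50);
```

Peer B comes up on its first attempt while peer A recovers on its second, which is exactly the independence the PR description claims.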