
fix: retry peer starting #13

Closed · wants to merge 2 commits

Conversation

@kalinkrustev (Collaborator) commented Jul 22, 2024

This will not crash the server if a peer cannot be initialized; instead, each peer will be retried independently every minute.
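A rough consolidation of that behaviour, for illustration. The class shape, `obtainCertsAndToken`, and the timeout bookkeeping are assumptions; only the `setTimeout`-based per-peer retry mirrors the actual diff fragments below.

```ts
type Peer = 'A' | 'B';

class PeerStarter {
  private timeouts: Partial<Record<Peer, NodeJS.Timeout>> = {};
  private retryAgents = true; // cleared on shutdown to stop further retries

  async init(which: Peer): Promise<void> {
    try {
      await this.obtainCertsAndToken(which);
      console.info('Certs and token are ready.', { peer: which });
    } catch (error) {
      // Don't crash the whole server; schedule an independent retry
      // for this peer in one minute.
      if (this.retryAgents) {
        this.timeouts[which] = setTimeout(() => this.init(which), 60_000);
      }
    }
  }

  private async obtainCertsAndToken(which: Peer): Promise<void> {
    // placeholder: fetch TLS certs and an access token for peer `which`
  }
}
```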

```ts
this.startPeerJwsRefreshLoop();
this.deps.logger.debug('certs and token are ready.');
}
this.startPeer();
```
Collaborator commented:

I think we need `await` here. Otherwise, we start listening for incoming requests without being able to process them (we don't have the required state: valid certs for TLS and an access token).
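For illustration, a minimal sketch of the awaited startup order. The class shape and stub bodies are assumptions; only the method names come from the diff context above.

```ts
class ProxyService {
  constructor(private deps: { logger: Console }) {}

  async start(): Promise<void> {
    // Wait until certs and token are in place before accepting any traffic.
    await this.startPeer();
    this.startPeerJwsRefreshLoop();
    this.deps.logger.debug('certs and token are ready.');
    // ...only now start listening for incoming requests
  }

  private async startPeer(): Promise<void> { /* obtain TLS certs + access token */ }
  private startPeerJwsRefreshLoop(): void { /* periodic peer JWS refresh */ }
}
```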

Comment on lines +98 to +99

```ts
init('A');
init('B');
```
Collaborator commented:

await?
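A small sketch of what the suggested awaits could look like, assuming `init` returns a promise: initialize both peers in parallel and fail fast if either cannot reach a ready state.

```ts
declare function init(which: 'A' | 'B'): Promise<void>;

// Initialize both peers before serving traffic; reject if either fails.
await Promise.all([init('A'), init('B')]);
```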

```ts
this.deps.logger.info('Certs and token are ready.', { peer: which });

} catch (error) {
  if (this.retryAgents) this[`timeout${which}`] = setTimeout(() => init(which), 60000);
```
@geka-evk (Collaborator) commented Jul 22, 2024

To be honest, I don't like this approach. Let me try to explain why.

In case of a failure in the init function, you suggest starting ISPA in any case, right? I see three problems here:

1. For at least close to a minute, all incoming requests will fail, because we don't have the proper internal state (no certs or access token). What is the point of starting the server if it's "not ready"?
2. Things get worse if the issue in init is permanent (e.g. a misconfiguration): the server will be stuck in the retry loop forever, all incoming requests will fail, and we won't see it. It's a zombie process: it appears to be working, but can't do anything.
3. Imagine we make some changes and unintentionally introduce an issue that causes init to fail. With the current approach, k8s will try to spin up a new container, see that it fails, and won't delete the previous version. With your approach, we could easily ship a regression of the service.

Also keep in mind that for the proxy flow to work, both proxy servers (A and B) must be up and running with the proper internal state.

To sum up my point: we shouldn't accept incoming requests unless both proxy servers have the required internal state (certs and access token). And of course we should have monitoring and alerting on all pod statuses, so that a CrashLoopBackOff triggers a notification and we react accordingly.
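For reference, a minimal sketch of such a readiness gate, assuming a plain Node HTTP server; all names here are illustrative and not taken from the PR.

```ts
import http from 'node:http';

// Flipped to true once the corresponding peer has its certs and access token.
const ready = { A: false, B: false };

const server = http.createServer((req, res) => {
  if (req.url === '/ready') {
    // k8s readiness probe: report ready only when BOTH proxy servers have state.
    const ok = ready.A && ready.B;
    res.writeHead(ok ? 200 : 503, { 'content-type': 'application/json' });
    res.end(JSON.stringify(ready));
    return;
  }
  // ...regular proxy handling would go here...
  res.writeHead(204);
  res.end();
});

server.listen(4000);
```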

@kalinkrustev (Collaborator, Author) commented:

The problem is with Argo CD, which will stop any subsequent deployments if the proxy is crashing. This means we cannot fully deploy the regional hub until all buffer schemes are working. Also, if some buffer scheme is down, this may create problems in the regional hub. Not starting the proxy will also result in an error, but a generic one, like "service not available" or similar. It may be better to start the proxy and respond with a more descriptive error that includes details about exactly what is missing in the proxy. We can put a limit on the retries to avoid the potential zombie case, e.g. as sketched below.

"we shouldn't accept incoming requests unless both proxy servers have the required internal state" - isn't it better to accept them and return a proper error, instead of crash looping where it is not very clear what the problem is?

@geka-evk (Collaborator) commented Jul 22, 2024

I shared my point. If the proxy doesn't have the proper state (certs or accessToken), it clearly won't be able to serve any incoming requests. If so, the whole proxy flow with the regional and buffer schemes is not working. I don't see why we should let a "broken" proxy handle incoming requests and produce tons of error logs. Instead, we should see problems with the proxy immediately (CrashLoopBackOff means the service has huge problems, please react ASAP), not wait a minute (or more) and rely on the assumption that the issue will resolve on its own.

Sorry, but it seems to me we're trying to solve one small issue (with ArgoCD) by introducing a more critical one (see my three points in the initial comment).

@kalinkrustev (Collaborator, Author) commented:

Implemented in #17
