fix: retry peer starting #13
Conversation
```typescript
this.startPeerJwsRefreshLoop();
this.deps.logger.debug('certs and token are ready.');
}
this.startPeer();
```
I think we need `await` here. Otherwise, we start listening to incoming requests without being able to process them (we don't have the required state: valid certs for TLS and an access token).
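A minimal sketch of the suggestion above. The names (`initCertsAndToken`, `startPeer`) are illustrative stand-ins, not the project's actual API; the point is only the ordering that the `await` enforces:

```typescript
async function start(): Promise<string[]> {
  const events: string[] = [];

  // Stand-in for the real cert/token bootstrap (TLS material + access token).
  const initCertsAndToken = async (): Promise<void> => {
    await new Promise((resolve) => setTimeout(resolve, 10));
    events.push('certs and token are ready');
  };
  const startPeer = (): void => {
    events.push('peer listening');
  };

  // The suggested fix: await the bootstrap before listening, so the server
  // never accepts a request it cannot yet serve.
  await initCertsAndToken();
  startPeer();
  return events;
}

start().then((events) => console.log(events.join(' -> ')));
```

Without the `await`, `startPeer()` would run before the bootstrap completes and the listener would be up with no usable state.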
```typescript
init('A');
init('B');
```
await?
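A sketch of what the "await?" suggestion would look like: wait for both peer initializations instead of firing them off and continuing. `init` here is a stub; the real one presumably loads certs and an access token per peer:

```typescript
type Peer = 'A' | 'B';

// Stub: stands in for loading certs and fetching an access token per peer.
async function init(which: Peer): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return `peer ${which} ready`;
}

async function main(): Promise<string[]> {
  // Await both initializations (rather than fire-and-forget), so neither
  // peer is considered started before it has its required state.
  return Promise.all([init('A'), init('B')]);
}

main().then((results) => console.log(results.join(', ')));
```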
```typescript
this.deps.logger.info('Certs and token are ready.', { peer: which });

} catch (error) {
  if (this.retryAgents) this[`timeout${which}`] = setTimeout(() => init(which), 60000);
```
To be honest, I don't like this approach. Let me try to explain why.

In case of a failure in the `init` function, you suggest starting ISPA in any case, right? So here I see 3 problems:
- For at least close to a minute, all incoming requests will fail, because we don't have the proper internal state (no certs or access token). What is the reason to start the server if it's "not ready"?
- Things might get worse if the issue in `init` is permanent (e.g. a misconfiguration): the server will be stuck in the retry loop forever, all incoming requests will fail, but we won't see it. It's a zombie process - it seems to be working, but can't do anything.
- Let's imagine we made some changes and unintentionally introduced an issue which causes `init` to fail. In the current approach, k8s will try to spin up a new container, see it fail, and won't delete the previous version. But in your approach we may easily introduce a "regression" of the service.

Also we need to keep in mind that for the proxy flow to work, we need both proxy servers (A and B) up and running with proper internal state.

To sum up my point: we shouldn't accept incoming requests unless both proxy servers have the required internal state (certs and access token). And of course, we should have monitoring and alerting on all pod statuses; a CrashLoopBackOff should trigger a notification, and we have to react accordingly.
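The fail-fast position above could be sketched like this (names are illustrative, not the project's code): if either peer's `init` fails, the process exits instead of retrying silently, so the orchestrator restarts it and the failure is visible as CrashLoopBackOff rather than a zombie pod:

```typescript
type Peer = 'A' | 'B';

// Stand-in for loading certs and fetching an access token for one peer.
// The PEER_*_BROKEN env vars are a demo knob to simulate a permanent failure.
async function init(which: Peer): Promise<void> {
  if (process.env[`PEER_${which}_BROKEN`]) {
    throw new Error(`init for peer ${which} failed`);
  }
}

async function start(): Promise<string> {
  // Any init failure rejects here: no listener is ever opened in a broken state.
  await Promise.all([init('A'), init('B')]);
  return 'both peers ready, accepting requests';
}

start()
  .then((msg) => console.log(msg))
  .catch((err) => {
    console.error('fatal: required state is missing', err);
    process.exit(1); // let k8s restart the pod and surface CrashLoopBackOff
  });
```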
The problem is with Argo CD, which will stop any subsequent deployments if the proxy is crashing. This means we cannot fully deploy the regional hub until all buffer schemes are working. Also, if some buffer scheme is down, this may create problems in the regional hub. Not starting the proxy will also result in an error, but a generic one, like "service not available" or similar. It may be better to start the proxy and respond with a better error, one that includes details about what exactly is missing in the proxy. We can put a limit on the retries to avoid the potential zombie case.
"we shouldn't accept incoming requests unless both proxy servers have the required internal state" - isn't it better to accept them and return a proper error, instead of crash looping, where it is not very clear what the problem is?
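A sketch of the alternative proposed here: start listening immediately, but answer with a descriptive 503 until both peers have their certs and token. The readiness flags and response shape are assumptions for illustration, not the project's actual API:

```typescript
import * as http from 'http';
import { AddressInfo } from 'net';

// Hypothetical readiness flags, flipped to true once a peer's certs and
// access token are loaded.
const ready: Record<'A' | 'B', boolean> = { A: false, B: false };

const server = http.createServer((req, res) => {
  const missing = (['A', 'B'] as const).filter((p) => !ready[p]);
  if (missing.length > 0) {
    // Descriptive 503 instead of a generic "service not available".
    res.writeHead(503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'proxy not ready', missingPeers: missing }));
    return;
  }
  res.writeHead(200);
  res.end('ok');
});

// Demo: one request against the not-yet-ready server, then shut down.
let observedStatus = 0;
server.listen(0, () => {
  const { port } = server.address() as AddressInfo;
  http.get(`http://127.0.0.1:${port}/`, (res) => {
    observedStatus = res.statusCode ?? 0;
    console.log('status before readiness:', observedStatus);
    res.resume();
    res.on('end', () => server.close());
  });
});
```

The caller immediately learns *which* peer is missing, which is the visibility argument being made above.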
I shared my point. If the proxy doesn't have the proper state (certs or accessToken), it's clear it won't be able to serve any incoming requests. If so, the whole proxy flow with regional and buffer schemes is not working. I don't get why we need to allow a "broken" proxy to handle incoming requests and produce tons of error logs. Instead, we have to see problems with the proxy immediately (CrashLoopBackOff means the service has huge problems, please react ASAP), and not wait a minute (or more) and rely on the assumption that the issue will resolve "on its own".
Sorry, but it seems to me we're trying to solve one small issue (with ArgoCD) by introducing a more critical one (check my 3 points in the initial comment).
Implemented in #17.
This will not crash the server if a peer cannot be initialized, but will instead retry each peer independently every minute.
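A minimal sketch of that behaviour (not the actual #17 code): each peer's initialization failure schedules its own retry timer, so one peer's trouble never crashes the process or blocks the other peer. The transient-failure simulation and short interval are demo assumptions; the PR retries every minute:

```typescript
type Peer = 'A' | 'B';

const attempts: Record<Peer, number> = { A: 0, B: 0 };

// Stand-in for obtaining certs and an access token for one peer;
// simulates a transient failure on peer A's first attempt.
async function initPeer(which: Peer): Promise<void> {
  attempts[which] += 1;
  if (which === 'A' && attempts.A === 1) throw new Error('transient failure');
}

const timers: Partial<Record<Peer, ReturnType<typeof setTimeout>>> = {};

function startWithRetry(which: Peer, retryMs: number): void {
  initPeer(which)
    .then(() => console.log(`peer ${which} initialized (attempt ${attempts[which]})`))
    .catch(() => {
      // A failed peer schedules only its own retry: it never blocks the
      // other peer and never crashes the process.
      timers[which] = setTimeout(() => startWithRetry(which, retryMs), retryMs);
    });
}

// Short interval keeps the demo quick; the PR description says one minute.
startWithRetry('A', 50);
startWithRetry('B', 50);
```

Peer B comes up on its first attempt while peer A recovers on its second, which is exactly the independence the PR description claims.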