Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redeploys or restarts never complete if ngrok sessions are limited to 1 #266

Open
jrobsonchase opened this issue Jul 3, 2023 · 3 comments
Labels
area/controller Issues dealing with the controller bug Something isn't working priority/medium

Comments

@jrobsonchase
Copy link
Contributor

What happened

A deploy or rollout restart causes a new replicaset to get created with the new pod spec in parallel with the old one. Once this is marked "Ready", the old one is terminated.

If agent sessions are limited to 1, as with free ngrok accounts today, this will never actually occur, since the ingress controller pod's rediness is contingent on the agent session starting successfully.

There are also no logs to indicate why the new pods get stuck.

What you think should happen instead

Hard to say. Maybe we should detect the "too many sessions" error, and allow the pod to ready up if it persists for a certain amount of time? Forcing the old one to close before the new one can start will result in a blip in uptime, so we still want to wait on the session to fully start if we can, but we still shouldn't block forever in the event that it never will.

How to reproduce

make deploy with a free account is sufficient, since it does a rollout restart to force a re-pull. You'll see two ingress controller pods and two replicasets. The deploy will list one as the "old" and the other as "new". The "new" one's pod will be stuck unready.

@jrobsonchase jrobsonchase added bug Something isn't working area/controller Issues dealing with the controller needs-triage Issues that need triage labels Jul 3, 2023
@alex-bezek
Copy link
Collaborator

Forcing the old one to close before the new one can start will result in a blip in uptime, so we still want to wait on the session to fully start if we can, but we still shouldn't block forever in the event that it never will.

Is it possible to wait for the new session to start before closing the old one if there is a limit of 1 agent session for free accounts? If you are limited to a single active session, it seems we'll always have to tear one down before creating the new one which will always have a small blip of downtime for free accounts.

We technically could make a helm configuration that allows free accounts to specify a Recreate Rollout Strategy instead of a Rolling Update, but the DevX for free accounts to remember to add a specific helm override isn't the greatest.

@jrobsonchase
Copy link
Contributor Author

Is it possible to wait for the new session to start before closing the old one if there is a limit of 1 agent session for free accounts? If you are limited to a single active session, it seems we'll always have to tear one down before creating the new one which will always have a small blip of downtime for free accounts.

That's basically what we're doing already, but there's no way to tell the first session that it's time to shut down while the new one is coming up. The new instance can start, but it'll never get marked Ready until its tunnel session starts, which means the original will never get told to shut down.

We could go ahead and mark it "Ready" without tunnel session readiness, which will shut down the original, and cause the unavoidable blip for free accounts, and a racey blip for accounts with higher limits but only a single controller instance. If there are multiple controller pods, and they don't all go out simultaneously, it should be fine. That's probably the simplest fix.

The other is to check the error that comes back from a failure to start the tunnel session, and go ahead and mark the pod as Ready if it's a limit error.

@devpro
Copy link

devpro commented Jan 9, 2024

Is it possible to update the Helm chart to be able to override the default spec.strategy.type (RollingUpdate -> Recreate)?

@alex-bezek alex-bezek added priority/medium and removed needs-triage Issues that need triage labels Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Issues dealing with the controller bug Something isn't working priority/medium
Projects
None yet
Development

No branches or pull requests

3 participants