-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Restart worker thread on uncaught exception #853
Conversation
On second thought, this exact approach may not be the best idea. While its true that some errors are transient and should be retried, but same are not, and will cause the worker to boot loop. Boot looping workers will not be caught by our alerting, so this essentially masquerades those issues 🤔 |
On third thought, this is probably the behaviour we want long term - Executors attempt to make progress, and Coordinator can measure that progress and officially stop the Executor if needed. But how will we know if an Executor is truely broken? |
One idea is we can create a base Exception class and represent any known error types as different types of exceptions. We can then enrich the IndexerStatus to also provide just the exception type. If it is a known type, we can deal with it explicitly. If it is not a known type, we can configure some default behavior (Perhaps X retries before we stop). What do you think? Our restarts of worker are fairly quick all things considered, so I think us retrying them isn't too problematic. There's only two actors which can fix an "broken" indexer: Time and the Developer. Time being transient errors, and Developers being Us or the Indexer creator. Exponential restarts is possible, with an update to the Indexer resetting the timer. If the error is the Indexer, and not our service, there's nothing we can do about it. Plus, we're dealing with downed workers, not just failing workers. I think it's uncommon for an Executor to consistently get downed by user logic right? |
I've thought of this to, my only concern is attempting to wrap every known error, which I'm not sure is possible. Especially when you take async errors in to consideration. This won't be caught by our traditional error handlers, and therefore can't be wrapped.
User logic can still terminate the thread, e.g. failing/non-awaited promises, and syntax errors in general I think. Its hard to distinguish these from real errors, which is what I'm struggling with in this PR 🤔 |
Yeah you are correct. What makes this hard is that this error is caught by the main thread after passing through the worker thread boundary, meaning it gets serialized, stripping type information and other things. So, it's pretty hard to attach context, especially since this is considered abnormal behavior and thus usually lacks context to begin with. I've found we generally don't deeply investigate these issues deeply anyway due to that reason. We usually just restart the Indexer. Maybe this is problematic behavior and adding auto restarts is further reinforcing this, but I think our service is just really vulnerable to async errors, due to our usage of worker threads. Since Indexer uptime is one of our team goals, I think we should proceed with automatic retries given we have the following approach available for deeper analysis:
This allows us to hit our uptime goals while still giving us the ability to do a retro easily with enough information to maybe get to a root cause. |
Superseded by #891 - Coordinator handles this now. |
Currently, we terminate the worker thread if there is an uncaught exception. It was previously thought that uncaught exceptions were unrecoverable, but that is not always the case, e.g. uncaught PG connection failures can and should be retried. This PR updates that behaviour so the worker is restarted, rather than terminated.