
feat: Restart worker thread on uncaught exception #853

Closed

Conversation

morgsmccauley (Collaborator)

Currently, we terminate the worker thread if there is an uncaught exception. It was previously thought that uncaught exceptions were unrecoverable, but that is not always the case, e.g. uncaught PG connection failures can and should be retried. This PR updates that behaviour so the worker is restarted, rather than terminated.
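A minimal sketch of the intended restart-on-failure behaviour from the main thread (illustrative only; the function and file names are assumptions, not the actual runner code):

```typescript
import { Worker } from 'worker_threads';

// Illustrative sketch: restart the worker when it dies with an uncaught
// exception, instead of leaving it terminated.
function startWorker(scriptPath: string): void {
  const worker = new Worker(scriptPath);

  worker.on('error', (error) => {
    // Uncaught exceptions inside the worker surface here on the main thread.
    console.error('Worker threw an uncaught exception, restarting', error);
  });

  worker.on('exit', (code) => {
    if (code !== 0) {
      // Previously we would leave the worker terminated; instead, spin up a replacement.
      startWorker(scriptPath);
    }
  });
}

startWorker('./worker.js');
```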

morgsmccauley requested a review from a team as a code owner on July 3, 2024 at 02:38
@morgsmccauley (Collaborator, Author)

On second thought, this exact approach may not be the best idea. While it's true that some errors are transient and should be retried, some are not, and those will cause the worker to boot loop. Boot-looping workers will not be caught by our alerting, so this essentially masks those issues 🤔

@morgsmccauley (Collaborator, Author)

On third thought, this is probably the behaviour we want long term - Executors attempt to make progress, and Coordinator can measure that progress and officially stop the Executor if needed.

But how will we know if an Executor is truly broken?

@darunrs (Collaborator)

darunrs commented Jul 6, 2024

One idea: we could create a base Exception class and represent known error types as distinct exception subclasses. We could then enrich IndexerStatus to also expose the exception type. If it is a known type, we can handle it explicitly; if it is not, we can fall back to some default behavior (perhaps X retries before we stop). What do you think?
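A rough sketch of the kind of hierarchy being proposed (the class, enum, and field names are hypothetical, not part of the existing codebase):

```typescript
// Hypothetical base class: known failure modes become distinct error types.
enum WorkerErrorType {
  PG_CONNECTION = 'PG_CONNECTION', // transient, safe to retry
  USER_CODE = 'USER_CODE',         // likely needs the Indexer creator to fix
  UNKNOWN = 'UNKNOWN',             // fall back to a default retry policy
}

class WorkerError extends Error {
  constructor(
    message: string,
    public readonly type: WorkerErrorType = WorkerErrorType.UNKNOWN,
  ) {
    super(message);
  }
}

// IndexerStatus could then carry the error type for known failures, letting
// the caller decide whether to retry or stop.
interface EnrichedIndexerStatus {
  status: 'RUNNING' | 'FAILING' | 'STOPPED';
  errorType?: WorkerErrorType;
}
```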

Our worker restarts are fairly quick, all things considered, so I don't think retrying them is too problematic. There are only two actors that can fix a "broken" indexer: time and the developer. Time resolves transient errors, and the developer is either us or the Indexer creator. Exponential backoff between restarts is possible, with an update to the Indexer resetting the timer. If the error is in the Indexer, and not in our service, there's nothing we can do about it. Plus, we're dealing with downed workers, not just failing workers. I think it's uncommon for an Executor to be consistently taken down by user logic, right?
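The exponential restart idea could look something like the sketch below (illustrative; the delay values and the reset trigger are assumptions):

```typescript
// Hypothetical backoff policy: the delay doubles on every consecutive crash
// and resets when the Indexer code is updated by its creator.
class RestartBackoff {
  private attempts = 0;

  nextDelayMs(): number {
    const delayMs = Math.min(1_000 * 2 ** this.attempts, 5 * 60 * 1_000); // cap at 5 minutes
    this.attempts += 1;
    return delayMs;
  }

  reset(): void {
    // Called when the Indexer is updated, giving it a fresh start.
    this.attempts = 0;
  }
}
```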

@morgsmccauley (Collaborator, Author)

One idea: we could create a base Exception class and represent known error types as distinct exception subclasses. We could then enrich IndexerStatus to also expose the exception type. If it is a known type, we can handle it explicitly; if it is not, we can fall back to some default behavior (perhaps X retries before we stop). What do you think?

I've thought of this too; my only concern is attempting to wrap every known error, which I'm not sure is possible, especially when you take async errors into consideration. Those won't be caught by our traditional error handlers, and therefore can't be wrapped.

If the error is in the Indexer, and not in our service, there's nothing we can do about it. Plus, we're dealing with downed workers, not just failing workers. I think it's uncommon for an Executor to be consistently taken down by user logic, right?

User logic can still terminate the thread, e.g. via failing non-awaited promises, and I think syntax errors in general. It's hard to distinguish these from genuine infrastructure errors, which is what I'm struggling with in this PR 🤔
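For example, an un-awaited promise that rejects inside the worker escapes any surrounding try/catch and takes the thread down as an uncaught error (illustrative snippet, assuming Node's default behaviour of crashing on unhandled rejections):

```typescript
// Worker-side user logic, illustrative only.
async function flaky(): Promise<void> {
  throw new Error('user logic failed');
}

try {
  // Missing `await`: the rejection happens outside this try/catch and becomes
  // an unhandled rejection, terminating the worker thread.
  flaky();
} catch (error) {
  // Never reached.
  console.error('handled', error);
}
```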

@darunrs (Collaborator)

darunrs commented Jul 11, 2024

Yeah, you are correct. What makes this hard is that the error is caught by the main thread after passing through the worker thread boundary, meaning it gets serialized, stripping type information and other things. So it's pretty hard to attach context, especially since this is considered abnormal behavior and thus usually lacks context to begin with. I've found we generally don't investigate these issues deeply anyway for that reason; we usually just restart the Indexer. Maybe this is problematic behavior and adding auto restarts further reinforces it, but I think our service is just really vulnerable to async errors due to our usage of worker threads. Since Indexer uptime is one of our team goals, I think we should proceed with automatic retries, given we have the following approach available for deeper analysis:

  1. We have a metric on STOPPED indexer status counts (a minimal sketch of such a counter follows at the end of this comment).
  2. We can use the above metric to accurately tell WHEN an indexer went down.
  3. We can look at the GCP logs at any time, since the metric graph tells us when the indexer died and thus where to look in the logs.
  4. If we see any problematic patterns, such as oscillating STOPPED counts, we investigate deeper with high priority.

This allows us to hit our uptime goals while still giving us the ability to do a retro easily with enough information to maybe get to a root cause.
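A minimal sketch of the counter described in point 1 above (assuming a prom-client style metric; the metric name, label, and hook are illustrative, not the runner's actual metrics):

```typescript
import { Counter } from 'prom-client';

// Illustrative counter tracking how often Executors enter the STOPPED status,
// so oscillating counts show up on a dashboard.
const stoppedExecutors = new Counter({
  name: 'executor_stopped_total',
  help: 'Number of times an Executor entered the STOPPED status',
  labelNames: ['indexer'],
});

// Hypothetical hook: call this wherever the status transition is recorded.
function recordStopped(indexerName: string): void {
  stoppedExecutors.labels(indexerName).inc();
}
```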

@morgsmccauley (Collaborator, Author)

Superseded by #891 - Coordinator handles this now.
