
feat: Restart worker thread on uncaught exception #853

Closed

Conversation

morgsmccauley (Collaborator)

Currently, we terminate the worker thread if there is an uncaught exception. It was previously thought that uncaught exceptions were unrecoverable, but that is not always the case, e.g. uncaught PG connection failures can and should be retried. This PR updates that behaviour so the worker is restarted, rather than terminated.
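A minimal sketch of the intended restart-on-failure behaviour from the main thread (illustrative only; the function and file names are assumptions, not the actual runner code):

```typescript
import { Worker } from 'worker_threads';

// Illustrative sketch: restart the worker when it dies with an uncaught
// exception, instead of leaving it terminated.
function startWorker(scriptPath: string): void {
  const worker = new Worker(scriptPath);

  worker.on('error', (error) => {
    // Uncaught exceptions inside the worker surface here on the main thread.
    console.error('Worker threw an uncaught exception, restarting', error);
  });

  worker.on('exit', (code) => {
    if (code !== 0) {
      // Previously we would leave the worker terminated; instead, spin up a replacement.
      startWorker(scriptPath);
    }
  });
}

startWorker('./worker.js');
```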

morgsmccauley requested a review from a team as a code owner on July 3, 2024 at 02:38
@morgsmccauley (Collaborator, Author)

On second thought, this exact approach may not be the best idea. While it's true that some errors are transient and should be retried, some are not, and those will cause the worker to boot loop. Boot-looping workers will not be caught by our alerting, so this essentially masks those issues 🤔

@morgsmccauley (Collaborator, Author)

On third thought, this is probably the behaviour we want long term - Executors attempt to make progress, and Coordinator can measure that progress and officially stop the Executor if needed.

But how will we know if an Executor is truly broken?

@darunrs (Collaborator)

darunrs commented Jul 6, 2024

One idea: we could create a base Exception class and represent known error types as distinct exception subclasses. We could then enrich IndexerStatus to also expose the exception type. If it is a known type, we can handle it explicitly; if it is not, we can fall back to some default behavior (perhaps X retries before we stop). What do you think?
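A rough sketch of the kind of hierarchy being proposed (the class, enum, and field names are hypothetical, not part of the existing codebase):

```typescript
// Hypothetical base class: known failure modes become distinct error types.
enum WorkerErrorType {
  PG_CONNECTION = 'PG_CONNECTION', // transient, safe to retry
  USER_CODE = 'USER_CODE',         // likely needs the Indexer creator to fix
  UNKNOWN = 'UNKNOWN',             // fall back to a default retry policy
}

class WorkerError extends Error {
  constructor(
    message: string,
    public readonly type: WorkerErrorType = WorkerErrorType.UNKNOWN,
  ) {
    super(message);
  }
}

// IndexerStatus could then carry the error type for known failures, letting
// the caller decide whether to retry or stop.
interface EnrichedIndexerStatus {
  status: 'RUNNING' | 'FAILING' | 'STOPPED';
  errorType?: WorkerErrorType;
}
```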

Our worker restarts are fairly quick, all things considered, so I don't think retrying them is too problematic. There are only two actors that can fix a "broken" indexer: time and the developer. Time resolves transient errors, and the developer is either us or the Indexer creator. Exponential backoff between restarts is possible, with an update to the Indexer resetting the timer. If the error is in the Indexer, and not in our service, there's nothing we can do about it. Plus, we're dealing with downed workers, not just failing workers. I think it's uncommon for an Executor to be consistently taken down by user logic, right?
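The exponential restart idea could look something like the sketch below (illustrative; the delay values and the reset trigger are assumptions):

```typescript
// Hypothetical backoff policy: the delay doubles on every consecutive crash
// and resets when the Indexer code is updated by its creator.
class RestartBackoff {
  private attempts = 0;

  nextDelayMs(): number {
    const delayMs = Math.min(1_000 * 2 ** this.attempts, 5 * 60 * 1_000); // cap at 5 minutes
    this.attempts += 1;
    return delayMs;
  }

  reset(): void {
    // Called when the Indexer is updated, giving it a fresh start.
    this.attempts = 0;
  }
}
```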

@morgsmccauley (Collaborator, Author)

One idea: we could create a base Exception class and represent known error types as distinct exception subclasses. We could then enrich IndexerStatus to also expose the exception type. If it is a known type, we can handle it explicitly; if it is not, we can fall back to some default behavior (perhaps X retries before we stop). What do you think?

I've thought of this too; my only concern is attempting to wrap every known error, which I'm not sure is possible, especially when you take async errors into consideration. Those won't be caught by our traditional error handlers, and therefore can't be wrapped.

If the error is in the Indexer, and not in our service, there's nothing we can do about it. Plus, we're dealing with downed workers, not just failing workers. I think it's uncommon for an Executor to be consistently taken down by user logic, right?

User logic can still terminate the thread, e.g. via failing non-awaited promises, and I think syntax errors in general. It's hard to distinguish these from genuine infrastructure errors, which is what I'm struggling with in this PR 🤔
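For example, an un-awaited promise that rejects inside the worker escapes any surrounding try/catch and takes the thread down as an uncaught error (illustrative snippet, assuming Node's default behaviour of crashing on unhandled rejections):

```typescript
// Worker-side user logic, illustrative only.
async function flaky(): Promise<void> {
  throw new Error('user logic failed');
}

try {
  // Missing `await`: the rejection happens outside this try/catch and becomes
  // an unhandled rejection, terminating the worker thread.
  flaky();
} catch (error) {
  // Never reached.
  console.error('handled', error);
}
```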

@darunrs (Collaborator)

darunrs commented Jul 11, 2024

Yeah, you are correct. What makes this hard is that the error is caught by the main thread after passing through the worker thread boundary, meaning it gets serialized, stripping type information and other things. So it's pretty hard to attach context, especially since this is considered abnormal behavior and thus usually lacks context to begin with. I've found we generally don't investigate these issues deeply anyway for that reason; we usually just restart the Indexer. Maybe this is problematic behavior and adding auto restarts further reinforces it, but I think our service is just really vulnerable to async errors due to our usage of worker threads. Since Indexer uptime is one of our team goals, I think we should proceed with automatic retries, given we have the following approach available for deeper analysis:

  1. We have a metric on STOPPED indexer status counts (a minimal sketch of such a counter follows at the end of this comment).
  2. We can use the above metric to accurately tell WHEN an indexer went down.
  3. We can look at the GCP logs at any time, since the metric graph tells us when the indexer died and thus where to look in the logs.
  4. If we see any problematic patterns, such as oscillating STOPPED counts, we investigate deeper with high priority.

This allows us to hit our uptime goals while still giving us the ability to do a retro easily with enough information to maybe get to a root cause.
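A minimal sketch of the counter described in point 1 above (assuming a prom-client style metric; the metric name, label, and hook are illustrative, not the runner's actual metrics):

```typescript
import { Counter } from 'prom-client';

// Illustrative counter tracking how often Executors enter the STOPPED status,
// so oscillating counts show up on a dashboard.
const stoppedExecutors = new Counter({
  name: 'executor_stopped_total',
  help: 'Number of times an Executor entered the STOPPED status',
  labelNames: ['indexer'],
});

// Hypothetical hook: call this wherever the status transition is recorded.
function recordStopped(indexerName: string): void {
  stoppedExecutors.labels(indexerName).inc();
}
```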

@morgsmccauley (Collaborator, Author)

Superseded by #891 - Coordinator handles this now.
