Improved crash and error recovery in the asset daemon (#17344)
Summary: This PR adds logic to the asset daemon to ensure that if it crashes or raises an exception in the middle of a tick, subsequent ticks resume where they left off without launching duplicate runs. It allows us to submit runs one by one as they are computed, rather than computing the snapshots for every single run in the tick before launching any of them (in case there is a failure partway through).

The way that we do this safely is by:

- First computing the evaluations, getting a bunch of run requests out
- Storing those run requests on the tick, along with a bunch of reserved run IDs
- Before each run is launched, double-checking that it wasn't already launched
- Writing asset evaluations before the runs are launched (without the run IDs) and again afterwards (including the run IDs), so that we don't lose evaluations if there are crashes

To account for crashing partway through, the daemon first looks at the state of the most recent tick. If it crashed (i.e. it is STARTED) or raised an exception (i.e. it is FAILED), it pulls the in-progress run requests and cursor off of that tick and picks up where it left off. The error case only retries a certain number of times before writing the cursor anyway and moving on (this is a change from the current behavior, but it tolerates a certain number of transient errors before we move on).

Much like with the scheduler, a "UserCodeUnreachableError" (i.e. a code server is down, or the Dagster Cloud agent is down) causes ticks to retry indefinitely until the code is available again.

Still a bit more testing to add here, but I figured it was ready for any initial feedback.
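The recovery flow described in the summary can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from this PR: `Tick`, `FakeInstance`, `start_tick`, `submit_runs`, and `resume_or_start` are hypothetical names invented for the example, and the sketch leaves out the asset evaluation writes, the retry limit for FAILED ticks, and the `UserCodeUnreachableError` handling.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class TickStatus(Enum):
    STARTED = "STARTED"
    FAILED = "FAILED"
    SUCCESS = "SUCCESS"


@dataclass
class Tick:
    """Hypothetical tick record: run requests are stored with reserved run IDs."""
    status: TickStatus
    cursor: str
    run_requests: List[str]
    reserved_run_ids: List[str]


class FakeInstance:
    """Hypothetical in-memory stand-in for run storage."""

    def __init__(self) -> None:
        self.runs: Dict[str, str] = {}

    def has_run(self, run_id: str) -> bool:
        return run_id in self.runs

    def launch(self, run_id: str, request: str) -> None:
        self.runs[run_id] = request


def start_tick(cursor: str, run_requests: List[str]) -> Tick:
    # Reserve one run ID per request up front and persist both on the tick.
    run_ids = [str(uuid.uuid4()) for _ in run_requests]
    return Tick(TickStatus.STARTED, cursor, list(run_requests), run_ids)


def submit_runs(tick: Tick, instance: FakeInstance,
                crash_after: Optional[int] = None) -> None:
    # crash_after is a test-only knob that simulates a crash mid-tick.
    launched = 0
    for run_id, request in zip(tick.reserved_run_ids, tick.run_requests):
        if instance.has_run(run_id):
            continue  # already launched by an earlier attempt; skip it
        if crash_after is not None and launched >= crash_after:
            raise RuntimeError("simulated crash")
        instance.launch(run_id, request)
        launched += 1
    tick.status = TickStatus.SUCCESS


def resume_or_start(latest: Optional[Tick], cursor: str,
                    compute_run_requests: Callable[[], List[str]]) -> Tick:
    # A STARTED or FAILED tick still carries its run requests, so reuse it.
    if latest is not None and latest.status in (TickStatus.STARTED, TickStatus.FAILED):
        return latest
    return start_tick(cursor, compute_run_requests())
```

Because the run IDs are reserved before anything is launched, a resumed tick can check each ID against run storage and launch only the runs that are missing, which is what keeps a retry from producing duplicates in this sketch.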
Showing 7 changed files with 1,008 additions and 315 deletions.