[asset backfill] Pause when code server is down #19494

Merged

Conversation

clairelin135
Contributor

@clairelin135 clairelin135 commented Jan 30, 2024

Fixes #19484.

Users observed backfill failures caused by the user code server being temporarily unreachable when creating runs. For resiliency, the ideal behavior in this case is to pause the backfill and reevaluate it on the next iteration, allowing other backfills to be evaluated in the meantime.

This PR adds the following behavior whenever attempting to create a run raises a code-location-unloadable error:

  • Sets the cursor back to the prior iteration's cursor value. This allows the next iteration to reevaluate the same materializations as the current iteration, so downstreams of those newly materialized assets can be kicked off.
  • Exits the backfill iteration early, skipping submission of the subsequent runs.

Existing asset backfill idempotency logic guarantees that if a target partition is already requested, it will not be re-requested.
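
For illustration, a minimal sketch of that control flow. All names here (submit_runs_or_pause, BackfillCursorState, CodeLocationUnreachableError, submit_one) are hypothetical stand-ins, not the actual Dagster implementation:

from typing import Callable, List, NamedTuple, Optional, Sequence


class CodeLocationUnreachableError(Exception):
    """Stand-in for the error raised when the user code server is unreachable."""


class BackfillCursorState(NamedTuple):
    # Illustrative stand-in for the cursor portion of the backfill state.
    latest_storage_id: Optional[int]
    submitted: Sequence[str]


def submit_runs_or_pause(
    run_requests: Sequence[str],
    previous_storage_id: Optional[int],
    new_storage_id: Optional[int],
    submit_one: Callable[[str], None],  # hypothetical helper that creates one run
) -> BackfillCursorState:
    """Submit runs; if the code server becomes unreachable, rewind the cursor
    to the previous iteration's value and exit early so the next iteration
    re-evaluates the same events. Already-requested partitions are skipped on
    the retry by the existing idempotency checks (not shown here)."""
    submitted: List[str] = []
    for request in run_requests:
        try:
            submit_one(request)
        except CodeLocationUnreachableError:
            return BackfillCursorState(previous_storage_id, submitted)
        submitted.append(request)
    return BackfillCursorState(new_storage_id, submitted)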

@clairelin135
Contributor Author

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@clairelin135 clairelin135 force-pushed the 01-30-claire/asset-backfill-resiliency-code-server-down branch from 8121c76 to 017b4c8 Compare January 30, 2024 21:18
@clairelin135 clairelin135 marked this pull request as ready for review January 30, 2024 21:36
Member

@gibsondan gibsondan left a comment

I put a couple of thoughts inline, but this makes sense to me. Not sure if @sryza wants to make a pass too.

The automated test here looks good. I think doing a manual test once would be a good idea as well, to make sure this performs as expected.

For example: start a backfill (you can add some sleeps strategically to make it longer), shut down the code server in the middle of the backfill, then start it back up and verify that the backfill resumes as expected.

@@ -155,6 +155,17 @@ def replace_requested_subset(self, requested_subset: AssetGraphSubset) -> "Asset
    backfill_start_time=self.backfill_start_time,
)

def replace_latest_storage_id(self, latest_storage_id: Optional[int]) -> "AssetBackfillData":
Member

maybe with_latest_storage_id?

Comment on lines 159 to 163
return AssetBackfillData(
    target_subset=self.target_subset,
    latest_storage_id=latest_storage_id,
    requested_runs_for_target_roots=self.requested_runs_for_target_roots,
    materialized_subset=self.materialized_subset,
    failed_and_downstream_subset=self.failed_and_downstream_subset,
    requested_subset=self.requested_subset,
    backfill_start_time=self.backfill_start_time,
)
Member

return self._replace(latest_storage_id=latest_storage_id)?

Contributor Author

ahh this is better
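
For context on why this works: AssetBackfillData is a NamedTuple in the surrounding code, and NamedTuple._replace returns a copy with only the named field changed. A minimal sketch with illustrative fields (the real class has more):

from typing import NamedTuple, Optional


class AssetBackfillDataSketch(NamedTuple):
    # Illustrative subset of fields; the real AssetBackfillData has several more.
    latest_storage_id: Optional[int]
    backfill_start_time: float

    def with_latest_storage_id(
        self, latest_storage_id: Optional[int]
    ) -> "AssetBackfillDataSketch":
        # _replace copies the tuple and swaps just this field, so the other
        # fields do not need to be listed out by hand.
        return self._replace(latest_storage_id=latest_storage_id)


data = AssetBackfillDataSketch(latest_storage_id=5, backfill_start_time=0.0)
assert data.with_latest_storage_id(7).latest_storage_id == 7
assert data.latest_storage_id == 5  # the original value is unchanged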

@@ -259,6 +263,11 @@ def submit_asset_run(
    return run_to_submit


class SubmitRunRequestChunkResult(NamedTuple):
    chunk_submitted_runs: Sequence[Tuple[RunRequest, DagsterRun]]
    code_unreachable_error_raised: bool
Member

could this also be called "retryable_error_raised"?

Contributor Author

yes, good call, will switch to this

Comment on lines +737 to +732
# Code server became unavailable mid-backfill. Rewind the cursor back to the cursor
# from the previous iteration, to allow next iteration to reevaluate the same
# events.
Member

do we need to do anything here to prevent the same backfill from iterating in a tight loop if this happens and it's always failing? or does the fact that the backfill daemon is an IntervalDaemon keep that from happening?

Member

just confirming - this does not stop other backfills from executing in parallel, right?

Contributor Author

Every 30s, the daemon searches for all in-progress backfills and loops through each one, executing an asset backfill iteration for each.

So if a backfill had a code location error, it would be re-evaluated on every daemon "loop". But on each loop, all the other backfills could still execute successfully.
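
A rough sketch of that daemon loop, with hypothetical names (the real scheduling is handled by Dagster's IntervalDaemon, not reproduced here):

import time
from typing import Callable, Iterable

INTERVAL_SECONDS = 30  # mirrors the "every 30s" interval described above


class CodeLocationUnreachableError(Exception):
    """Stand-in for the retryable error raised when a code server is down."""


def backfill_daemon_loop(
    get_in_progress_backfills: Callable[[], Iterable[str]],
    execute_iteration: Callable[[str], None],
) -> None:
    """Each tick evaluates every in-progress backfill independently, so one
    backfill hitting a code-location error does not block the others; it is
    simply retried on the next tick with its cursor unchanged."""
    while True:
        for backfill_id in get_in_progress_backfills():
            try:
                execute_iteration(backfill_id)
            except CodeLocationUnreachableError:
                continue  # cursor stays put; re-evaluated next tick
        time.sleep(INTERVAL_SECONDS)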

Comment on lines 711 to 717
if submit_run_request_chunk_result is None:
    # allow the daemon to heartbeat
    yield None
    continue

code_unreachable_error_raised = (
    submit_run_request_chunk_result.code_unreachable_error_raised
Member

This isn't necessarily related to this PR specifically, but I found the logic around this for loop pretty hard to wrap my head around - in particular the fact that it will only ever return None until it returns something, and then we can count on that something being the last iteration (believe I have that right?). Without that understanding, it seemed like things were going to keep going (within the same iteration) even after one of the runs failed to submit, but that is not the case - once the submissions either finish or raise this particular error, the iteration is over.

One way to make that clearer would be to move most of the logic here outside of the for loop, to make it clear that once submit_run_request_chunk_result is not None, the for loop is guaranteed to be finished.

Contributor Author

it will only ever return None until it returns something, then we can guarantee on that something being the last iteration (believe I have that right?)

This is a confusing loop... it returning something doesn't mean it is actually the last iteration though.

The function will return a result for each run request chunk, yielding many Nones between each run request chunk to allow the daemon to heartbeat.

once the submissions either finish or raise this particular error, the iteration is over

This part is true though, enabled by the break statements

Let me know if there are other ways you can imagine to make this more digestible and I can file a follow-up PR.
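
To make the yield-None-for-heartbeat pattern concrete, here is a hedged sketch with hypothetical names (not the actual submit_asset_runs_in_chunks): the generator yields None before each submission so the caller can heartbeat, yields a result per completed chunk, and stops as soon as the retryable error is raised.

from typing import Callable, Iterator, List, NamedTuple, Optional, Sequence


class CodeLocationUnreachableError(Exception):
    """Stand-in for the retryable error raised when the code server is down."""


class ChunkResult(NamedTuple):
    submitted: List[str]
    retryable_error_raised: bool


def submit_runs_in_chunks(
    run_requests: Sequence[str],
    submit_one: Callable[[str], str],  # hypothetical helper that submits one run
    chunk_size: int = 25,
) -> Iterator[Optional[ChunkResult]]:
    for start in range(0, len(run_requests), chunk_size):
        submitted: List[str] = []
        for request in run_requests[start : start + chunk_size]:
            yield None  # heartbeat opportunity before each submission
            try:
                submitted.append(submit_one(request))
            except CodeLocationUnreachableError:
                # Partial chunk: report it with the flag set and stop submitting.
                yield ChunkResult(submitted, retryable_error_raised=True)
                return
        yield ChunkResult(submitted, retryable_error_raised=False)

The caller then loops over this generator, continues on None to heartbeat, handles each ChunkResult, and breaks as soon as one has retryable_error_raised set.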

Comment on lines 740 to 737
backfill_data_with_submitted_runs = (
    backfill_data_with_submitted_runs.replace_latest_storage_id(
        previous_asset_backfill_data.latest_storage_id
    )
)
Member

An alternative would be to wait until here to update the storage ID in the 'expected' case where there is no error (and skip the update if there was an error), rather than updating it and then potentially un-updating it. That won't work if we use latest_storage_id before this point and expect it to be the new value, though.

Contributor Author

This is a good callout.

We're currently doing a weird thing where we call execute_asset_backfill_iteration_inner, which returns the expected resultant asset backfill data (including requested partitions and latest storage ID) after all runs are submitted. This function is used in tests to assert that a backfill iteration will result in X partitions being requested.

Then, because mid-iteration backfill cancellations and code location errors like this one can happen during run submission, we "un-update" the requested partitions and the cursor when unexpected things happen.

I think we should refactor execute_asset_backfill_iteration_inner to stop doing these un-updates. Instead we could handle observed changes (updated materialized/failed partitions) separately from expected state updates (partitions to request/next cursor), which I can do if I get the chance.

In the meantime, because there's already existing logic to do "un-updates", I'd prefer to keep the cursor "un-update" alongside that logic.
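
For illustration, one hypothetical shape of that refactor (names are made up, not the actual Dagster API): keep observed changes and expected updates in separate structures, and only commit the expected cursor once submission has actually succeeded, so no "un-update" is needed.

from typing import FrozenSet, NamedTuple, Optional


class ObservedChanges(NamedTuple):
    # Facts read from the event log this iteration; always safe to apply.
    newly_materialized_partitions: FrozenSet[str]
    newly_failed_partitions: FrozenSet[str]


class ExpectedUpdates(NamedTuple):
    # State we only want to commit if run submission actually succeeds.
    partitions_to_request: FrozenSet[str]
    next_storage_id: Optional[int]


def advance_cursor(
    current_storage_id: Optional[int],
    expected: ExpectedUpdates,
    submission_succeeded: bool,
) -> Optional[int]:
    # Observed changes get merged into the materialized/failed subsets
    # unconditionally elsewhere; the cursor only advances on success.
    return expected.next_storage_id if submission_succeeded else current_storage_id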

@@ -268,7 +277,9 @@ def submit_asset_runs_in_chunks(
    asset_graph: ExternalAssetGraph,
    debug_crash_flags: SingleInstigatorDebugCrashFlags,
    logger: logging.Logger,
) -> Iterator[Optional[Sequence[Tuple[RunRequest, DagsterRun]]]]:
    backfill_id: Optional[str] = None,
    catch_code_location_load_errors: bool = False,
Member

I only see one callsite of this method - is there a reason it needs to be toggleable?

Contributor Author

Yeah, only one callsite - I'm ok with making it non-toggleable.

@clairelin135 clairelin135 force-pushed the 01-30-claire/asset-backfill-resiliency-code-server-down branch from 017b4c8 to aa69b7d Compare February 1, 2024 22:18

github-actions bot commented Feb 1, 2024

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-j5lfu0mb2-elementl.vercel.app
https://01-30-claire-asset-backfill-resiliency-code-server-down.dagster.dagster-docs.io

Direct link to changed pages:


github-actions bot commented Feb 1, 2024

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-hc1rvk0n5-elementl.vercel.app
https://01-30-claire-asset-backfill-resiliency-code-server-down.core-storybook.dagster-docs.io

Built with commit aa69b7d.
This pull request is being automatically deployed with vercel-action

@sryza
Contributor

sryza commented Feb 1, 2024

From reading the description, this makes sense to me. I'm ✅ if @gibsondan is.

@clairelin135 clairelin135 force-pushed the 01-30-claire/asset-backfill-resiliency-code-server-down branch from aa69b7d to ec51322 Compare February 2, 2024 00:21
@clairelin135 clairelin135 force-pushed the 01-30-claire/asset-backfill-resiliency-code-server-down branch from ec51322 to d37db51 Compare February 2, 2024 00:29
@clairelin135
Contributor Author

Yup, tested locally and backfills are functioning as expected

@clairelin135 clairelin135 merged commit 3bc3964 into master Feb 2, 2024
1 check was pending
@clairelin135 clairelin135 deleted the 01-30-claire/asset-backfill-resiliency-code-server-down branch February 2, 2024 00:47
Successfully merging this pull request may close these issues:

[asset backfills] Improve resiliency when unable to reach gRPC server