
Log materialization planned event for run with partition range #18305

Merged
merged 9 commits into master from claire/status-single-run-backfill on Dec 28, 2023

Conversation

clairelin135
Contributor

@clairelin135 clairelin135 commented Nov 27, 2023

This PR enables the UI to correctly display failed/in-progress statuses for single-run backfills in Dagster Cloud.

This includes two major changes:

1. Log a materialization planned event containing a partitions subset for single-run backfills upon run creation

If all of the below conditions apply, we log a partitions-subset planned event for each asset:

  • Executing instance is a cloud instance
  • Run targets a partition range
  • Asset is partitioned (available on execution plan snapshot)

Otherwise, we fall back to the status-quo behavior (sketched below):

  1. If the run has a single partition, log a planned event with a partition key for each asset
  2. If the run targets a partition range, log a planned event with partition=None. We still log a planned event in this case because in certain places (e.g., asset backfills) we query planned events to see which assets are planned to execute.
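
A hedged Python sketch of the branching above (the cloud-instance condition is elided, the helper names are illustrative rather than the PR's exact code, and build_subset_from_range is a hypothetical helper sketched a little further down):

```python
from dagster._core.storage.tags import (
    ASSET_PARTITION_RANGE_END_TAG,
    ASSET_PARTITION_RANGE_START_TAG,
    PARTITION_NAME_TAG,
)


def planned_event_partition_payload(dagster_run, asset_is_partitioned, asset_job_partitions_def):
    """Decide what partition information a materialization planned event carries."""
    partition_key = dagster_run.tags.get(PARTITION_NAME_TAG)
    range_start = dagster_run.tags.get(ASSET_PARTITION_RANGE_START_TAG)
    range_end = dagster_run.tags.get(ASSET_PARTITION_RANGE_END_TAG)
    targets_range = range_start is not None and range_end is not None

    if targets_range and asset_is_partitioned and asset_job_partitions_def is not None:
        # New behavior: attach the targeted partitions subset, built from the
        # range and the partitions definition (hypothetical helper, see below).
        subset = build_subset_from_range(asset_job_partitions_def, range_start, range_end)
        return {"partitions_subset": subset}
    if asset_is_partitioned and partition_key is not None:
        # Status quo (1): single-partition run, log the partition key.
        return {"partition": partition_key}
    # Status quo (2): partition range without a usable partitions definition, or
    # an unpartitioned asset; still log a planned event with partition=None.
    return {"partition": None}
```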

2. Thread an asset_job_partitions_def through the run-creation call sites

We need access to the partitions definition of the assets/job in order to build the target partitions subset from the partition range. The alternative would be to serialize a list of targeted partition keys on the execution plan snapshot for partition-ranged runs, which bloats the snapshot.
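
For illustration, a minimal sketch of building that subset from the range, assuming a daily partitions definition for the example; get_partition_keys_in_range and subset_with_partition_keys are the PartitionsDefinition methods I would expect to use here, but this is not the PR's exact code:

```python
from dagster import DailyPartitionsDefinition, PartitionKeyRange


def build_subset_from_range(partitions_def, range_start, range_end):
    """Hypothetical helper: expand a tag-supplied partition range into a subset."""
    keys = partitions_def.get_partition_keys_in_range(PartitionKeyRange(range_start, range_end))
    return partitions_def.subset_with_partition_keys(keys)


# Example usage with an assumed daily-partitioned asset job:
daily = DailyPartitionsDefinition(start_date="2023-01-01")
subset = build_subset_from_range(daily, "2023-06-01", "2023-06-30")
print(len(subset.get_partition_keys()))  # 30 daily partition keys
```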

There are two different ways runs are created from jobs:

  1. The execution plan is built from the code location
  2. Testing contexts: the job definition is passed into execute_in_process and the execution plan is created from there

In the first case, the code location holds the external repository data in memory. We can then fetch the external partitions definition of the job, and thread this into create_run. We could also fetch the partitions definition per-asset from the code location, but that feels extraneous given that we know (1) the partitions definition on the job and (2) whether each asset is partitioned or not.

In the second case, we don't have access to the code location, so instead this PR constructs the external partitions definition data from the job def and threads that into create_run.
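
As a hedged illustration of the second path (the asset, job, and partitions definition below are made up for the example; this is not the PR's code), the partitions definition needed for that conversion is available directly on the in-process job definition:

```python
from dagster import DailyPartitionsDefinition, Definitions, asset, define_asset_job

daily = DailyPartitionsDefinition(start_date="2023-01-01")


@asset(partitions_def=daily)
def my_asset():
    ...


defs = Definitions(
    assets=[my_asset],
    jobs=[define_asset_job("my_job", selection="my_asset", partitions_def=daily)],
)

# In the execute_in_process/testing path there is no code location, but the
# resolved job definition carries the partitions definition; that is what gets
# turned into external partitions definition data and threaded into create_run.
job_def = defs.get_job_def("my_job")
print(job_def.partitions_def)
```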

Testing

Tested locally on cloud and OSS.

Corresponding internal PR: https://github.com/dagster-io/internal/pull/7615

@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 5 times, most recently from 2ed5c9f to 37c822b on December 1, 2023 06:20

github-actions bot commented Dec 1, 2023

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-iq9itvkm8-elementl.vercel.app
https://claire-status-single-run-backfill.components-storybook.dagster-docs.io

Built with commit 315cc58.
This pull request is being automatically deployed with vercel-action


github-actions bot commented Dec 1, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-3e7v2d188-elementl.vercel.app
https://claire-status-single-run-backfill.core-storybook.dagster-docs.io

Built with commit 315cc58.
This pull request is being automatically deployed with vercel-action

@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 37c822b to 315cc58 on December 2, 2023 00:31
@clairelin135 clairelin135 changed the base branch from master to 11-29-serialize_subset_on_event_first_stab on December 2, 2023 01:30
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 315cc58 to fbd8901 on December 2, 2023 01:31
@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch 3 times, most recently from ee96526 to c12261a on December 2, 2023 01:41
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 2 times, most recently from 037fc5a to 3d65c2b on December 4, 2023 22:00
@clairelin135
Contributor Author

Must be merged with internal PR: https://github.com/dagster-io/internal/pull/7615

@clairelin135 clairelin135 changed the title from "Log materialization planned events for run with partition range" to "Log materialization planned event for run with partition range" on Dec 4, 2023
@clairelin135 clairelin135 marked this pull request as ready for review December 4, 2023 22:07
@@ -1303,8 +1319,12 @@ def _ensure_persisted_execution_plan_snapshot(
return execution_plan_snapshot_id

def _log_asset_planned_events(
self, dagster_run: DagsterRun, execution_plan_snapshot: "ExecutionPlanSnapshot"
self,
Contributor

_log_asset_planned_events is a pretty long function already and it's getting significantly longer here. Is there a way to break it up a little?

Contributor Author

yup, split the materialization planned events logic out into a different function

Member

Consider putting the event generation on a staticmethod on DagsterEvent?

Contributor Author

yep, added

@@ -1900,6 +1908,24 @@ def external_schedule_data_from_def(schedule_def: ScheduleDefinition) -> Externa
)


def can_build_external_partitions_definition_from_def(partitions_def: PartitionsDefinition):
Contributor

return type annotation plz

job_def.partitions_def
)
if job_def.partitions_def
and can_build_external_partitions_definition_from_def(job_def.partitions_def)
Contributor

Without this check, we would just error, right? What's wrong with that? Might be better to find out early than discover a problem later along the way?

Contributor Author (Dec 5, 2023)

This check guards against the backcompat dynamic partitions case (a function that returns partition keys)

Without it we raise an error, but what we want is to not yield the asset materialization planned events with a subset

Contributor

This check guards against the backcompat dynamic partitions case (a function that returns partition keys)

Is it currently possible to launch a range backfill using function-based dynamic partitions? I would be comfortable saying that's not allowed.

Contributor Author

It's not possible from the UI but I don't think we gate it if you were to yield a run request with tags or something like that.

Agree that we can make it not allowed, I don't think anyone is doing that anyway
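
For reference, a hedged sketch of what the annotated guard could look like (an illustration, not the PR's implementation; it assumes the deprecated function-backed DynamicPartitionsDefinition is distinguishable by having no name):

```python
from dagster import DynamicPartitionsDefinition
from dagster._core.definitions.partition import PartitionsDefinition


def can_build_external_partitions_definition_from_def(
    partitions_def: PartitionsDefinition,
) -> bool:
    # The backcompat DynamicPartitionsDefinition constructed from a callable
    # (rather than a name) has no serializable external representation, so it
    # is excluded; everything else can be serialized.
    if isinstance(partitions_def, DynamicPartitionsDefinition):
        return partitions_def.name is not None
    return True
```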

and check.not_none(output.properties).is_asset_partitioned
):
partitions_def = (
external_partitions_def_data.get_partitions_definition()
Contributor

This might not be super cheap, right? For static partitions it's O(# partitions), and for time window partitions it involves some pendulum stuff. Thoughts on caching the PartitionsDefinition in pipeline_and_execution_plan_cache instead of the ExternalPartitionsDefinitionData?

Another benefit of this would be spreading ExternalPartitionsDefinitionData to fewer places – longer term it could be nice to get rid of that and just use @whitelist_for_serdes on PartitionsDefinition instead.

Contributor Author

Thoughts on caching the PartitionsDefinition in pipeline_and_execution_plan_cache instead of the ExternalPartitionsDefinitionData?

You mean passing around the PartitionsDefinition instead of the external partitions definition everywhere, right? I'm good with this; it probably saves us some partitions-def round trips in the asset backfill case

Contributor

You mean passing around the PartitionsDefinition instead of the external partitions definition everywhere, right?

Exactly

Contributor Author

Wondering if we need to worry about it being unclear that backcompat dynamic partitions defs won't be passed into create_run...

Could name the new param assets_partitions_def? What do you think?

Contributor

Could name the new param assets_partitions_def? What do you think?

That's a good idea

Contributor Author

OK, I've gone ahead with the rename, though I changed it to asset_job_partitions_def since I think that is a better description.

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch 2 times, most recently from 8214fd2 to a7211b9 on December 5, 2023 01:07
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 2 times, most recently from caf5788 to 47d613a on December 5, 2023 01:51
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 4 times, most recently from caddca9 to e3a0ab0 on December 5, 2023 03:32
prha previously requested changes Dec 5, 2023
@@ -110,6 +112,7 @@ def create_valid_pipeline_run(
status=DagsterRunStatus.NOT_STARTED,
external_job_origin=external_pipeline.get_external_origin(),
job_code_origin=external_pipeline.get_python_origin(),
asset_job_partitions_def=code_location.get_asset_job_partitions_def(external_pipeline),
Member

What's the size of the partitions def typically? This can still be quite large if it's a static partitions def but small in all other cases?

For a static partitions def, would we just want the serialization of the partition keys? Maybe a micro-optimization...

Member

If the point of partition ranges is just to help with write throughput, then should there be some threshold of how large the ranges are?

Contributor Author

This can still be quite large if it's a static partitions def but small in all other cases?

Yes, this is true...

Are we worried about performance if the size of the partitions def is large? Between serializing a large execution plan versus having a large in-memory partitions def being passed around, I lean toward having the in-memory object

Contributor Author

If the point of partition ranges is just to help with write throughput, then should there be some threshold of how large the ranges are?

I think it's to minimize the number of created runs (maybe the same as what you mean by write throughput).

I think it's sensible to limit the size of a range, though the range should be large enough to target all partitions of a reasonably-sized partitions def.

Maybe we need to first decide a limit on the size of partitions defs to enforce this?


# For now, yielding materialization planned events for single run backfills
# is only supported on cloud
if self.is_cloud_instance and check.not_none(output.properties).is_asset_partitioned:
Member

Is the flag necessary? Could we always populate the subset but just update the partitions table in the storage layer?

We subclass DagsterInstance in cloud, and use it as a place to override default behavior. Adding an is_cloud_instance flag in OSS feels off to me and could promote more confusing intertwining of logic.

Contributor Author

This is a good call. I believe OSS functions the same regardless of whether partition_key=None or partitions_subset is populated.

@@ -166,3 +171,103 @@ def my_other_asset(my_asset):
)
)
assert record.event_log_entry.dagster_event.event_specific_data.partition is None


def test_subset_on_asset_materialization_planned_event_for_single_run_backfill_allowed():
Member

Can you add tests flexing how some of the asset partition reads are affected?

get_materialized_partitions, get_latest_storage_id_by_partition, get_latest_tags_by_partition, get_latest_asset_partition_materialization_attempts_without_materializations

Contributor Author

I've added tests for these methods, except get_latest_tags_by_partition; that method filters by event type, and materialization planned events can't contain tags right now.

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch from 21ca2e7 to 8bffa6b on December 8, 2023 23:20
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from e3a0ab0 to 2ddbb10 on December 9, 2023 03:07
@clairelin135 clairelin135 requested review from prha and sryza December 9, 2023 03:31
Contributor

@sryza sryza left a comment

This looks good to me

)

partition_tag = dagster_run.tags.get(PARTITION_NAME_TAG)
partition_range_start, partition_range_end = (
Contributor

Not blocking feedback, but pushing this down into a utility or a method on DagsterRun could help minimize the number of places in the codebase where we need to parse these tags.
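
A hedged sketch of the kind of utility being suggested here (the name and return shape are hypothetical):

```python
from typing import Mapping, Optional, Tuple

from dagster._core.storage.tags import (
    ASSET_PARTITION_RANGE_END_TAG,
    ASSET_PARTITION_RANGE_START_TAG,
)


def get_partition_key_range_from_tags(tags: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Hypothetical utility: return (start, end) if the run's tags target a partition range."""
    start = tags.get(ASSET_PARTITION_RANGE_START_TAG)
    end = tags.get(ASSET_PARTITION_RANGE_END_TAG)
    if start is None or end is None:
        return None
    return (start, end)
```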

Contributor Author

That's a good call, though I'd prefer to split this out into a separate PR to fully refactor the existing references to these tags.

Contributor

That makes sense

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch from 8bffa6b to 9007970 on December 12, 2023 21:54
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 2ddbb10 to a28343f on December 12, 2023 22:04
Base automatically changed from 11-29-serialize_subset_on_event_first_stab to master on December 12, 2023 22:24
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from a28343f to 18d362a on December 12, 2023 22:25
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 18d362a to 24c9557 on December 28, 2023 18:33
@clairelin135 clairelin135 dismissed prha’s stale review December 28, 2023 21:03

feedback addressed

@clairelin135 clairelin135 merged commit 04b592e into master Dec 28, 2023
1 check passed
@clairelin135 clairelin135 deleted the claire/status-single-run-backfill branch December 28, 2023 21:17