
Log materialization planned event for run with partition range #18305

Merged
merged 9 commits into master from claire/status-single-run-backfill on Dec 28, 2023

Conversation

clairelin135
Contributor

@clairelin135 clairelin135 commented Nov 27, 2023

This PR enables the UI to correctly display failed/in-progress statuses for single-run backfills in Dagster Cloud.

This includes two major changes:

1. Log a materialization planned event containing a partitions subset for single-run backfills upon run creation

If all of the below conditions apply, we log a partitions-subset planned event for each asset:

  • Executing instance is a cloud instance
  • Run targets a partition range
  • Asset is partitioned (available on execution plan snapshot)

Otherwise, we fall back to the status-quo behavior (sketched below):

  1. If the run has a single partition, log a planned event with a partition key for each asset
  2. If the run targets a partition range, log a planned event with partition=None. We still log a planned event in this case because in certain places (e.g., asset backfills) we query planned events to see which assets are planned to execute.
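
A hedged Python sketch of the branching above (the cloud-instance condition is elided, the helper names are illustrative rather than the PR's exact code, and build_subset_from_range is a hypothetical helper sketched a little further down):

```python
from dagster._core.storage.tags import (
    ASSET_PARTITION_RANGE_END_TAG,
    ASSET_PARTITION_RANGE_START_TAG,
    PARTITION_NAME_TAG,
)


def planned_event_partition_payload(dagster_run, asset_is_partitioned, asset_job_partitions_def):
    """Decide what partition information a materialization planned event carries."""
    partition_key = dagster_run.tags.get(PARTITION_NAME_TAG)
    range_start = dagster_run.tags.get(ASSET_PARTITION_RANGE_START_TAG)
    range_end = dagster_run.tags.get(ASSET_PARTITION_RANGE_END_TAG)
    targets_range = range_start is not None and range_end is not None

    if targets_range and asset_is_partitioned and asset_job_partitions_def is not None:
        # New behavior: attach the targeted partitions subset, built from the
        # range and the partitions definition (hypothetical helper, see below).
        subset = build_subset_from_range(asset_job_partitions_def, range_start, range_end)
        return {"partitions_subset": subset}
    if asset_is_partitioned and partition_key is not None:
        # Status quo (1): single-partition run, log the partition key.
        return {"partition": partition_key}
    # Status quo (2): partition range without a usable partitions definition, or
    # an unpartitioned asset; still log a planned event with partition=None.
    return {"partition": None}
```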

2. Thread an asset_job_partitions_def through the run-creation call sites

We need access to the partitions definition of the assets/job in order to build the target partitions subset from the partition range. The alternative would be to serialize a list of targeted partition keys on the execution plan snapshot for partition-ranged runs, which bloats the snapshot.
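
For illustration, a minimal sketch of building that subset from the range, assuming a daily partitions definition for the example; get_partition_keys_in_range and subset_with_partition_keys are the PartitionsDefinition methods I would expect to use here, but this is not the PR's exact code:

```python
from dagster import DailyPartitionsDefinition, PartitionKeyRange


def build_subset_from_range(partitions_def, range_start, range_end):
    """Hypothetical helper: expand a tag-supplied partition range into a subset."""
    keys = partitions_def.get_partition_keys_in_range(PartitionKeyRange(range_start, range_end))
    return partitions_def.subset_with_partition_keys(keys)


# Example usage with an assumed daily-partitioned asset job:
daily = DailyPartitionsDefinition(start_date="2023-01-01")
subset = build_subset_from_range(daily, "2023-06-01", "2023-06-30")
print(len(subset.get_partition_keys()))  # 30 daily partition keys
```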

There are two different ways runs are created from jobs:

  1. The execution plan is built from the code location
  2. Testing contexts: the job definition is passed into execute_in_process and the execution plan is created from there

In the first case, the code location holds the external repository data in memory. We can then fetch the external partitions definition of the job, and thread this into create_run. We could also fetch the partitions definition per-asset from the code location, but that feels extraneous given that we know (1) the partitions definition on the job and (2) whether each asset is partitioned or not.

In the second case, we don't have access to the code location, so instead this PR constructs the external partitions definition data from the job def and threads that into create_run.
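
As a hedged illustration of the second path (the asset, job, and partitions definition below are made up for the example; this is not the PR's code), the partitions definition needed for that conversion is available directly on the in-process job definition:

```python
from dagster import DailyPartitionsDefinition, Definitions, asset, define_asset_job

daily = DailyPartitionsDefinition(start_date="2023-01-01")


@asset(partitions_def=daily)
def my_asset():
    ...


defs = Definitions(
    assets=[my_asset],
    jobs=[define_asset_job("my_job", selection="my_asset", partitions_def=daily)],
)

# In the execute_in_process/testing path there is no code location, but the
# resolved job definition carries the partitions definition; that is what gets
# turned into external partitions definition data and threaded into create_run.
job_def = defs.get_job_def("my_job")
print(job_def.partitions_def)
```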

Testing

Tested locally on cloud and OSS.

Corresponding internal PR: https://github.com/dagster-io/internal/pull/7615

@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 5 times, most recently from 2ed5c9f to 37c822b on December 1, 2023 06:20

github-actions bot commented Dec 1, 2023

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-iq9itvkm8-elementl.vercel.app
https://claire-status-single-run-backfill.components-storybook.dagster-docs.io

Built with commit 315cc58.
This pull request is being automatically deployed with vercel-action


github-actions bot commented Dec 1, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-3e7v2d188-elementl.vercel.app
https://claire-status-single-run-backfill.core-storybook.dagster-docs.io

Built with commit 315cc58.
This pull request is being automatically deployed with vercel-action

@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 37c822b to 315cc58 on December 2, 2023 00:31
@clairelin135 clairelin135 changed the base branch from master to 11-29-serialize_subset_on_event_first_stab on December 2, 2023 01:30
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 315cc58 to fbd8901 on December 2, 2023 01:31
@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch 3 times, most recently from ee96526 to c12261a on December 2, 2023 01:41
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 2 times, most recently from 037fc5a to 3d65c2b on December 4, 2023 22:00
@clairelin135
Contributor Author

Must be merged with internal PR: https://github.com/dagster-io/internal/pull/7615

@clairelin135 clairelin135 changed the title from "Log materialization planned events for run with partition range" to "Log materialization planned event for run with partition range" on Dec 4, 2023
@clairelin135 clairelin135 marked this pull request as ready for review December 4, 2023 22:07
@@ -1303,8 +1319,12 @@ def _ensure_persisted_execution_plan_snapshot(
return execution_plan_snapshot_id

def _log_asset_planned_events(
self, dagster_run: DagsterRun, execution_plan_snapshot: "ExecutionPlanSnapshot"
self,
Contributor

_log_asset_planned_events is a pretty long function already and it's getting significantly longer here. Is there a way to break it up a little?

Contributor Author

yup, split the materialization planned events logic out into a different function

Member

Consider putting the event generation on a staticmethod on DagsterEvent?

Contributor Author

yep, added

@@ -1900,6 +1908,24 @@ def external_schedule_data_from_def(schedule_def: ScheduleDefinition) -> Externa
)


def can_build_external_partitions_definition_from_def(partitions_def: PartitionsDefinition):
Contributor

return type annotation plz

job_def.partitions_def
)
if job_def.partitions_def
and can_build_external_partitions_definition_from_def(job_def.partitions_def)
Contributor

Without this check, we would just error, right? What's wrong with that? Might be better to find out early than discover a problem later along the way?

Contributor Author (Dec 5, 2023)

This check guards against the backcompat dynamic partitions case (a function that returns partition keys)

Without it we raise an error, but what we want is to not yield the asset materialization planned events with a subset

Contributor

This check guards against the backcompat dynamic partitions case (a function that returns partition keys)

Is it currently possible to launch a range backfill using function-based dynamic partitions? I would be comfortable saying that's not allowed.

Contributor Author

It's not possible from the UI but I don't think we gate it if you were to yield a run request with tags or something like that.

Agree that we can make it not allowed, I don't think anyone is doing that anyway
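
For reference, a hedged sketch of what the annotated guard could look like (an illustration, not the PR's implementation; it assumes the deprecated function-backed DynamicPartitionsDefinition is distinguishable by having no name):

```python
from dagster import DynamicPartitionsDefinition
from dagster._core.definitions.partition import PartitionsDefinition


def can_build_external_partitions_definition_from_def(
    partitions_def: PartitionsDefinition,
) -> bool:
    # The backcompat DynamicPartitionsDefinition constructed from a callable
    # (rather than a name) has no serializable external representation, so it
    # is excluded; everything else can be serialized.
    if isinstance(partitions_def, DynamicPartitionsDefinition):
        return partitions_def.name is not None
    return True
```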

and check.not_none(output.properties).is_asset_partitioned
):
partitions_def = (
external_partitions_def_data.get_partitions_definition()
Contributor

This might not be super cheap, right? For static partitions it's O(# partitions), and for time window partitions it involves some pendulum stuff. Thoughts on caching the PartitionsDefinition in pipeline_and_execution_plan_cache instead of the ExternalPartitionsDefinitionData?

Another benefit of this would be spreading ExternalPartitionsDefinitionData to fewer places – longer term it could be nice to get rid of that and just use @whitelist_for_serdes on PartitionsDefinition instead.

Contributor Author

Thoughts on caching the PartitionsDefinition in pipeline_and_execution_plan_cache instead of the ExternalPartitionsDefinitionData?

You mean passing around the PartitionsDefinition instead of the external partitions definition everywhere, right? I'm good with this; it probably saves us some partitions-def round trips in the asset backfill case

Contributor

You mean passing around the PartitionsDefinition instead of the external partitions definition everywhere, right?

Exactly

Contributor Author

Wondering if we need to worry about it being unclear that backcompat dynamic partitions defs won't be passed into create_run...

Could name the new param assets_partitions_def? What do you think?

Contributor

Could name the new param assets_partitions_def? What do you think?

That's a good idea

Contributor Author

OK, I've gone ahead with the rename, though I changed it to asset_job_partitions_def since I think that is a better description.

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch 2 times, most recently from 8214fd2 to a7211b9 on December 5, 2023 01:07
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 2 times, most recently from caf5788 to 47d613a on December 5, 2023 01:51
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch 4 times, most recently from caddca9 to e3a0ab0 on December 5, 2023 03:32
prha previously requested changes Dec 5, 2023
@@ -110,6 +112,7 @@ def create_valid_pipeline_run(
status=DagsterRunStatus.NOT_STARTED,
external_job_origin=external_pipeline.get_external_origin(),
job_code_origin=external_pipeline.get_python_origin(),
asset_job_partitions_def=code_location.get_asset_job_partitions_def(external_pipeline),
Member

What's the size of the partitions def typically? This can still be quite large if it's a static partitions def but small in all other cases?

For a static partitions def, would we just want the serialization of the partition keys? Maybe a micro-optimization...

Member

If the point of partition ranges is just to help with write throughput, then should there be some threshold of how large the ranges are?

Contributor Author

This can still be quite large if it's a static partitions def but small in all other cases?

Yes, this is true...

Are we worried about performance if the size of the partitions def is large? Between serializing a large execution plan versus having a large in-memory partitions def being passed around, I lean toward having the in-memory object

Contributor Author

If the point of partition ranges is just to help with write throughput, then should there be some threshold of how large the ranges are?

I think it's to minimize the number of created runs (maybe the same as what you mean by write throughput).

I think it's sensible to limit the size of a range, though the range should be large enough to target all partitions of a reasonably-sized partitions def.

Maybe we need to first decide a limit on the size of partitions defs to enforce this?


# For now, yielding materialization planned events for single run backfills
# is only supported on cloud
if self.is_cloud_instance and check.not_none(output.properties).is_asset_partitioned:
Member

Is the flag necessary? Could we always populate the subset but just update the partitions table in the storage layer?

We subclass DagsterInstance in cloud, and use it as a place to override default behavior. Adding an is_cloud_instance flag in OSS feels off to me and could promote more confusing intertwining of logic.

Contributor Author

This is a good call. I believe OSS functions the same regardless of whether partition_key=None or partitions_subset is populated.

@@ -166,3 +171,103 @@ def my_other_asset(my_asset):
)
)
assert record.event_log_entry.dagster_event.event_specific_data.partition is None


def test_subset_on_asset_materialization_planned_event_for_single_run_backfill_allowed():
Member

Can you add tests flexing how some of the asset partition reads are affected?

get_materialized_partitions, get_latest_storage_id_by_partition, get_latest_tags_by_partition, get_latest_asset_partition_materialization_attempts_without_materializations

Contributor Author

I've added tests for these methods, except get_latest_tags_by_partition; that method filters by event type, and materialization planned events can't contain tags right now.

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch from 21ca2e7 to 8bffa6b on December 8, 2023 23:20
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from e3a0ab0 to 2ddbb10 on December 9, 2023 03:07
@clairelin135 clairelin135 requested review from prha and sryza December 9, 2023 03:31
Contributor

@sryza sryza left a comment

This looks good to me

)

partition_tag = dagster_run.tags.get(PARTITION_NAME_TAG)
partition_range_start, partition_range_end = (
Contributor

Not blocking feedback, but pushing this down into a utility or a method on DagsterRun could help minimize the number of places in the codebase where we need to parse these tags.
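
A hedged sketch of the kind of utility being suggested here (the name and return shape are hypothetical):

```python
from typing import Mapping, Optional, Tuple

from dagster._core.storage.tags import (
    ASSET_PARTITION_RANGE_END_TAG,
    ASSET_PARTITION_RANGE_START_TAG,
)


def get_partition_key_range_from_tags(tags: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Hypothetical utility: return (start, end) if the run's tags target a partition range."""
    start = tags.get(ASSET_PARTITION_RANGE_START_TAG)
    end = tags.get(ASSET_PARTITION_RANGE_END_TAG)
    if start is None or end is None:
        return None
    return (start, end)
```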

Contributor Author

That's a good call, though I'd prefer to split this out into a separate PR to fully refactor the existing references to these tags.

Contributor

That makes sense

@clairelin135 clairelin135 force-pushed the 11-29-serialize_subset_on_event_first_stab branch from 8bffa6b to 9007970 on December 12, 2023 21:54
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 2ddbb10 to a28343f on December 12, 2023 22:04
Base automatically changed from 11-29-serialize_subset_on_event_first_stab to master on December 12, 2023 22:24
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from a28343f to 18d362a on December 12, 2023 22:25
@clairelin135 clairelin135 force-pushed the claire/status-single-run-backfill branch from 18d362a to 24c9557 on December 28, 2023 18:33
@clairelin135 clairelin135 dismissed prha’s stale review December 28, 2023 21:03

feedback addressed

@clairelin135 clairelin135 merged commit 04b592e into master Dec 28, 2023
1 check passed
@clairelin135 clairelin135 deleted the claire/status-single-run-backfill branch December 28, 2023 21:17