[7/n subset refactor] Use new asset backfill data serialization format #17929

clairelin135 · 2023-11-11T00:42:19Z

At long last, we have arrived at the final part of this stack. This PR migrates the PartitionBackfill logic to use the new serialization format of AssetBackfillData.

For backfills created from now on, the UI will be able to display the asset backfill page even if the partitions defs are changed or removed. For old backfills, the UI will continue to show the "partitions def has changed" message, and the asset backfill page will be blank.

Description of changes:

Adds an additional asset_backfill_data field to PartitionBackfill
- Asset backfills from now on will use this field with the new serialization
- Existing backfills will continue to use serialized_asset_backfill_data. We could force the daemon to migrate these objects mid-backfill, but that value add is pretty low. Keeping the formats separate also improves debuggability, since old backfills always use the old serialization and new backfills always use the new one.
Serializes the unique ID of each partitions def in a field on AssetBackfillData. Adds a new method in asset backfill execution that uses the unique ID to check if partitions defs have changed, in which case we should stop execution.
- This previously existed in the old serialization version of AssetGraphSubset, but was unfortunately duplicated across each subset type (materialized, in-progress, failed)
Adds test cases to cover this new surface area
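
The two-field arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the actual Dagster code: `BackfillRecord`, its field types, and the `serialization_format` property are hypothetical stand-ins; only the two field names come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackfillRecord:
    """Hypothetical stand-in for PartitionBackfill carrying both fields."""

    backfill_id: str
    # Legacy field: opaque string written by the old serialization.
    serialized_asset_backfill_data: Optional[str] = None
    # New field: structured payload written by the new serialization.
    # (A plain dict is an assumption made for this sketch.)
    asset_backfill_data: Optional[dict] = None

    @property
    def is_asset_backfill(self) -> bool:
        # A backfill targets assets if either field is populated.
        return (
            self.serialized_asset_backfill_data is not None
            or self.asset_backfill_data is not None
        )

    @property
    def serialization_format(self) -> Optional[str]:
        # New backfills only ever populate the new field and old backfills
        # only the legacy one, so the populated field identifies the format.
        if self.asset_backfill_data is not None:
            return "new"
        if self.serialized_asset_backfill_data is not None:
            return "legacy"
        return None
```

Because a record is never migrated mid-backfill, the format it started with is the format it keeps, which is what makes the two code paths easy to tell apart when debugging.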

clairelin135 · 2023-11-11T00:42:57Z

Current dependencies on/for this PR:

master
- PR [1/n subset refactor] Split TimeWindowPartitionsSubset #17684
  - PR [2/n subset refactor] Make TimeWindowPartitionsDefinition serializable #17660
    - PR [3/n subset refactor] Add whitelist_for_serdes to TimeWindowPartitionsSubset #17702
      - PR [4/n subset refactor] Add whitelist_for_serdes to DefaultPartitionsSubset #17703
        - PR [5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys #18057
          - PR [6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes #17844
            - PR [7/n subset refactor] Use new asset backfill data serialization format #17929 👈

This stack of pull requests is managed by Graphite.

sryza

> We could force the daemon to migrate these objects mid-backfill, but that value add is pretty low.

I definitely agree

sryza · 2023-11-22T16:32:00Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+            ("requested_subset", AssetGraphSubset),
+            ("failed_and_downstream_subset", AssetGraphSubset),
+            ("backfill_start_time", datetime),
+            ("partitions_def_ids_by_asset_key", Optional[Mapping[AssetKey, str]]),


Why exactly do we need this? Because we want to fail the backfill if the partitions def changes during its execution?

Is this definitely still necessary now?

clairelin135 · 2023-11-22

> Because we want to fail the backfill if the partitions def changes during its execution?

Yes

The alternative would be to check whether each targeted partition still exists in the partitions def

sryza · 2023-11-22T16:35:29Z

python_modules/dagster/dagster/_core/execution/backfill.py

+            self.serialized_asset_backfill_data is not None or self.asset_backfill_data is not None
+        )
+
+    def get_asset_backfill_data(self, asset_graph: AssetGraph) -> Optional[AssetBackfillData]:


I find the behavior of this function a little unexpected with respect to its signature. Just based on the signature, I would expect it to return None in the case where there is no backfill data, not in the case where there's a deserialization issue. Thoughts on still handling the error inside the caller instead of here?

Seems like we then might be able to use this at [1] instead of having to directly call from_serialized there?
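
One way to read this suggestion, as a minimal sketch: the getter returns None only when there is no data, and a payload that cannot be loaded raises, leaving the caller to decide what to do. The names `get_backfill_data`, `assets_to_consider`, and `BackfillDeserializationError` are hypothetical, and JSON stands in for the real serialization.

```python
import json
import logging
from typing import List, Optional


class BackfillDeserializationError(Exception):
    """Hypothetical error for a stored payload that can no longer be loaded."""


def get_backfill_data(serialized: Optional[str]) -> Optional[dict]:
    """Return None only when there is no data; raise if the data is unreadable."""
    if serialized is None:
        return None
    try:
        return json.loads(serialized)
    except json.JSONDecodeError as e:
        raise BackfillDeserializationError(str(e)) from e


def assets_to_consider(
    backfill_id: str, serialized: Optional[str], logger: logging.Logger
) -> List[str]:
    """Caller-side handling: log and skip backfills whose data cannot be read."""
    try:
        data = get_backfill_data(serialized)
    except BackfillDeserializationError:
        logger.warning(
            f"Not considering assets in backfill {backfill_id} since its"
            " data could not be deserialized"
        )
        return []
    if data is None:
        return []
    # "requested" is an assumed key in the illustrative payload.
    return sorted(data["requested"])
```

With this split, a caller that wants to fail loudly can simply let the exception propagate, while a caller that prefers to skip the backfill catches it, as above.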

sryza · 2023-11-22T16:36:45Z

python_modules/dagster/dagster/_utils/caching_instance_queryer.py

-                self._logger.warning(
-                    f"Not considering assets in backfill {asset_backfill.backfill_id} since its"
-                    " data could not be deserialized"
+            if asset_backfill.serialized_asset_backfill_data:


sryza · 2023-11-22T16:39:02Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

-        previous_asset_backfill_data = AssetBackfillData.from_serialized(
-            backfill.serialized_asset_backfill_data, asset_graph, backfill.backfill_timestamp
-        )
+        if backfill.serialized_asset_backfill_data:


Could it make sense to consolidate parts of this implementation with get_asset_backfill_data?

python_modules/dagster/dagster/_core/execution/asset_backfill.py

clairelin135 · 2023-11-22T22:32:38Z

@sryza This PR has been updated to address the PR feedback above. The main update is removing partitions_def_ids_by_asset_key as discussed. The new behavior is:

- For time-partitioned assets, raise an error if the partitions def is changed
- For other partitioned assets, raise an error if a targeted partition is removed
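
The two rules above can be sketched as a single validity check. This is an illustration under assumptions, not the code in this PR: `TimeWindowDef`, `KeyedDef`, `PartitionsDefChangedError`, and `check_backfill_still_valid` are all hypothetical names, and the partitions defs are reduced to simple value objects.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class TimeWindowDef:
    """Hypothetical time-window partitions def, compared by value."""

    cron_schedule: str
    start: str


@dataclass(frozen=True)
class KeyedDef:
    """Hypothetical non-time partitions def: an explicit collection of keys."""

    partition_keys: Tuple[str, ...]


class PartitionsDefChangedError(Exception):
    """Hypothetical error raised when a backfill can no longer continue."""


def check_backfill_still_valid(
    stored_def: Union[TimeWindowDef, KeyedDef],
    current_def: Union[TimeWindowDef, KeyedDef],
    target_keys: Iterable[str],
) -> None:
    if isinstance(stored_def, TimeWindowDef):
        # Time-partitioned: any change to the definition is an error.
        if stored_def != current_def:
            raise PartitionsDefChangedError("time window partitions def changed")
        return
    if not isinstance(current_def, KeyedDef):
        raise PartitionsDefChangedError("partitions def type changed")
    # Other partitioned assets: only removed *targeted* partitions matter.
    missing = set(target_keys) - set(current_def.partition_keys)
    if missing:
        raise PartitionsDefChangedError(
            f"targeted partitions removed: {sorted(missing)}"
        )
```

Note that under the second rule, adding partitions or removing partitions the backfill never targeted does not stop the backfill.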

sryza

One last comment. Otherwise looks good to go!

sryza · 2023-11-22T23:17:05Z

python_modules/dagster/dagster/_core/execution/asset_backfill.py

+        else:
+            # Check that all target partitions still exist. If so, the backfill can continue.
+            for target_key in target_partitions_subset.get_partition_keys():
+                if not partitions_def.has_partition_key(


Will this be O(n^2)? Is there a way to make it O(n)?

clairelin135 · 2023-11-23

Assuming N is the number of partitions in the partitions def, it's not O(N^2): each partitions def has its own has_partition_key method that is optimized to avoid fetching all partitions.

sryza · 2023-11-23

For StaticPartitionsDefinition, my read is that has_partition_key looks inside self._partitions, which is a sequence and thus O(# partitions) to check if it has a value. Is that off base?

And for dynamic partitions, will this not result in O(target_partitions_subset.get_partition_keys()) calls to the database to get all partition keys?

clairelin135 · 2023-11-27

Ah, yeah, it looks like for static partitions defs the call will be O(N^2).

> And for dynamic partitions, will this not result in O(target_partitions_subset.get_partition_keys()) calls to the database to get all partition keys?

And yes, this is true.

clairelin135 · 2023-11-27

I've amended the behavior to instead use the AllPartitionsSubset to build a subset of all valid partitions, then compare against that subset.

This should be O(N) now.
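
The difference between the two approaches in this thread can be illustrated with a small sketch. It is not the AllPartitionsSubset implementation: `FakePartitionStore` is a hypothetical stand-in for a partitions def whose `has_partition_key` fetches every key on each call, and a plain Python set stands in for the subset of all valid partitions.

```python
from typing import Iterable, List, Set


class FakePartitionStore:
    """Hypothetical partitions source that counts how often keys are fetched."""

    def __init__(self, keys: List[str]):
        self._keys = list(keys)
        self.fetch_count = 0

    def get_partition_keys(self) -> List[str]:
        # Stands in for a database round trip (dynamic partitions).
        self.fetch_count += 1
        return list(self._keys)

    def has_partition_key(self, key: str) -> bool:
        # One fetch plus a linear scan per call.
        return key in self.get_partition_keys()


def missing_keys_per_key(store: FakePartitionStore, targets: Iterable[str]) -> Set[str]:
    # One has_partition_key call per target: len(targets) fetches and scans.
    return {k for k in targets if not store.has_partition_key(k)}


def missing_keys_single_pass(store: FakePartitionStore, targets: Iterable[str]) -> Set[str]:
    # Fetch all valid keys once, then do constant-time set lookups.
    valid = set(store.get_partition_keys())
    return {k for k in targets if k not in valid}
```

Both functions return the same set of missing keys; the per-key version fetches once per target, while the single-pass version fetches exactly once regardless of how many partitions are targeted.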

sryza

Ship it!

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from c8b3b0c to 9c0eed0 Compare November 13, 2023 21:12

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch 5 times, most recently from ada6238 to 06c6bfa Compare November 14, 2023 00:21

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from fcefec8 to e701159 Compare November 14, 2023 17:54

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch 2 times, most recently from f482441 to 0e7cab0 Compare November 14, 2023 19:15

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 02fe3e8 to 3cc50a0 Compare November 16, 2023 00:34

clairelin135 mentioned this pull request Nov 16, 2023

[5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys #18057

Merged

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 3cc50a0 to baf19c5 Compare November 16, 2023 19:25

clairelin135 changed the title ~~[6/n subset refactor] Use new asset backfill data serialization format~~ [7/n subset refactor] Use new asset backfill data serialization format Nov 16, 2023

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from baf19c5 to 73a020f Compare November 16, 2023 20:49

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from 0e7cab0 to a64597b Compare November 16, 2023 20:49

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 68d152d to 329f806 Compare November 17, 2023 00:11

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from a64597b to 96a6b69 Compare November 17, 2023 00:11

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 329f806 to 6850753 Compare November 20, 2023 23:46

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from f5e0097 to 90bbff3 Compare November 20, 2023 23:46

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 6850753 to 7d3db96 Compare November 21, 2023 19:46

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from e490f29 to 5732fb7 Compare November 21, 2023 19:46

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 7d3db96 to 323344d Compare November 21, 2023 21:10

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from f53075b to 9a28f89 Compare November 21, 2023 22:50

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 323344d to 66d2810 Compare November 21, 2023 22:50

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from 9a28f89 to f1a64ed Compare November 21, 2023 22:51

clairelin135 marked this pull request as ready for review November 21, 2023 23:50

clairelin135 requested review from sryza and OwenKephart November 22, 2023 00:12

sryza reviewed Nov 22, 2023


clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 66d2810 to 92b58a5 Compare November 22, 2023 19:58

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch 3 times, most recently from 5e8bfc8 to fc79f17 Compare November 22, 2023 22:12

Base automatically changed from 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable to master November 22, 2023 22:30

clairelin135 added 10 commits November 22, 2023 14:31

- 4c2f45a claire/use-new-asset-backfill-data-serialization
- f3244b4 use named tuple serializer instead of field serializer
- 4ee4935 compare serializable partitions defs ids
- 43ac735 backcompat arg
- 657238d use pendulum datetimes
- 0164b09 clean up
- e3d4f25 adjust to use mapping keyed by asset key
- f855495 add more tests
- 0888b69 remove partitions defs id serialization
- 05a0f5d pr feedback

clairelin135 force-pushed the 11-09-claire/use-new-asset-backfill-data-serialization branch from fc79f17 to 05a0f5d Compare November 22, 2023 22:31

clairelin135 requested a review from sryza November 22, 2023 22:32

sryza reviewed Nov 22, 2023


- 5d44bbf avoid using has_partition_key

sryza approved these changes Nov 27, 2023


clairelin135 merged commit b8ebe0a into master Nov 27, 2023

clairelin135 deleted the 11-09-claire/use-new-asset-backfill-data-serialization branch November 27, 2023 17:58
