[6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes #17844

clairelin135 · 2023-11-08T23:05:45Z

This PR makes AssetGraphSubset and AssetBackfillData serializable via whitelist_for_serdes.

This involves the following changes:

- converts AssetGraphSubset and AssetBackfillData into NamedTuples
  - adds a custom serializer to convert PartitionKeysTimeWindowPartitionsSubset to TimeWindowPartitionsSubset
- removes the asset graph property from AssetGraphSubset, as it is not serializable. This has a cascading effect:
  - callsites that previously called asset_graph_subset.asset_graph must instead have an asset graph passed in
  - previously we could build empty partitions subsets when needed within AssetGraphSubset (e.g. within __or__). This logic must now handle cases where a partitions subset is None
  - the AssetGraphSubset or, and, and sub (|, &, -) operations can no longer operate directly against sets of AssetKeyPartitionKeys, since the asset graph is required to build subsets from these AssetKeyPartitionKeys
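The second group of changes can be illustrated with a minimal sketch. ToySubset below is a hypothetical stand-in, not the actual AssetGraphSubset: it is a plain NamedTuple with no asset graph reference, so its `|` operator cannot build an empty subset and has to treat a missing entry as None.

```python
from typing import AbstractSet, Mapping, NamedTuple, Optional


class ToySubset(NamedTuple):
    """Hypothetical stand-in for AssetGraphSubset: plain data, no asset graph."""

    partition_keys_by_asset: Mapping[str, AbstractSet[str]]
    non_partitioned_assets: AbstractSet[str]

    def get(self, asset: str) -> Optional[AbstractSet[str]]:
        # Without an asset graph there is no way to build an empty subset,
        # so an asset that is absent from the mapping yields None.
        return self.partition_keys_by_asset.get(asset)

    def __or__(self, other: "ToySubset") -> "ToySubset":
        merged = dict(self.partition_keys_by_asset)
        for asset, keys in other.partition_keys_by_asset.items():
            existing = merged.get(asset)
            # Handle the None case explicitly instead of creating an empty subset.
            merged[asset] = keys if existing is None else existing | keys
        return ToySubset(
            merged, self.non_partitioned_assets | other.non_partitioned_assets
        )


a = ToySubset({"daily": frozenset({"2023-11-01"})}, frozenset({"raw"}))
b = ToySubset(
    {"daily": frozenset({"2023-11-02"}), "hourly": frozenset({"2023-11-01-00:00"})},
    frozenset(),
)
combined = a | b
```

Both operands here are already subsets. Combining with a raw set of (asset, partition key) pairs would first need something that knows each asset's partitioning, which is the role the asset graph plays in the description above.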

clairelin135 · 2023-11-08T23:06:00Z

Current dependencies on/for this PR:

master
- PR [1/n subset refactor] Split TimeWindowPartitionsSubset #17684
  - PR [2/n subset refactor] Make TimeWindowPartitionsDefinition serializable #17660
    - PR [3/n subset refactor] Add whitelist_for_serdes to TimeWindowPartitionsSubset #17702
      - PR [4/n subset refactor] Add whitelist_for_serdes to DefaultPartitionsSubset #17703
        
        PR [5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys #18057
        
        PR [6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes #17844 👈
        
        PR [7/n subset refactor] Use new asset backfill data serialization format #17929

This stack of pull requests is managed by Graphite.

clairelin135 · 2023-11-13T22:50:29Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

+    NamedTuple(
+        "_AssetGraphSubset",
+        [
+            ("partitions_subsets_by_serialized_asset_key", Mapping[str, PartitionsSubset]),


Unfortunately, JSON cannot serialize dictionaries whose keys are not strings or other primitives, so we key by the serialized asset key instead.

Could it make sense to handle that in the custom serializer so it doesn't need to leak into the data model for the class?

Also, don't want to increase scope too much, but it could make sense to try to address this at the serialization layer if it's not too big a lift.

> Also, don't want to increase scope too much, but it could make sense to try to address this at the serialization layer if it's not too big a lift.

I did take a stab at implementing it this way.

The properly serialized asset key, i.e. the pack_value(asset_key....) value, is a dictionary, which cannot be used as a dictionary key.

We could add custom logic in the serialization layer to convert an asset key to a string (i.e. asset_key.to_user_string()), but this seems like a pain to deal with if we did decide to add additional fields to the AssetKey class.
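The constraint under discussion can be reproduced with the standard library alone. This is a minimal sketch: the packed_key layout below is hypothetical and only stands in for whatever pack_value produces for an asset key.

```python
import json

# Hypothetical packed form of an asset key; the real packed layout may differ.
packed_key = {"__class__": "AssetKey", "path": ["my", "asset"]}

# 1. A dict is unhashable, so it cannot serve as a dictionary key in Python.
try:
    {packed_key: ["2023-11-01"]}
    dict_key_error = None
except TypeError as exc:
    dict_key_error = str(exc)

# 2. A hashable but non-scalar key (a tuple) builds fine, yet json.dumps rejects it.
try:
    json.dumps({("my", "asset"): ["2023-11-01"]})
    tuple_key_error = None
except TypeError as exc:
    tuple_key_error = str(exc)

# 3. Workaround: key the mapping by a string form of the key.
string_key = json.dumps(packed_key, sort_keys=True)
payload = json.dumps({string_key: ["2023-11-01"]})

# Round trip: parse the outer mapping, then parse each key back into its packed form.
loaded = json.loads(payload)
restored = [(json.loads(key), value) for key, value in loaded.items()]
```

The string-key workaround is what keying by the serialized asset key amounts to. The cost is that the string form shows up wherever the mapping is read, unless the conversion is hidden inside the serializer.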

> Could it make sense to handle that in the custom serializer so it doesn't need to leak into the data model for the class?

My initial version actually used this approach. It looked like this:

dagster/python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

Lines 42 to 69 in 79652f9

class PartitionsSubsetByAssetKeySerializer(FieldSerializer):
    """Packs and unpacks a mapping from AssetKey to PartitionsSubset.

    In JSON, a key must be a str, int, float, bool, or None. This serializer packs the AssetKey
    into a str, and unpacks it back into an AssetKey.

    It also converts PartitionKeysTimeWindowPartitionsSubset into serializable TimeWindowPartitionsSubsets.
    """

    def pack(self, mapping: Mapping[AssetKey, Any], **_kwargs) -> Mapping[str, Any]:
        return {
            serialize_value(key): serialize_value(
                value.to_time_window_partitions_subset()
                if isinstance(value, PartitionKeysTimeWindowPartitionsSubset)
                else value
            )
            for key, value in mapping.items()
        }

    def unpack(
        self,
        mapping: Mapping[str, Any],
        **_kwargs,
    ) -> Mapping[AssetKey, Any]:
        return {
            deserialize_value(key, AssetKey): deserialize_value(value, TimeWindowPartitionsSubset)
            for key, value in mapping.items()
        }

I moved away from this implementation because it felt like a duplicate of the existing logic, with the exception of converting the asset key to a string, so maybe it would be better to reduce code surface area by just converting the asset keys to serialized form.

I feel mixed on this though -- I can see either making sense. It certainly is cleaner to not have to convert to/from serialized asset keys.

> but this seems like a pain to deal with if we did decide to add additional fields to the AssetKey class

I think that we can be confident that we are not going to do this. I suspect a lot of other places would break as well.

clairelin135 · 2023-11-13T22:53:12Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

-        if partitions_def is None:
-            check.failed("Can only call get_partitions_subset on a partitioned asset")
+    def get_partitions_subset(
+        self, asset_key: AssetKey, asset_graph: Optional[AssetGraph] = None


The asset graph is passed in because we'd like to get an empty subset instead of None where possible.

The __oper__ callsite does not have access to the asset graph, so this is an optional param.
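A minimal sketch of this optional-parameter pattern, using hypothetical stand-ins (ToyAssetGraph and a free function) rather than the real AssetGraph and method; how the actual implementation treats unpartitioned assets is not shown in this hunk and is not modeled here.

```python
from typing import FrozenSet, Mapping, Optional


class ToyAssetGraph:
    """Hypothetical stand-in for AssetGraph: only knows which assets exist."""

    def __init__(self, asset_keys: FrozenSet[str]) -> None:
        self.asset_keys = asset_keys


def get_partitions_subset(
    subsets: Mapping[str, FrozenSet[str]],
    asset_key: str,
    asset_graph: Optional[ToyAssetGraph] = None,
) -> Optional[FrozenSet[str]]:
    subset = subsets.get(asset_key)
    if subset is not None:
        return subset
    if asset_graph is not None and asset_key in asset_graph.asset_keys:
        # With a graph available, an empty subset can be built for a known asset.
        return frozenset()
    # Without a graph (e.g. inside an operator helper), None is the only option.
    return None


subsets = {"daily": frozenset({"2023-11-01"})}
graph = ToyAssetGraph(frozenset({"daily", "hourly"}))

with_graph = get_partitions_subset(subsets, "hourly", graph)
without_graph = get_partitions_subset(subsets, "hourly")
```

Callers that hold a graph get a value they can use without a None check; callers without one must handle None themselves.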

sryza · 2023-11-13T23:49:59Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

+    def before_pack(self, value: "AssetGraphSubset") -> "AssetGraphSubset":
+        converted_partitions_subsets_by_serialized_asset_key = {}
+        for k, v in value.partitions_subsets_by_serialized_asset_key.items():
+            if isinstance(v, PartitionKeysTimeWindowPartitionsSubset):


Would it make sense to put the custom serializer on PartitionKeysTimeWindowPartitionsSubset itself?

It would be nice to not have to convert these subset objects everywhere they're used, but I think on principle objects that don't cross a serialization boundary should not be decorated with @whitelist_for_serdes.

Ideally we would have some parallel entity that means "convert to Y, then serialize/deserialize Y". Implementing something like that adds scope I'd prefer to avoid at this time.

sryza · 2023-11-13T23:50:51Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

+    NamedTuple(
+        "_AssetGraphSubset",
+        [
+            ("partitions_subsets_by_serialized_asset_key", Mapping[str, PartitionsSubset]),


Could it make sense to handle that in the custom serializer so it doesn't need to leak into the data model for the class?

clairelin135 · 2023-11-16T21:27:06Z

@sryza This PR has been updated to directly serialize partitions_subsets_by_asset_key following the changes in part 5 that enable this in the serdes layer

clairelin135 · 2023-11-22T20:03:27Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

-        self._asset_graph = asset_graph
-        self._partitions_subsets_by_asset_key = partitions_subsets_by_asset_key or {}
-        self._non_partitioned_asset_keys = non_partitioned_asset_keys or set()
+class PartitionsSubsetMappingNamedTupleSerializer(NamedTupleSerializer):


@sryza I've refactored this serializer to be generalizable to other named tuples that contain partitions subsets mappings. Unfortunately serializers are instantiated on demand rather than at definition time, so we can't easily provide field names to apply this custom logic.

Instead, I've added logic to detect when partitions subsets exist and convert them accordingly.
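The detection approach can be sketched as follows. Every name here is hypothetical (SetBackedSubset, ListBackedSubset, convert_subset_mappings, Holder); the sketch only shows the idea of scanning a named tuple's fields by value type rather than by field name, and is not the actual PartitionsSubsetMappingNamedTupleSerializer.

```python
from typing import Any, Dict, FrozenSet, Mapping, NamedTuple, Tuple


class ListBackedSubset(NamedTuple):
    """Hypothetical serializable form: an ordered tuple of partition keys."""

    keys: Tuple[str, ...]


class SetBackedSubset(NamedTuple):
    """Hypothetical in-memory form that should be converted before packing."""

    keys: FrozenSet[str]

    def to_list_backed(self) -> ListBackedSubset:
        return ListBackedSubset(tuple(sorted(self.keys)))


def convert_subset_mappings(value: Any) -> Dict[str, Any]:
    """Scan every field of a named tuple and convert subset mappings by value type."""
    converted: Dict[str, Any] = {}
    for field_name, field_value in value._asdict().items():
        # Detection is by value type, so no field names need to be configured.
        if isinstance(field_value, Mapping) and any(
            isinstance(v, SetBackedSubset) for v in field_value.values()
        ):
            field_value = {
                k: v.to_list_backed() if isinstance(v, SetBackedSubset) else v
                for k, v in field_value.items()
            }
        converted[field_name] = field_value
    return converted


class Holder(NamedTuple):
    """Hypothetical named tuple that contains a subsets mapping."""

    subsets: Mapping[str, Any]
    names: Tuple[str, ...]


result = convert_subset_mappings(
    Holder({"daily": SetBackedSubset(frozenset({"b", "a"}))}, ("x",))
)
```

Because nothing is keyed on field names, the same function applies to any named tuple that happens to hold such a mapping, at the cost of inspecting every field on each pack.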

That makes sense

This could use a short docstring.

sryza

A few small remaining comments. Otherwise LGTM!

sryza · 2023-11-22T20:30:41Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

-    def non_partitioned_asset_keys(self) -> AbstractSet[AssetKey]:
-        return self._non_partitioned_asset_keys
+@whitelist_for_serdes(serializer=PartitionsSubsetMappingNamedTupleSerializer)
+class AssetGraphSubset(


Could this just be

    @whitelist_for_serdes(...)
    class AssetGraphSubset(NamedTuple):
        partitions_subsets_by_asset_key: Mapping[AssetKey, PartitionsSubset]
        non_partitioned_asset_keys: AbstractSet[AssetKey]

with no __new__?

sryza · 2023-11-22T20:31:14Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

-        self._asset_graph = asset_graph
-        self._partitions_subsets_by_asset_key = partitions_subsets_by_asset_key or {}
-        self._non_partitioned_asset_keys = non_partitioned_asset_keys or set()
+class PartitionsSubsetMappingNamedTupleSerializer(NamedTupleSerializer):


That makes sense

sryza · 2023-11-22T20:32:20Z

python_modules/dagster/dagster/_core/definitions/asset_graph_subset.py

-        self._asset_graph = asset_graph
-        self._partitions_subsets_by_asset_key = partitions_subsets_by_asset_key or {}
-        self._non_partitioned_asset_keys = non_partitioned_asset_keys or set()
+class PartitionsSubsetMappingNamedTupleSerializer(NamedTupleSerializer):


This could use a short docstring.

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 4 times, most recently from 5481f79 to 164476f Compare November 9, 2023 01:15

clairelin135 force-pushed the claire/default-subset-serialization branch from 239ef13 to d038030 Compare November 9, 2023 18:07

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 164476f to 05c6643 Compare November 9, 2023 18:07

clairelin135 force-pushed the claire/default-subset-serialization branch from d038030 to 65d470f Compare November 9, 2023 18:16

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 3 times, most recently from 87445bb to 5a12ac3 Compare November 10, 2023 00:23

clairelin135 force-pushed the claire/default-subset-serialization branch from 65d470f to 662acf8 Compare November 10, 2023 00:24

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 5a12ac3 to 7476390 Compare November 10, 2023 00:51

clairelin135 force-pushed the claire/default-subset-serialization branch from 662acf8 to 8afcf46 Compare November 10, 2023 22:39

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 7476390 to d3f559c Compare November 10, 2023 22:39

clairelin135 mentioned this pull request Nov 11, 2023

[7/n subset refactor] Use new asset backfill data serialization format #17929

Merged

clairelin135 force-pushed the claire/default-subset-serialization branch from 8afcf46 to 71ba29e Compare November 13, 2023 21:12

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 2b1ccab to fcefec8 Compare November 13, 2023 21:48

clairelin135 commented Nov 13, 2023

clairelin135 requested review from sryza and OwenKephart November 13, 2023 22:59

clairelin135 marked this pull request as ready for review November 13, 2023 22:59

sryza reviewed Nov 13, 2023

clairelin135 force-pushed the claire/default-subset-serialization branch from 71ba29e to 15244b0 Compare November 14, 2023 17:54

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from fcefec8 to e701159 Compare November 14, 2023 17:54

clairelin135 changed the title from "[5/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes" to "[6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes" Nov 16, 2023

clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 8be2bba to 3e699f0 Compare November 16, 2023 20:49

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from baf19c5 to 73a020f Compare November 16, 2023 20:49

clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 3e699f0 to 31698f7 Compare November 16, 2023 22:36

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 73a020f to 68d152d Compare November 16, 2023 22:36

clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 31698f7 to 39bcd25 Compare November 17, 2023 00:11

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 68d152d to 329f806 Compare November 17, 2023 00:11

clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 769d52c to 656cd32 Compare November 20, 2023 23:38

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 329f806 to 6850753 Compare November 20, 2023 23:46

clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 793517c to bd4854f Compare November 21, 2023 19:44

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 7d3db96 to 323344d Compare November 21, 2023 21:10

clairelin135 requested a review from sryza November 21, 2023 21:11

Base automatically changed from 11-15-enable_serializing_dicts_keyed_by_asset_key to master November 21, 2023 22:26

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 323344d to 66d2810 Compare November 21, 2023 22:50

clairelin135 added 5 commits November 22, 2023 11:58

Make AssetGraphSubset and AssetBackfillData serializable

5651cab

asset graph subset changes

907d52c

clean up

1ca9882

key by asset key instead

903f0e2

make serializer generalizable

92b58a5

clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 66d2810 to 92b58a5 Compare November 22, 2023 19:58

clairelin135 commented Nov 22, 2023

sryza approved these changes Nov 22, 2023

pr feedback

07d1435

clairelin135 merged commit e9af094 into master Nov 22, 2023

clairelin135 deleted the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch November 22, 2023 22:30
