[6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes #17844

@clairelin135 clairelin135 commented Nov 8, 2023

This PR makes AssetGraphSubset and AssetBackfillData serializable via whitelist_for_serdes.

This involves the following changes:

  • converts AssetGraphSubset and AssetBackfillData into NamedTuples
    • adds a custom serializer to convert PartitionKeysTimeWindowPartitionsSubset to TimeWindowPartitionsSubset
  • removes the asset graph property from AssetGraphSubset, since it is not serializable. This has a cascading effect:
    • callsites that previously called asset_graph_subset.asset_graph must now have an asset graph passed in
    • previously we could build empty partitions subsets when needed within AssetGraphSubset (e.g. within __or__). This logic must now handle cases where a partitions subset is None
    • the AssetGraphSubset or, and, and sub (|, &, -) operations can no longer operate directly against sets of AssetKeyPartitionKeys, since the asset graph is required to build subsets from these AssetKeyPartitionKeys
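To illustrate the None-handling change described above, here is a minimal sketch with hypothetical stand-ins rather than the Dagster classes: asset keys are plain strings, partitions subsets are frozensets, and `AssetGraphSubsetSketch` is an invented name. It assumes only that, without an asset graph, a missing entry has to be treated as None instead of an empty subset.

```python
from typing import AbstractSet, Mapping, NamedTuple, Optional


class AssetGraphSubsetSketch(NamedTuple):
    """Hypothetical stand-in: string asset keys, frozensets of partition keys."""

    partitions_subsets_by_asset_key: Mapping[str, AbstractSet[str]]
    non_partitioned_asset_keys: AbstractSet[str]

    def get_partitions_subset(self, asset_key: str) -> Optional[AbstractSet[str]]:
        # No asset graph is stored, so there is nothing to build an empty
        # subset from; a missing entry is reported as None.
        return self.partitions_subsets_by_asset_key.get(asset_key)

    def __or__(self, other: "AssetGraphSubsetSketch") -> "AssetGraphSubsetSketch":
        merged = dict(self.partitions_subsets_by_asset_key)
        for key, subset in other.partitions_subsets_by_asset_key.items():
            existing = merged.get(key)
            # Handle the "currently None" case explicitly instead of
            # starting from an empty subset.
            merged[key] = subset if existing is None else existing | subset
        return AssetGraphSubsetSketch(
            merged,
            self.non_partitioned_asset_keys | other.non_partitioned_asset_keys,
        )
```

The operator only accepts another subset object here, which mirrors the last bullet: turning raw asset key / partition key pairs into subsets would need information this object no longer carries.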

@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 4 times, most recently from 5481f79 to 164476f Compare November 9, 2023 01:15
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 239ef13 to d038030 Compare November 9, 2023 18:07
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 164476f to 05c6643 Compare November 9, 2023 18:07
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from d038030 to 65d470f Compare November 9, 2023 18:16
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 3 times, most recently from 87445bb to 5a12ac3 Compare November 10, 2023 00:23
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 65d470f to 662acf8 Compare November 10, 2023 00:24
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 5a12ac3 to 7476390 Compare November 10, 2023 00:51
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 662acf8 to 8afcf46 Compare November 10, 2023 22:39
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 7476390 to d3f559c Compare November 10, 2023 22:39
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 8afcf46 to 71ba29e Compare November 13, 2023 21:12
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 2b1ccab to fcefec8 Compare November 13, 2023 21:48
NamedTuple(
    "_AssetGraphSubset",
    [
        ("partitions_subsets_by_serialized_asset_key", Mapping[str, PartitionsSubset]),
Contributor Author

Unfortunately, JSON cannot serialize dictionaries that aren't keyed by a primitive or a string, so we key by the serialized asset key instead.
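A small sketch of the limitation using only the standard library. `AssetKeySketch` and `try_dump` are hypothetical names for illustration, and serializing the key with `json.dumps` is just one possible string form, not necessarily the one used in this PR.

```python
import json
from typing import NamedTuple, Tuple


class AssetKeySketch(NamedTuple):
    """Hypothetical stand-in for an asset key: just a path of strings."""

    path: Tuple[str, ...]


def try_dump(mapping) -> str:
    """Return the JSON text, or the TypeError message if dumping fails."""
    try:
        return json.dumps(mapping)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Keyed by the object itself: json.dumps rejects the key type.
by_key = {AssetKeySketch(("warehouse", "orders")): ["2023-11-08"]}

# Keyed by a string form of the key: this serializes fine.
by_serialized_key = {json.dumps(list(k.path)): v for k, v in by_key.items()}

print(try_dump(by_key))
print(try_dump(by_serialized_key))
```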

Contributor

Could it make sense to handle that in the custom serializer so it doesn't need to leak into the data model for the class?

Contributor

Also, don't want to increase scope too much, but it could make sense to try to address this at the serialization layer if it's not too big a lift.

Contributor Author

> Also, don't want to increase scope too much, but it could make sense to try to address this at the serialization layer if it's not too big a lift.

I did take a stab at implementing it this way.

  • The properly serialized asset key value, pack_value(asset_key....), is a dictionary, which cannot be used as a key to a dictionary
  • We could add custom logic in the serialization layer to convert an asset key to a string (e.g. asset_key.to_user_string()), but this seems like a pain to deal with if we ever decided to add fields to the AssetKey class
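Both bullets can be seen in a few lines. This is only a sketch: the packed shape returned by `pack_asset_key_sketch` is a made-up illustration, not the actual serdes output, and `to_user_string_sketch` assumes a simple "/"-joined path.

```python
from typing import Any, Dict, Tuple


def pack_asset_key_sketch(path: Tuple[str, ...]) -> Dict[str, Any]:
    # Hypothetical packed form; the real serdes layout may differ.
    return {"__class__": "AssetKey", "path": list(path)}


def to_user_string_sketch(path: Tuple[str, ...]) -> str:
    # Assumed string form: path components joined with "/".
    return "/".join(path)


path = ("warehouse", "orders")
packed = pack_asset_key_sketch(path)

# A dict is unhashable, so it cannot be used as a dictionary key.
try:
    {packed: "subset"}
    dict_key_error = None
except TypeError as exc:
    dict_key_error = str(exc)

# A string form works as a key, but it only encodes the path, so any
# additional field on the key class would be lost.
string_keyed = {to_user_string_sketch(path): "subset"}
```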

Contributor Author

> Could it make sense to handle that in the custom serializer so it doesn't need to leak into the data model for the class?

My initial version actually used this approach. It looked like this:

class PartitionsSubsetByAssetKeySerializer(FieldSerializer):
    """Packs and unpacks a mapping from AssetKey to PartitionsSubset.

    In JSON, a key must be a str, int, float, bool, or None. This serializer packs the AssetKey
    into a str, and unpacks it back into an AssetKey.

    It also converts PartitionKeysTimeWindowPartitionsSubset into serializable TimeWindowPartitionsSubsets.
    """

    def pack(self, mapping: Mapping[AssetKey, Any], **_kwargs) -> Mapping[str, Any]:
        return {
            serialize_value(key): serialize_value(
                value.to_time_window_partitions_subset()
                if isinstance(value, PartitionKeysTimeWindowPartitionsSubset)
                else value
            )
            for key, value in mapping.items()
        }

    def unpack(
        self,
        mapping: Mapping[str, Any],
        **_kwargs,
    ) -> Mapping[AssetKey, Any]:
        return {
            deserialize_value(key, AssetKey): deserialize_value(value, TimeWindowPartitionsSubset)
            for key, value in mapping.items()
        }

I moved away from this implementation because, apart from converting the asset key to a string, it duplicated the existing serialization logic. Keying by the serialized asset key instead seemed like a way to reduce code surface area.

I feel mixed on this though -- I can see either making sense. It certainly is cleaner to not have to convert to/from serialized asset keys.

Contributor

> but this seems like a pain to deal with if we did decide to add additional fields to the AssetKey class

I think that we can be confident that we are not going to do this. I suspect a lot of other places would break as well.

        if partitions_def is None:
            check.failed("Can only call get_partitions_subset on a partitioned asset")
    def get_partitions_subset(
        self, asset_key: AssetKey, asset_graph: Optional[AssetGraph] = None
Contributor Author

Passed in the asset graph because we'd like to get an empty subset instead of None where possible.

The __oper__ callsites do not have access to the asset graph, so this is an optional param.
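A rough sketch of the optional-parameter behavior described here. `AssetGraphSketch` and its `is_partitioned` method are hypothetical stand-ins, the function is written as a free function for brevity, and returning None for an unknown asset is an assumption, not necessarily what the PR does.

```python
from typing import AbstractSet, FrozenSet, Mapping, Optional


class AssetGraphSketch:
    """Hypothetical stand-in that only knows which assets are partitioned."""

    def __init__(self, partitioned_asset_keys: AbstractSet[str]):
        self._partitioned = frozenset(partitioned_asset_keys)

    def is_partitioned(self, asset_key: str) -> bool:
        return asset_key in self._partitioned


def get_partitions_subset(
    subsets: Mapping[str, FrozenSet[str]],
    asset_key: str,
    asset_graph: Optional[AssetGraphSketch] = None,
) -> Optional[FrozenSet[str]]:
    subset = subsets.get(asset_key)
    if subset is not None:
        return subset
    # With a graph we can tell the asset is partitioned and return an
    # empty subset; without one (e.g. inside an operator) we return None.
    if asset_graph is not None and asset_graph.is_partitioned(asset_key):
        return frozenset()
    return None
```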

@clairelin135 clairelin135 marked this pull request as ready for review November 13, 2023 22:59
    def before_pack(self, value: "AssetGraphSubset") -> "AssetGraphSubset":
        converted_partitions_subsets_by_serialized_asset_key = {}
        for k, v in value.partitions_subsets_by_serialized_asset_key.items():
            if isinstance(v, PartitionKeysTimeWindowPartitionsSubset):
Contributor

Would it make sense to put the custom serializer on PartitionKeysTimeWindowPartitionsSubset itself?

Contributor Author

It would be nice to not have to convert these subset objects everywhere they're used, but I think on principle objects that don't cross a serialization boundary should not be decorated with @whitelist_for_serdes.

Ideally we would have some parallel entity that means "convert to Y, then serialize/deserialize Y". Implementing something like that adds scope I'd prefer to avoid at this time.
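As a sketch of the "convert to Y, then serialize Y" idea, with the conversion living in a hook on the containing serializer rather than on the in-memory type: all class and function names below are hypothetical, and the JSON layout is made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass(frozen=True)
class SortedKeysSubset:
    """Hypothetical type that is allowed to cross the serialization boundary."""

    keys: List[str]


@dataclass(frozen=True)
class KeyListSubset:
    """Hypothetical in-memory-only type; it never gets serialized directly."""

    keys: FrozenSet[str]

    def to_serializable(self) -> SortedKeysSubset:
        return SortedKeysSubset(sorted(self.keys))


def before_pack(value: Union[KeyListSubset, SortedKeysSubset]) -> SortedKeysSubset:
    # Convert the in-memory type just before packing; pass others through.
    return value.to_serializable() if isinstance(value, KeyListSubset) else value


def serialize(value: Union[KeyListSubset, SortedKeysSubset]) -> str:
    converted = before_pack(value)
    return json.dumps({"__class__": type(converted).__name__, "keys": converted.keys})
```

Only `SortedKeysSubset` ever appears in the serialized output, so the in-memory type needs no serdes registration of its own.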

@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 71ba29e to 15244b0 Compare November 14, 2023 17:54
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from fcefec8 to e701159 Compare November 14, 2023 17:54
@clairelin135 clairelin135 changed the title [5/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes [6/n subset refactor] Serialize AssetGraphSubset and AssetBackfillData with whitelist_for_serdes Nov 16, 2023
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 8be2bba to 3e699f0 Compare November 16, 2023 20:49
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from baf19c5 to 73a020f Compare November 16, 2023 20:49
@clairelin135
Contributor Author

@sryza This PR has been updated to directly serialize partitions_subsets_by_asset_key, following the changes in part 5 that enable this in the serdes layer.

@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 3e699f0 to 31698f7 Compare November 16, 2023 22:36
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 73a020f to 68d152d Compare November 16, 2023 22:36
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 31698f7 to 39bcd25 Compare November 17, 2023 00:11
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 68d152d to 329f806 Compare November 17, 2023 00:11
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 769d52c to 656cd32 Compare November 20, 2023 23:38
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 329f806 to 6850753 Compare November 20, 2023 23:46
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 2 times, most recently from 793517c to bd4854f Compare November 21, 2023 19:44
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch 2 times, most recently from 7d3db96 to 323344d Compare November 21, 2023 21:10
@clairelin135 clairelin135 requested a review from sryza November 21, 2023 21:11
Base automatically changed from 11-15-enable_serializing_dicts_keyed_by_asset_key to master November 21, 2023 22:26
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 323344d to 66d2810 Compare November 21, 2023 22:50
@clairelin135 clairelin135 force-pushed the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch from 66d2810 to 92b58a5 Compare November 22, 2023 19:58
        self._asset_graph = asset_graph
        self._partitions_subsets_by_asset_key = partitions_subsets_by_asset_key or {}
        self._non_partitioned_asset_keys = non_partitioned_asset_keys or set()
class PartitionsSubsetMappingNamedTupleSerializer(NamedTupleSerializer):
Contributor Author

@sryza I've refactored this serializer to be generalizable to other named tuples that contain partitions subsets mappings. Unfortunately serializers are instantiated on demand rather than at definition time, so we can't easily provide field names to apply this custom logic.

Instead, I've added logic to detect when partitions subsets exist and to convert those accordingly.
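A sketch of what detection by value, rather than by configured field name, could look like. Every name here is a hypothetical stand-in and this is not the serializer from the PR: it scans each NamedTuple field, and converts any mapping that contains an in-memory subset.

```python
from typing import Any, Mapping, NamedTuple, Tuple


class SerializableSubset(NamedTuple):
    """Hypothetical subset type that is safe to serialize."""

    partition_keys: Tuple[str, ...]


class InMemorySubset(NamedTuple):
    """Hypothetical subset type that must be converted before packing."""

    partition_keys: Tuple[str, ...]

    def to_serializable(self) -> SerializableSubset:
        return SerializableSubset(tuple(sorted(self.partition_keys)))


class BackfillSketch(NamedTuple):
    requested: Mapping[str, Any]
    label: str


def convert_subset_mappings(value):
    """Return a copy of a NamedTuple with subset mappings converted."""
    replacements = {}
    for field in value._fields:
        field_value = getattr(value, field)
        # Detect by inspecting values; no field names are configured.
        if isinstance(field_value, Mapping) and any(
            isinstance(v, InMemorySubset) for v in field_value.values()
        ):
            replacements[field] = {
                k: v.to_serializable() if isinstance(v, InMemorySubset) else v
                for k, v in field_value.items()
            }
    return value._replace(**replacements)
```

Because nothing is keyed on field names, the same function applies to any NamedTuple that happens to hold such mappings.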

Contributor

That makes sense

Contributor

This could use a short docstring.

@sryza sryza left a comment

A few small remaining comments. Otherwise LGTM!

    def non_partitioned_asset_keys(self) -> AbstractSet[AssetKey]:
        return self._non_partitioned_asset_keys
@whitelist_for_serdes(serializer=PartitionsSubsetMappingNamedTupleSerializer)
class AssetGraphSubset(
Contributor

Could this just be

@whitelist_for_serdes(...)
class AssetGraphSubset(NamedTuple):
    partitions_subsets_by_asset_key: Mapping[AssetKey, PartitionsSubset]
    non_partitioned_asset_keys: AbstractSet[AssetKey]

with no __new__?
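For reference, a minimal sketch of the two declaration styles being compared. The class and field names are hypothetical stand-ins, not the Dagster classes, and the defaulting in `__new__` is just one example of what a custom constructor might do.

```python
from typing import AbstractSet, Mapping, NamedTuple


class SubsetWithNew(
    NamedTuple(
        "_SubsetWithNew",
        [
            ("partitions", Mapping[str, AbstractSet[str]]),
            ("non_partitioned", AbstractSet[str]),
        ],
    )
):
    # Functional form plus __new__: leaves room for defaults or checks.
    def __new__(cls, partitions=None, non_partitioned=None):
        return super().__new__(cls, partitions or {}, non_partitioned or frozenset())


class SubsetClassSyntax(NamedTuple):
    # Class syntax: shorter, but every field must be passed explicitly
    # and no normalization happens at construction time.
    partitions: Mapping[str, AbstractSet[str]]
    non_partitioned: AbstractSet[str]
```

Both produce the same tuple layout; the difference is only in where construction-time logic can live.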

@clairelin135 clairelin135 merged commit e9af094 into master Nov 22, 2023
@clairelin135 clairelin135 deleted the 11-08-Make_AssetGraphSubset_and_AssetBackfillData_serializable branch November 22, 2023 22:30