[5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys #18057
@@ -0,0 +1,189 @@
import re
This directly copies the AssetKey code over from the other file to avoid circular imports.
        for key, value in dict_val.items()
    }
}

return {
    key: _pack_value(value, whitelist_map, f"{descent_path}.{key}")
Apologies for snooping, but the title of this PR was very exciting, as this has been a huge pain in the past. What happens if _pack_value() is called on both the key and the value in all cases (so we don't need to special-case this asset key thing)?
_pack_value() is a no-op on ints, bools, floats, strs, etc., so this shouldn't have any impact on existing dictionaries (I think), but it would transparently allow fancier objects to get serde'd properly.
When you call _pack_value on an asset key, it serializes the key as a dictionary, and a dictionary cannot be used as the key of another dictionary.
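To make the objection above concrete, here is a minimal sketch of the failure. The shape of `packed_key` is a hypothetical stand-in for whatever `_pack_value` actually emits for an asset key; only the fact that it is a plain dict matters here.

```python
# Hypothetical stand-in for a packed asset key: a plain dict.
# The field names are illustrative, not the real serdes format.
packed_key = {"__class__": "AssetKey", "path": ["my", "asset"]}


def try_use_as_key(key):
    """Return the TypeError message raised when `key` is used as a dict key, else None."""
    try:
        {key: "value"}
    except TypeError as exc:
        return str(exc)
    return None


# Dicts are mutable and therefore unhashable, so this reports an error.
print(try_use_as_key(packed_key))

# Scalars are hashable, so they work as keys and no error is reported.
print(try_use_as_key("my/asset"))
```

So packing both keys and values uniformly would be harmless for scalar keys, but would break as soon as a key packs to a dict.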
I can help you come up with some benchmarks to verify, but I suspect that doing these type checks against all the keys all the time is too costly.
Maybe we can cook up a special type (AssetMap[TVal] or something) that implements a dict-like interface, and then special-case serdes against that?

dict_val = cast(dict, val)

if dict_val and all(type(key) is AssetKey for key in dict_val.keys()):
I suspect this is a potentially rough perf regression (see the related comment on 675).
Not the end of the world, but I think this would make our codebase less grokkable. A lot of our core entities are these serializable NamedTuples, and someone ramping up and looking at them needs to wonder what the difference is between and
I think you could limit the only reference to creating the instance in the body of
from dagster._core.events import AssetKey

return {
    AssetKey.from_user_string(key): _unpack_value(value, whitelist_map, context)
to_user_string / from_user_string doesn't correctly round-trip multipart asset keys that contain / in them:
>>> from dagster import AssetKey
>>> a = AssetKey(['this/sucks', 'a////bit'])
>>> a
AssetKey(['this/sucks', 'a////bit'])
>>> a.to_user_string()
'this/sucks/a////bit'
>>> b = AssetKey.from_user_string(a.to_user_string())
>>> b
AssetKey(['this', 'sucks', 'a', '', '', '', 'bit'])
>>> a == b
False
Ahh, this is a good catch. Initially I picked to_user_string because I thought it would likely be more performant than using JSON, but it's probably a better idea to call asset_key.to_string() instead.
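The difference between the two encodings can be sketched without importing dagster. The helper names below are hypothetical and operate on plain path lists; they only assume that the user-string form joins path components with `/`, as the REPL session above shows, and that a JSON-based form encodes the path list itself.

```python
import json
from typing import List


def path_to_user_string(path: List[str]) -> str:
    # Mirrors the behavior seen above: components joined with "/".
    return "/".join(path)


def path_from_user_string(s: str) -> List[str]:
    # Splitting on "/" cannot tell separators from slashes inside a component.
    return s.split("/")


def path_to_json(path: List[str]) -> str:
    # JSON keeps component boundaries explicit, so embedded slashes are safe.
    return json.dumps(path)


def path_from_json(s: str) -> List[str]:
    return json.loads(s)


path = ["this/sucks", "a////bit"]

lossy = path_from_user_string(path_to_user_string(path))
exact = path_from_json(path_to_json(path))

print(lossy == path)  # the slash-joined form does not round-trip
print(exact == path)  # the JSON form does
```

The JSON string is still a plain `str`, so it remains usable as a dictionary key in the serialized output.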
This PR has been updated to contain a custom I'm still looking, but haven't found an easy way to get the annotated key type of a mapping.
I'd ideally like to end up in a world where we don't have
I think this is ok, if we switched to using dataclasses we could always implement the
We do this when inferring Dagster types, right?
From a quick benchmark, performing the type check takes about 5% of the time that it takes to serialize the dict.

import datetime
import json
import random
import string
from typing import Mapping

from dagster import AssetKey


def random_string() -> str:
    return "".join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))


def make_asset_key_dict() -> Mapping[AssetKey, str]:
    return {AssetKey(random_string()): random_string() for _ in range(10000)}


asset_key_dicts = [make_asset_key_dict() for _ in range(10)]

# Time the per-key type check on its own.
start = datetime.datetime.now()
for d in asset_key_dicts:
    all(type(k) == AssetKey for k in d.keys())
end = datetime.datetime.now()
print(end - start)

# Time serializing the same dicts, with keys stringified first.
start = datetime.datetime.now()
for d in asset_key_dicts:
    json.dumps({str(k): v for k, v in d.items()})
end = datetime.datetime.now()
print(end - start)

Output:
From a quick benchmark, performing the type check takes about 5% of the time that it takes to serialize the dict.
So the performance concern with the type check is about serializing all the other, non-AssetKey-keyed dicts flowing through, and the extra cost of checking whether they should get the special non-scalar key handling.
With that focus, making the target not a dict via this custom type AssetKeyMap scheme works well, as it incurs ~0 cost to the regular dicts.
If we are not feeling this marker type approach, we could:
- Get cute / clever with the code to minimize the impact on the scalar-key dicts. Checking only one key is one flavor of this, but it risks some weird errors if someone tries to flow a heterogeneously keyed structure through.
- Use the type hint information. It should be possible; the get_origin / get_args APIs for poking at generics are a little goofy, but we've made it work elsewhere. We would still want to benchmark and tweak how / when these checks fire to minimize their cost.
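For the type hint option in the list above, a minimal sketch of reading the annotated key type of a mapping with `typing.get_origin` and `typing.get_args` might look like this. `FakeAssetKey` and `mapping_key_type` are hypothetical names used only for illustration, and this inspects annotations rather than runtime values, so it says nothing about the cost of wiring it into the real serdes path.

```python
import collections.abc
from typing import Any, Dict, List, Mapping, NamedTuple, Optional, Tuple, get_args, get_origin


class FakeAssetKey(NamedTuple):
    """Hypothetical stand-in for a non-scalar key type."""

    path: Tuple[str, ...]


def mapping_key_type(annotation: Any) -> Optional[Any]:
    """Return the annotated key type of a mapping annotation, or None if not applicable."""
    origin = get_origin(annotation)
    # Only parameterized mapping annotations carry a key type.
    if not (isinstance(origin, type) and issubclass(origin, collections.abc.Mapping)):
        return None
    args = get_args(annotation)
    return args[0] if args else None


print(mapping_key_type(Mapping[FakeAssetKey, str]))
print(mapping_key_type(Dict[str, int]))
print(mapping_key_type(List[int]))
```

Because the lookup depends only on the annotation, its result could in principle be computed once per field rather than once per serialized value.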
@@ -681,10 +708,18 @@ def _pack_value(
            _pack_value(item, whitelist_map, f"{descent_path}[{idx}]")
            for idx, item in enumerate(cast(list, val))
        ]
    if tval is AssetKeyMap:
nit: you can double-check this in the benchmark, but moving this if check further down may be measurably beneficial, since the condition is rarer.
_V = TypeVar("_V")


class AssetKeyMap(Mapping["AssetKey", _V]):
It's not much of a lift to support any serdes-able keys instead of just AssetKey keys.
I was thinking that instead of the to_string-ed key -> value approach currently used at [1], we could pack them in the style of the items API and do something like {'__map_items__': [[_pack(k), _pack(v)] for k, v in map.items()]}.
This idea applies even if we change the detection scheme.
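The items-style packing suggested above can be sketched end to end with stand-in helpers. Everything here is hypothetical: `FakeKey`, `pack`, `unpack` and the `__fake_key__` marker are illustrative placeholders for the real serdes machinery, and only the `__map_items__` shape comes from the comment.

```python
import json
from typing import Any, Dict, Mapping, NamedTuple, Tuple


class FakeKey(NamedTuple):
    """Hypothetical serdes-able, non-scalar key."""

    path: Tuple[str, ...]


def pack(value: Any) -> Any:
    # Stand-in for a recursive pack function: non-scalar keys become dicts.
    if isinstance(value, FakeKey):
        return {"__fake_key__": list(value.path)}
    return value


def unpack(value: Any) -> Any:
    if isinstance(value, dict) and "__fake_key__" in value:
        return FakeKey(tuple(value["__fake_key__"]))
    return value


def pack_mapping(mapping: Mapping[Any, Any]) -> Dict[str, Any]:
    # Keys move into list positions, so they no longer need to be JSON object keys.
    return {"__map_items__": [[pack(k), pack(v)] for k, v in mapping.items()]}


def unpack_mapping(packed: Dict[str, Any]) -> Dict[Any, Any]:
    return {unpack(k): unpack(v) for k, v in packed["__map_items__"]}


original = {FakeKey(("this/sucks", "a////bit")): "x", FakeKey(("b",)): "y"}
wire = json.dumps(pack_mapping(original))
restored = unpack_mapping(json.loads(wire))
print(restored == original)
```

Since each key is packed like any other value, this shape does not depend on a string encoding of the key, which sidesteps the `/` round-trip problem raised earlier in the thread.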
if tval is AssetKeyMap:
    return {
        "__asset_key_map__": {
            key.to_string(): _pack_value(value, whitelist_map, f"{descent_path}.{key}")
[1]
I'd expect it to be much faster in those cases, because we'd only check the first key and notice that it's not an
I'm still concerned about the
Right, agreed that the keys check failing should be faster than the keys check passing, but even then this code path is hot enough that adding that key check is impactful. From this internal discussion around benchmarking
Requested changes addressed: enabled serializing mappings with generic non-scalar keys.
"""Wrapper class for non-scalar key mappings, used to performantly type check when serializing
without having to access types of specific keys.
without having to access types of specific keys.
nit: maybe something like
"without impacting the performance of serializing the more common scalar key dicts. May be replaceable with a different clever scheme"
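The idea behind the wrapper class discussed above can be sketched as follows. `NonScalarKeyMapping` and `needs_item_packing` are hypothetical names, not the PR's actual class or function; the point is only that detection becomes a single identity check on the container's type instead of a scan over its keys.

```python
from typing import Any, Dict, Iterator, Mapping, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class NonScalarKeyMapping(Mapping[K, V]):
    """Hypothetical read-only wrapper marking a mapping as having non-scalar keys."""

    def __init__(self, mapping: Mapping[K, V]) -> None:
        self._mapping: Dict[K, V] = dict(mapping)

    def __getitem__(self, key: K) -> V:
        return self._mapping[key]

    def __iter__(self) -> Iterator[K]:
        return iter(self._mapping)

    def __len__(self) -> int:
        return len(self._mapping)


def needs_item_packing(val: Any) -> bool:
    # One type comparison per value; plain dicts never have their keys inspected.
    return type(val) is NonScalarKeyMapping


plain: Dict[str, int] = {"a": 1}
wrapped: NonScalarKeyMapping[Tuple[str, ...], int] = NonScalarKeyMapping({("a", "b"): 1})

print(needs_item_packing(plain))
print(needs_item_packing(wrapped))
```

The cost is that callers must opt in by wrapping their mapping, which is the grokkability trade-off raised elsewhere in this review.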
@@ -686,6 +711,16 @@ def _pack_value(
            key: _pack_value(value, whitelist_map, f"{descent_path}.{key}")
            for key, value in cast(dict, val).items()
        }
    if tval is SerializableNonScalarKeyMapping:
        return {
            "__non_scalar_key_mapping_items__": [
nit: maybe something a little more terse like __mapping_items__
This PR enables serializing mappings with non-scalar keys, intended to be used for serializing mappings keyed by asset key. Previously this was not possible because seven.json.dumps can only serialize dictionaries keyed by a primitive or a string. Without this change, we would need a custom FieldSerializer when serializing any dict keyed by a non-scalar.
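The limitation described above can be reproduced with the standard library's `json` module. This sketch assumes `seven.json.dumps` behaves the same way for dictionary keys; `can_dump` is a hypothetical helper, and a tuple stands in for a non-scalar key such as an asset key.

```python
import json
from typing import Any, Dict


def can_dump(d: Dict[Any, Any]) -> bool:
    """Return True if json.dumps accepts the dict, False if it rejects the keys."""
    try:
        json.dumps(d)
    except TypeError:
        return False
    return True


# String and primitive keys are accepted (primitives are coerced to strings).
print(can_dump({"a": 1}))
print(can_dump({1: "a", True: "b", None: "c"}))

# A tuple key, standing in for a non-scalar key, is rejected.
print(can_dump({("my", "asset"): 1}))
```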