Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys #18057

Merged

Conversation

clairelin135
Copy link
Contributor

@clairelin135 clairelin135 commented Nov 16, 2023

This PR enables serializing mappings with non-scalar keys, intended to be used for serializing mappings keyed by asset key. Previously this was not possible because seven.json.dumps can only serialize dictionaries keyed by a primitive or a string. Without this change, we would need a custom FieldSerializer when serializing any dict keyed by a non-scalar.

@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from dbfc898 to a5942f3 Compare November 16, 2023 18:43
@clairelin135 clairelin135 changed the base branch from claire/default-subset-serialization to master November 16, 2023 18:43
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch 3 times, most recently from b5792a1 to 59f6dbf Compare November 16, 2023 18:57
@@ -0,0 +1,189 @@
import re
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly copies over AssetKey code from the other file to avoid circular imports

@clairelin135 clairelin135 changed the base branch from master to claire/default-subset-serialization November 16, 2023 19:25
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 59f6dbf to deb962a Compare November 16, 2023 19:25
@clairelin135 clairelin135 changed the title [4.5/n subset refactor] Enable serializing dicts keyed by asset key [5/n subset refactor] Enable serializing dicts keyed by asset key Nov 16, 2023
for key, value in dict_val.items()
}
}

return {
key: _pack_value(value, whitelist_map, f"{descent_path}.{key}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies for snooping, but the title of this PR was very exciting as this has been a huge pain in the past -- but what happens if _pack_value() is called on both the key and the value in all cases (so we don't need to special-case this asset key thing)?

_pack_value() is a no-op on ints, bools, floats, strs, etc., and so this shouldn't have any impact on existing dictionaries (I think), but transparently allows fancier objects to get serde'd properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you call _pack_value on an asset key, it serializes it as a dictionary, which cannot exist as the key of a dictionary

@clairelin135 clairelin135 changed the title [5/n subset refactor] Enable serializing dicts keyed by asset key [5/n subset refactor] [serdes] Enable serializing dicts keyed by asset key Nov 16, 2023
@clairelin135 clairelin135 marked this pull request as ready for review November 16, 2023 20:38
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from deb962a to 8be2bba Compare November 16, 2023 20:48
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 78a91dd to e6c5ca9 Compare November 16, 2023 20:49
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 8be2bba to 3e699f0 Compare November 16, 2023 20:49
Copy link
Member

@alangenfeld alangenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can help you come up with some benchmarks to verify, but suspect doing these type checks against all the keys all the time is too costly.

Maybe we can cook up a special type (AssetMap[TVal] or something) that implements a dict like interface and then special case serdes against that?


dict_val = cast(dict, val)

if dict_val and all(type(key) is AssetKey for key in dict_val.keys()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suspect this is a potentially rough perf regression (related comment on 675)

@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from e6c5ca9 to 66d7ba9 Compare November 16, 2023 22:36
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 3e699f0 to 31698f7 Compare November 16, 2023 22:36
@clairelin135 clairelin135 force-pushed the claire/default-subset-serialization branch from 66d7ba9 to 3195b7e Compare November 16, 2023 23:08
Base automatically changed from claire/default-subset-serialization to master November 16, 2023 23:37
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 31698f7 to 39bcd25 Compare November 17, 2023 00:11
@sryza
Copy link
Contributor

sryza commented Nov 20, 2023

the idea is that the custom type implements Mapping[AssetKey, TVal] so the only places that would have to know about the new type would be the decl of the serializable NamedTuples, any callsites would just treat it as the interface Mapping[AssetKey, Tval]

Not the end of the world, but I think this would make our codebase less grokkable. A lot of our core entities are these serializable NamedTuples, and someone ramping up and looking at them needs to wonder what the difference is between and AssetMap and a Mapping[AssetKey, ...].

@alangenfeld
Copy link
Member

I think you could limit the only reference to creating the instance in the body of __new__ (with a comment). The typehints on NamedTuple and __new__ could stay Mapping

from dagster._core.events import AssetKey

return {
AssetKey.from_user_string(key): _unpack_value(value, whitelist_map, context)
Copy link
Member

@alangenfeld alangenfeld Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_user_string / from_user_string doesn't correctly roundtrip multipart asset keys that contain / in them

>>> from dagster import AssetKey
>>> a = AssetKey(['this/sucks', 'a////bit'])
>>> a
AssetKey(['this/sucks', 'a////bit'])
>>> a.to_user_string()
'this/sucks/a////bit'
>>> b = AssetKey.from_user_string(a.to_user_string())
>>> b
AssetKey(['this', 'sucks', 'a', '', '', '', 'bit'])
>>> a == b
False

Copy link
Contributor Author

@clairelin135 clairelin135 Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh this is a good catch. Initially I picked to_user_string because I thought it would likely be more performant than using JSON, but probably a good idea to instead call asset_key.to_string()

@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 39bcd25 to 769d52c Compare November 20, 2023 23:29
Copy link

github-actions bot commented Nov 20, 2023

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-3c1y26zsp-elementl.vercel.app
https://11-15-enable-serializing-dicts-keyed-by-asset-key.core-storybook.dagster-docs.io

Built with commit 793517c.
This pull request is being automatically deployed with vercel-action

@clairelin135
Copy link
Contributor Author

This PR has been updated to contain a custom AssetKeyMap object. Since it subclasses Mapping, we indeed can just call this object within __new__ so we don't need to change the field types to use AssetKeyMap. Interested in thoughts so far.

I'm still looking, but haven't found an easy way to get the annotated key type of a mapping.

@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 769d52c to 656cd32 Compare November 20, 2023 23:38
@sryza
Copy link
Contributor

sryza commented Nov 20, 2023

This PR has been updated to contain a custom AssetKeyMap object. Since it subclasses Mapping, we indeed can just call this object within new so we don't need to change the field types to use AssetKeyMap. Interested in thoughts so far.

I'd ideally like to end up in a world where we don't have __new__, using dataclasses.

@clairelin135
Copy link
Contributor Author

I'd ideally like to end up in a world where we don't have new, using dataclasses.

I think this is ok, if we switched to using dataclasses we could always implement the __post_init to enforce that the mapping is an AssetKeyMap

@sryza
Copy link
Contributor

sryza commented Nov 21, 2023

I'm still looking, but haven't found an easy way to get the annotated key type of a mapping.

We do this when inferring Dagster types, right?

@sryza
Copy link
Contributor

sryza commented Nov 21, 2023

From a quick benchmark, performing the type check takes about 5% of the time that it takes to serialize the dict.

import datetime
import json
import random
import string
from typing import Mapping

from dagster import AssetKey


def random_string() -> str:
    return "".join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))


def make_asset_key_dict() -> Mapping[AssetKey, str]:
    return {AssetKey(random_string()): random_string() for _ in range(10000)}


asset_key_dicts = [make_asset_key_dict() for _ in range(10)]

start = datetime.datetime.now()
for d in asset_key_dicts:
    all(type(k) == AssetKey for k in d.keys())

end = datetime.datetime.now()

print(end - start)


start = datetime.datetime.now()
for d in asset_key_dicts:
    json.dumps({str(k): v for k, v in d.items()})

end = datetime.datetime.now()
print(end - start)

Output:

0:00:00.006121
0:00:00.105415

Copy link
Member

@alangenfeld alangenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick benchmark, performing the type check takes about 5% of the time that it takes to serialize the dict.

So the performance concern with the type check is in the context of serializing all the other not AssetKey-keyed dict s flowing through and the extra cost on the check to see if they should get the special non scalar key handling.

With that focus, making the target not a dict via this custom type AssetKeyMap scheme works well as it incurs ~0 cost to the regular dict s.

If we are not feeling this marker type approach, we could:

  • Get cute / clever with the code to minimize the impact to the scalar key dict s . Checking only one key is one flavor of this that risks some weird errors if someone tries to flow a heterogenous keyed structure through.
  • Use the type hint information - it should be possible, the get_origin / get_args APIs for poking at generics are a little goofy but we've made it work else where. Still would want benchmark and tweak how / when these checks fire to minimize their cost.

@@ -681,10 +708,18 @@ def _pack_value(
_pack_value(item, whitelist_map, f"{descent_path}[{idx}]")
for idx, item in enumerate(cast(list, val))
]
if tval is AssetKeyMap:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can double check in the benchmark but moving this if check down since the condition is more rare may be measurably beneficial

_V = TypeVar("_V")


class AssetKeyMap(Mapping["AssetKey", _V]):
Copy link
Member

@alangenfeld alangenfeld Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not much of a lift to instead of just supporting AssetKey keys, support any serdes-able keys.

I was thinking Instead of the to_string-ed key -> value approach currently used at [1], we could pack them in the style of the items API and do something like {'__map_items__': [[_pack(k), _pack(v)] for k, v in map.items()]}

this idea applies even if we change the detection scheme

if tval is AssetKeyMap:
return {
"__asset_key_map__": {
key.to_string(): _pack_value(value, whitelist_map, f"{descent_path}.{key}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1]

@sryza
Copy link
Contributor

sryza commented Nov 21, 2023

So the performance concern with the type check is in the context of serializing all the other not AssetKey-keyed dict s flowing through and the extra cost on the check to see if they should get the special non scalar key handling.

I'd expect it to be much faster in those cases, because we'd only check the first key and notice that it's not an AssetKey.

@sryza
Copy link
Contributor

sryza commented Nov 21, 2023

I'm still concerned about the AssetKeyMap approach, because it adds to the lore you need to know about the codebase, to do basic data modeling.

@alangenfeld
Copy link
Member

I'd expect it to be much faster in those cases

Right agree the keys check failing should be faster than the keys check passing, but even then this code path is hot enough that adding that key check is impactful.

From this internal discussion around benchmarking

https://dagsterlabs.slack.com/archives/C03CA4TVCAW/p1700520085656059?thread_ts=1700517125.390159&cid=C03CA4TVCAW

seeing the serialization times increase by about 1.6x

@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 656cd32 to 793517c Compare November 21, 2023 19:41
@clairelin135 clairelin135 force-pushed the 11-15-enable_serializing_dicts_keyed_by_asset_key branch from 793517c to bd4854f Compare November 21, 2023 19:44
@clairelin135 clairelin135 changed the title [5/n subset refactor] [serdes] Enable serializing dicts keyed by asset key [5/n subset refactor] [serdes] Enable serializing mappings with non-scalar keys Nov 21, 2023
@clairelin135 clairelin135 dismissed alangenfeld’s stale review November 21, 2023 20:59

Requested changes addressed -- enabled serializing mappings with generic non-scalar keys

Comment on lines 99 to 100
"""Wrapper class for non-scalar key mappings, used to performantly type check when serializing
without having to access types of specific keys.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without having to access types of specific keys.

nit: maybe something like

"without impacting the performance of serializing the more common scalar key dicts. May be replaceable with a different clever scheme"

@@ -686,6 +711,16 @@ def _pack_value(
key: _pack_value(value, whitelist_map, f"{descent_path}.{key}")
for key, value in cast(dict, val).items()
}
if tval is SerializableNonScalarKeyMapping:
return {
"__non_scalar_key_mapping_items__": [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe something a little more terse like __mapping_items__

@clairelin135 clairelin135 merged commit dde3a5e into master Nov 21, 2023
@clairelin135 clairelin135 deleted the 11-15-enable_serializing_dicts_keyed_by_asset_key branch November 21, 2023 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants