[3/n subset refactor] Add whitelist_for_serdes to TimeWindowPartitionsSubset #17702

Merged: clairelin135 merged 8 commits into master from claire/serialize-time-window-partitions-subset on Nov 16, 2023

Conversation

Contributor

@clairelin135 clairelin135 commented Nov 3, 2023

This PR converts TimeWindowPartitionsSubset to a named tuple decorated with @whitelist_for_serdes. Most of the diff is field renames (e.g. self._included_time_windows -> self.included_time_windows).

The one logic change adds a before_pack hook to whitelist_for_serdes. The hook allows modifying the named tuple before it is serialized, and this PR uses it to force calculating the number of partitions in the TimeWindowPartitionsSubset. It is needed because, for performance, TimeWindowPartitionsSubset delays calculating the number of partitions until necessary.
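
As a rough illustration of the hook described above (not Dagster's actual serdes code), the sketch below runs a before_pack callback ahead of a JSON dump. `Subset`, `fill_num_partitions`, and `pack` are names invented for this example, and the count is simply the number of stored windows, which is an assumption standing in for the real calculation.

```python
import json
from typing import Callable, NamedTuple, Optional, Sequence


class Subset(NamedTuple):
    """Hypothetical stand-in for a subset whose count is computed lazily."""

    included_windows: Sequence[str]
    num_partitions: Optional[int] = None  # None means "not computed yet"


def fill_num_partitions(value: Subset) -> Subset:
    """Hypothetical before_pack hook: return a copy with the count filled in."""
    if value.num_partitions is None:
        # Named tuples are immutable, so _replace builds an updated copy.
        return value._replace(num_partitions=len(value.included_windows))
    return value


def pack(value: Subset, before_pack: Callable[[Subset], Subset]) -> str:
    """Run the hook, then serialize the resulting fields as JSON."""
    return json.dumps(before_pack(value)._asdict())


lazy = Subset(included_windows=["2023-11-01", "2023-11-02"])
packed = pack(lazy, fill_num_partitions)
```

The original `lazy` value is left untouched; only the copy handed to the serializer carries the computed count.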

@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch 8 times, most recently from dbaca7d to a138c9f Compare November 6, 2023 21:05
@clairelin135 clairelin135 marked this pull request as ready for review November 6, 2023 21:46
).get_last_partition_window(current_time=current_time)

if not first_tw or not last_tw:
check.failed("No partitions found")

if len(self.included_time_windows) == 0:
if len(self.get_included_time_windows()) == 0:
Contributor

would be cleaner in this method if there was a single included_time_windows = self.get_included_time_windows() up here, rather than invoking get_included_time_windows() a bunch below
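
A minimal sketch of the suggestion above, assuming a hypothetical `Subset` class and a hypothetical `first_and_last` method (neither is taken from the PR): the getter is called once and the local is reused.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # hypothetical (start, end) pair standing in for TimeWindow


class Subset:
    """Hypothetical class; only the shape of the method matters here."""

    def __init__(self, windows: List[Window]) -> None:
        self._windows = windows

    def get_included_time_windows(self) -> List[Window]:
        return self._windows

    def first_and_last(self) -> Tuple[Window, Window]:
        # Fetch once, then reuse the local instead of calling the getter repeatedly.
        included_time_windows = self.get_included_time_windows()
        if len(included_time_windows) == 0:
            raise ValueError("No partitions found")
        return included_time_windows[0], included_time_windows[-1]
```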

@sryza
Contributor

sryza commented Nov 7, 2023

Named tuple subclassing doesn't work great -- superclass @abstractproperty methods override named tuple fields. So the PartitionsSubset.partitions_def property overrides the TimeWindowPartitionsSubset.partitions_def named tuple field, which sucks. Furthermore, named tuple field names cannot start with an underscore.

_asdict could help with this:

from abc import ABC, abstractproperty
from typing import NamedTuple


class Superclass(ABC):
    @abstractproperty
    def foo(self) -> str:
        ...


class Subclass(Superclass, NamedTuple("_Subclass", [("foo", str)])):
    @property
    def foo(self):
        return self._asdict()["foo"]

bar = Subclass("fdsjkfld")
print(bar.foo)

@sryza
Contributor

sryza commented Nov 7, 2023

I think reversing the order of the parent classes also helps:

class Subclass(NamedTuple("_Subclass", [("foo", str)]), Superclass):
    ...
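
To make the effect of base-class order concrete, here is a small sketch that inspects which descriptor the method resolution order selects for `foo`. `SuperclassFirst`, `FieldsFirst`, and `_Fields` are illustrative names, and the abstract property is written with `@property` plus `@abstractmethod` rather than the deprecated `abstractproperty`.

```python
from abc import ABC, abstractmethod
from typing import NamedTuple


class Superclass(ABC):
    @property
    @abstractmethod
    def foo(self) -> str:
        ...


_Fields = NamedTuple("_Fields", [("foo", str)])


class SuperclassFirst(Superclass, _Fields):
    """Superclass precedes the named tuple in the MRO, so its property wins."""


class FieldsFirst(_Fields, Superclass):
    """The named tuple precedes Superclass in the MRO, so the field wins."""


# Looking up `foo` on each class shows which descriptor the MRO selects.
shadowed = SuperclassFirst.foo is Superclass.foo
field_wins = FieldsFirst.foo is _Fields.foo

# With the named tuple first, attribute access returns the stored field value.
value = FieldsFirst("fdsjkfld").foo
```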

@clairelin135 clairelin135 force-pushed the claire/tw-partitions-def-whitelist-for-serdes branch 2 times, most recently from d264af4 to f320752 Compare November 7, 2023 22:47
@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch 3 times, most recently from 5335af4 to 8ccd9d2 Compare November 7, 2023 23:56
@clairelin135
Contributor Author

_asdict could help with this

Ahhh this is great!! Such subtle arts....

@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch 2 times, most recently from e5bd107 to f667fd8 Compare November 8, 2023 01:20
@clairelin135 clairelin135 force-pushed the claire/tw-partitions-def-whitelist-for-serdes branch from f320752 to 0b5fa15 Compare November 9, 2023 18:05
@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch from f667fd8 to 41b7804 Compare November 9, 2023 18:06
@clairelin135 clairelin135 force-pushed the claire/tw-partitions-def-whitelist-for-serdes branch from 0b5fa15 to d629763 Compare November 9, 2023 18:16
@clairelin135 clairelin135 force-pushed the claire/tw-partitions-def-whitelist-for-serdes branch from 2cdad9c to e97b5b2 Compare November 13, 2023 20:54
@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch 2 times, most recently from 7e1cb7a to f72f939 Compare November 13, 2023 21:04
Base automatically changed from claire/tw-partitions-def-whitelist-for-serdes to master November 14, 2023 17:52
@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch from f72f939 to 927eb63 Compare November 14, 2023 17:53
# is needed to improve performance. When serializing, we want to serialize the number of
# partitions, so we force calculation.
def before_pack(self, value: "TimeWindowPartitionsSubset") -> "TimeWindowPartitionsSubset":
    if value._asdict()["num_partitions"] is None:
Contributor

value.num_partitions doesn't work here?

Contributor Author

value.num_partitions will calculate the # partitions if the field in the tuple is None

What we really want is to check if the field value is None

Contributor

Ah I see - I think either worth including a comment or exposing a boolean property that tells whether it's computed.
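
One way the boolean-property idea could look, as a sketch only: `LazySubset`, its `windows` field, and `is_num_partitions_computed` are hypothetical names, and the count is just the number of windows. The point is that `_asdict()` reads the stored tuple value, while the `num_partitions` property computes a value when the stored field is None.

```python
from typing import NamedTuple, Optional, Sequence


class LazySubset(
    NamedTuple(
        "_LazySubset",
        [("windows", Sequence[str]), ("num_partitions", Optional[int])],
    )
):
    """Hypothetical subset whose count is computed on first access."""

    @property
    def is_num_partitions_computed(self) -> bool:
        # _asdict() reads the stored tuple value and bypasses the property below.
        return self._asdict()["num_partitions"] is not None

    @property
    def num_partitions(self) -> int:
        stored = self._asdict()["num_partitions"]
        # Stand-in computation: the real count would come from the time windows.
        return len(self.windows) if stored is None else stored


lazy = LazySubset(windows=["a", "b", "c"], num_partitions=None)
```

A before_pack hook could then check `value.is_num_partitions_computed` instead of reaching into `_asdict()` directly.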

@property
def included_time_windows(self) -> Sequence[TimeWindow]:
return self._included_time_windows
return _num_partitions
Contributor

self.?

Contributor Author

Ah this is a local variable set to self._asdict()["num_partitions"]

included_time_windows, "included_time_windows", of_type=TimeWindow
check.sequence_param(included_time_windows, "included_time_windows", of_type=TimeWindow)

time_windows_with_timezone = [
Contributor

Why exactly is this necessary?

Contributor Author

Without this change, the time windows will remain in UTC upon deserialization (since that is how the DatetimeFieldSerializer deserializes).

We currently have logic in from_serialized to convert these time windows to the timezone of the partitions def, so I added these lines to ensure that the timezone of the time windows is correct.

I'm not sure that it's strictly necessary (since all the tests were passing), but I figured this would probably be a beneficial change regardless

Contributor

Without this change, the time windows will remain in UTC upon deserialization (since that is how the DatetimeFieldSerializer deserializes).

Could we make DatetimeFieldSerializer preserve the timezone? It seems risky that the deserialized datetime could have a different timezone than the pre-serialized datetime.

Contributor Author

good call -- I added a change that serializes with timezone

@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch 3 times, most recently from 9d11a25 to 1aadacf Compare November 15, 2023 23:20
Contributor

@OwenKephart OwenKephart left a comment

note comments, otherwise great!


@cached_property
def num_partitions(self) -> int:
if self._num_partitions is None:
_num_partitions = self._asdict()["num_partitions"]
Contributor

the underscore notation here is a bit weird, this could just be "num_partitions"

Contributor

or num_partitions_ if you want to disambiguate from the function name

@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch from f88b440 to cd6e384 Compare November 16, 2023 20:49
@@ -54,24 +63,59 @@
from .partition_key_range import PartitionKeyRange


# UTCTimestampWithTimezone is used to preserve timezone information when serializing.
Contributor Author

To elaborate on this a bit more, datetime.isoformat() will just store the UTC offset of the datetime. When this value is unpacked, adding timedeltas (or similar) is inexact because the IANA timezone no longer exists on the object. (More details on Stack Overflow)

In order to prevent any lossy serialization, this implementation serializes both the datetime float and the IANA timezone so that deserialization yields the exact datetime before serialization.
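
The difference can be seen in a short sketch using the standard-library zoneinfo module (Python 3.9+, with system timezone data available). `PackedDatetime`, `pack_datetime`, and `unpack_datetime` are hypothetical names for illustration and are not the classes added in this PR.

```python
from datetime import datetime, timedelta
from typing import NamedTuple
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


class PackedDatetime(NamedTuple):
    """Hypothetical serialized form: epoch seconds plus an IANA timezone name."""

    timestamp: float  # seconds since the Unix epoch
    timezone: str  # e.g. "America/Los_Angeles"


def pack_datetime(dt: datetime, timezone: str) -> PackedDatetime:
    return PackedDatetime(dt.timestamp(), timezone)


def unpack_datetime(packed: PackedDatetime) -> datetime:
    return datetime.fromtimestamp(packed.timestamp, tz=ZoneInfo(packed.timezone))


tz_name = "America/Los_Angeles"
original = datetime(2023, 11, 4, 12, 0, tzinfo=ZoneInfo(tz_name))

# Round trip through isoformat(): only the fixed UTC offset (-07:00) survives.
via_iso = datetime.fromisoformat(original.isoformat())
# Round trip through timestamp + zone name: the IANA zone survives.
via_packed = unpack_datetime(pack_datetime(original, tz_name))

# Both round trips give the same instant...
same_instant = via_iso == original == via_packed
# ...but stepping one day across the Nov 5, 2023 DST change diverges.
one_day = timedelta(days=1)
iso_next = via_iso + one_day
packed_next = via_packed + one_day
```

After the one-day step, `iso_next` still carries the fixed -07:00 offset, while `packed_next` follows the zone's rules and lands on -08:00, so the two no longer refer to the same instant.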

# We can't store datetime.isoformat() because it only preserves UTC offsets, which vary depending on
# daylight savings time.
@whitelist_for_serdes
class UTCTimestampWithTimezone(NamedTuple):
Contributor

Naming nitpicks:

  • I think this could be just called TimestampWithTimezone, because timestamps (unlike datetimes) are usually assumed to be in UTC.
  • datetime_float -> timestamp. And worth including a comment that this refers to seconds since the Unix epoch.

clairelin135 and others added 8 commits November 16, 2023 13:36
continue

time window partitions subset changes

asset backfill serialization

partition mapping update

continue refactor

fix more tests

more test fixes

fix partition mapping tests

adjust test

fix more tests

add tests
@clairelin135 clairelin135 force-pushed the claire/serialize-time-window-partitions-subset branch from cd6e384 to 0e3a4fe Compare November 16, 2023 21:42
@clairelin135 clairelin135 merged commit b88f409 into master Nov 16, 2023
1 check passed
@clairelin135 clairelin135 deleted the claire/serialize-time-window-partitions-subset branch November 16, 2023 22:30
3 participants