[1/n subset refactor] Split TimeWindowPartitionsSubset #17684

Merged


clairelin135
Contributor

@clairelin135 clairelin135 commented Nov 3, 2023

TimeWindowPartitionsSubset contains a lot of performance-optimization code that relies on representing partitions as keys, but it has to convert those partition keys to time windows in order to serialize. In preparation for serializing the time window version, this PR splits the implementation:

  • BaseTimePartitionsSubset: a time partitions subset superclass that defines abstract methods, and contains helper functions
  • TimePartitionKeyPartitionsSubset: the "partition key" representation
  • TimeWindowPartitionsSubset: the "time window" representation

By default, when creating partitions subsets via partitions_def.empty_subset().with_partition_keys(...), a TimePartitionKeyPartitionsSubset will be created. This class contains a method to convert to a TimeWindowPartitionsSubset when it needs to be serialized.

In future PRs, time partitions subsets can be directly deserialized as TimeWindowPartitionsSubsets.
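
A minimal sketch of the split described above, assuming daily partitions with `%Y-%m-%d` keys. The class names `BaseSubset`, `PartitionKeySubset`, `TimeWindowSubset` and the method `to_time_window_subset` are hypothetical stand-ins for illustration, not the code in this PR:

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta
from typing import List, NamedTuple, Sequence, Set

FMT = "%Y-%m-%d"  # assumed daily partition key format


class TimeWindow(NamedTuple):
    """Half-open interval [start, end); a stand-in for the real TimeWindow."""

    start: datetime
    end: datetime


class BaseSubset(ABC):
    """Shared interface for both representations (hypothetical name)."""

    @abstractmethod
    def get_partition_keys(self) -> List[str]:
        ...

    def __contains__(self, key: str) -> bool:
        return key in self.get_partition_keys()


class TimeWindowSubset(BaseSubset):
    """Stores merged time windows; compact, so suited to serialization."""

    def __init__(self, windows: Sequence[TimeWindow]) -> None:
        self.windows = list(windows)

    def get_partition_keys(self) -> List[str]:
        keys = []
        for window in self.windows:
            day = window.start
            while day < window.end:
                keys.append(day.strftime(FMT))
                day += timedelta(days=1)
        return keys


class PartitionKeySubset(BaseSubset):
    """Stores raw keys; cheap to add to and to test for membership."""

    def __init__(self, keys: Set[str]) -> None:
        self.keys = set(keys)

    def get_partition_keys(self) -> List[str]:
        return sorted(self.keys)

    def __contains__(self, key: str) -> bool:
        return key in self.keys

    def to_time_window_subset(self) -> TimeWindowSubset:
        """Merge consecutive daily keys into windows before serializing."""
        windows: List[TimeWindow] = []
        for key in sorted(self.keys):
            start = datetime.strptime(key, FMT)
            end = start + timedelta(days=1)
            if windows and windows[-1].end == start:
                # Extend the previous window instead of starting a new one.
                windows[-1] = TimeWindow(windows[-1].start, end)
            else:
                windows.append(TimeWindow(start, end))
        return TimeWindowSubset(windows)
```

In this sketch the key-based form does the in-memory work and is converted only at the point where a compact form is needed.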

@clairelin135 clairelin135 changed the base branch from master to claire/new-subset-serialization November 3, 2023 17:21
@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch 2 times, most recently from 5bc3c4d to c9f78fc Compare November 3, 2023 18:47
@clairelin135 clairelin135 changed the base branch from claire/new-subset-serialization to master November 3, 2023 18:47
@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch 4 times, most recently from a3c067a to cc35ffe Compare November 3, 2023 21:18
@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch from cc35ffe to 27d3b83 Compare November 3, 2023 21:52
@clairelin135 clairelin135 changed the title [wip] Split TimeWindowPartitionsSubset into partition key vs time window form [1/n subset refactor] Split TimeWindowPartitionsSubset Nov 3, 2023
@clairelin135 clairelin135 marked this pull request as ready for review November 3, 2023 23:24
)

@property
@cached_method
Contributor

Because it takes no arguments, you can use @cached_property from functools here
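
A small illustration of that suggestion, using a hypothetical `Subset` class rather than the class under review. `functools.cached_property` (Python 3.8+) runs the method body once per instance and stores the result in the instance `__dict__`, so it only works on classes whose instances have a writable `__dict__`:

```python
from functools import cached_property


class Subset:
    """Hypothetical class, only here to demonstrate cached_property."""

    def __init__(self, keys):
        self._keys = list(keys)
        self.compute_count = 0  # tracks how often the property body runs

    @cached_property
    def sorted_keys(self):
        # Runs on first access only; later accesses read the stored value.
        self.compute_count += 1
        return sorted(self._keys)
```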

Args:
    dt_cron_schedule (str): A cron schedule that dt is on one of the ticks of.
"""
if self._included_time_windows is not None:
Contributor

It would never be not None, right?

def __init__(
    self,
    partitions_def: TimeWindowPartitionsDefinition,
    num_partitions: Optional[int] = None,
Contributor

I'd lean towards not providing default values for these arguments and instead requiring the caller to think about them.

Contributor Author

sounds good

self._num_partitions = (
    num_partitions
    if num_partitions
    else self._num_partitions_from_time_windows(partitions_def, included_time_windows)
Contributor

This can be an expensive operation. If it wasn't previously part of the constructor, I'd be nervous that adding it could cause perf issues.

Contributor Author

got it -- I amended this to instead just be optional, and to only calculate on demand
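
One way the "optional, calculated on demand" approach could look, as a sketch only. `WindowSubset` is a hypothetical stand-in, and windows are modeled as `(start, end)` integer index pairs rather than real time windows. The argument has no default, so callers must pass either a known count or `None`:

```python
from typing import List, Optional, Sequence, Tuple


class WindowSubset:
    """Hypothetical stand-in; windows are (start, end) partition indices."""

    def __init__(
        self,
        included_windows: Sequence[Tuple[int, int]],
        num_partitions: Optional[int],  # required: pass a count or None
    ) -> None:
        self._included_windows: List[Tuple[int, int]] = list(included_windows)
        self._num_partitions = num_partitions  # no work done in the constructor

    @property
    def num_partitions(self) -> int:
        # Compare against None, not truthiness, so a known count of 0 is kept.
        if self._num_partitions is None:
            self._num_partitions = sum(
                end - start for start, end in self._included_windows
            )
        return self._num_partitions
```

The `is None` check matters here: a truthiness test such as `if num_partitions` would treat a legitimate count of 0 as missing and recompute it.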

included_time_windows, "included_time_windows", of_type=TimeWindow
)

def get_included_time_windows(self) -> Sequence[TimeWindow]:
Contributor

Why not just call this included_time_windows?

Contributor Author

ah yeah we don't need this (probably accidentally added this)

included_partition_keys=self._included_partition_keys,
)

def resolve(self) -> "TimeWindowPartitionsSubset":
Contributor

Is this needed? I don't see it called anywhere.

Contributor Author

@clairelin135 clairelin135 Nov 7, 2023

it's not used in this PR, but I was planning on using it in later parts of this stack to enable serializing a TimePartitionKeyPartitionsSubset

Contributor

I think it would be clearer to add it in the later part that uses it

Contributor Author

sounds good

@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch 2 times, most recently from d024b99 to e9dc40f Compare November 7, 2023 22:45
Contributor

@sryza sryza left a comment

This looks solid. Last thing: I'm wondering if there's a way to name these classes that makes their inheritance relationship clear. E.g. if I saw BaseTimeWindowPartitionsSubset and TimePartitionKeyPartitionsSubset in the wild, it wouldn't be obvious that the latter is a subclass of the former.

A conventional way of naming these would be something like:

  • TimeWindowPartitionsSubset - abstract base class for subsets of a TimeWindowPartitionsDefinition
  • PartitionKeysTimeWindowPartitionsSubset - a TimeWindowPartitionsSubset that's represented by a set of partition keys.
  • TimeWindowsTimeWindowPartitionsSubset - a TimeWindowPartitionsSubset that's represented by a set of time windows.

TimeWindowsTimeWindowPartitionsSubset is a bit of a gnarly name of course, so definitely not a slam dunk.

included_partition_keys=self._included_partition_keys,
)


class TimeWindowPartitionsSubset(BaseTimeWindowPartitionsSubset):
Contributor

This could benefit from a class-level docstring. Maybe "A PartitionsSubset for a TimeWindowPartitionsDefinition, which internally represents the included partitions using TimeWindow".

@clairelin135
Contributor Author

clairelin135 commented Nov 8, 2023

What do you think of this:

  • BaseTimeWindowPartitionsSubset -> TimePartitionsSubset
  • TimePartitionKeyPartitionsSubset stays as is
  • TimeWindowPartitionsSubset stays as is

@sryza
Contributor

sryza commented Nov 8, 2023

@clairelin135 how would you describe each of those?

@clairelin135
Contributor Author

@sryza

  • TimePartitionsSubset: Base class for time partitions subsets that contains shared logic (i.e. building time windows from partition keys, checking equality/existence, etc.)
  • TimePartitionKeyPartitionsSubset: TimePartitionsSubset where included partitions are internally represented as strings. Primarily used in in-memory contexts to more performantly add/check partitions.
  • TimeWindowPartitionsSubset: TimePartitionsSubset where included partitions are internally represented as time windows. Primarily used for serialization to compress targeted windows.

@sryza
Contributor

sryza commented Nov 8, 2023

We describe a TimeWindowPartitionsDefinition as "A set of partitions where each partition corresponds to a time window." I'm concerned that TimePartitionsSubset might imply that there's a difference between a "time partition" and a "time window partition", especially if there's also a class called TimeWindowPartitionsSubset.

The reason I'm belaboring this is that some of these are going to end up in storage, so we won't be able to easily change them.

@clairelin135
Contributor Author

@sryza that makes sense.

I'm hesitant about the naming of TimeWindowsTimeWindowPartitionsSubset because it feels redundant.

What if we kept BaseTimeWindowPartitionsSubset as is? I think "base" being in the name implies that it's a superclass.

Then we could rename TimePartitionKeyPartitionsSubset to PartitionKeysTimeWindowPartitionsSubset in order to clarify that it's a TimeWindowPartitionsSubset. I'm indifferent between PartitionKeysTimeWindowPartitionsSubset and TimeWindowPartitionKeysPartitionsSubset

@sryza
Contributor

sryza commented Nov 9, 2023

@clairelin135 great, I like it. I vote for PartitionKeysTimeWindowPartitionsSubset, because it roughly (modulo "Base") follows the convention that a subclass of X that's specific in way Y should be named YX.
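
To make the "YX" convention concrete, here is a sketch of the base class and the partition-key subclass under the names agreed on above, each with a class-level docstring. The class names come from this thread; the method `get_partition_keys` and the constructor signature are assumptions for illustration only:

```python
from abc import ABC, abstractmethod
from typing import AbstractSet, FrozenSet


class BaseTimeWindowPartitionsSubset(ABC):
    """Abstract base class for subsets of time window partitions."""

    @abstractmethod
    def get_partition_keys(self) -> AbstractSet[str]:
        """Return the included partition keys (hypothetical method)."""


class PartitionKeysTimeWindowPartitionsSubset(BaseTimeWindowPartitionsSubset):
    """A time window partitions subset represented internally by partition keys."""

    def __init__(self, included_partition_keys: AbstractSet[str]) -> None:
        self._included_partition_keys: FrozenSet[str] = frozenset(
            included_partition_keys
        )

    def get_partition_keys(self) -> AbstractSet[str]:
        return self._included_partition_keys
```

Read right to left, the subclass name says what it is (a time window partitions subset) and then how it is specialized (by partition keys).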

Contributor

@sryza sryza left a comment

Last comment is that the classes we discussed could benefit from class-level docstrings. Otherwise, this looks good to go.

@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch from 92c25eb to af983ea Compare November 9, 2023 18:16
@clairelin135 clairelin135 force-pushed the 11-02-split_time_window_partitions_subset_implementation branch from af983ea to ecc6f42 Compare November 10, 2023 00:23
@clairelin135 clairelin135 merged commit 525c3cc into master Nov 10, 2023
@clairelin135 clairelin135 deleted the 11-02-split_time_window_partitions_subset_implementation branch November 10, 2023 20:27