Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[amp-refactor][8/n] Whitelist AssetDaemonCursor for serdes #18954

Merged
merged 1 commit into from
Jan 19, 2024

Conversation

OwenKephart
Copy link
Contributor

@OwenKephart OwenKephart commented Dec 29, 2023

Summary & Motivation

Here, we finally (mostly) free ourselves from the tyranny of the bespoke serdes protocol that we were using for the AssetDaemonCursor.

At a high level, a few things need to happen in order for this to be possible:

  1. Before, we didn't actually bother passing the "extras" value through to the new cursor, because we didn't actually serialize the new cursor once it was created (and instead relied on the legacy method of keeping track of the values we cared about between ticks). Now, we need to actually wire that up.
  2. Before, the legacy serdes protocol handled cases where a serialized subset became invalid between ticks (because we were using the SerializedPartitionsSubset class). Now, we need to have some separate handling on the AssetSubset class to handle similar scenarios gracefully.

This PR also adds in a description field to each of the Condition objects. This was mostly to help with debugging when working on this PR (makes it easier to understand the snapshots when they're printed out), but it's something we know we'll want in the future, so I just went ahead and kept it.

We do still have to keep around a somewhat nasty backcompat path to handle the first tick after the conversion from legacy cursor to new cursor, but this should only be hit a single time per deployment.

In theory, it'd be possible to just start from an empty cursor if it was not possible to deserialize it into the new scheme, but the consequence would be that:

a) If anything was requested on the tick directly before the conversion, it would get re-requested (as we'd have no recollection of requesting that asset)
b) The tick would take much longer, as we'd need to rebuild everything from scratch.

This backcompat path can be removed at some point in the future, once we're confident that a large percentage of users will be operating with the new cursor structure (e.g. some number of months).

How I Tested These Changes

@OwenKephart
Copy link
Contributor Author

OwenKephart commented Dec 29, 2023

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@@ -120,12 +124,47 @@ def from_asset_partitions_set(
),
)

def valid_subset(self, partitions_def: Optional[PartitionsDefinition]) -> "AssetSubset":
Copy link
Contributor

@sryza sryza Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth @clairelin135 taking a look at this

if raw_cursor is None:
return AssetDaemonCursor.empty(default_evaluation_id)
try:
decoded = base64.b64decode(raw_cursor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important do we think this gzipping is? I'd rather have the serdes framework handle this than have custom logic for it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably not strictly necessary to do the gzipping (there aren't going to be that many total cursors floating around out there, as we only keep track of the most recent one), but it does make the cursor around 20x smaller. Probably worth discussing in standup tomorrow.

"""

condition_snapshot: "AssetConditionSnapshot"
extras: ExtrasDict
extras: Mapping[str, PackableValue]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about how "extras" will play out as a user-facing name? What are the APIs that we'd expose where users would interact with it? Might it end up in the UI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question -- for the built-in rules, I wasn't imagining this ever making its way into the UI, as it's really just an implementation detail. but there's definitely a world in which users might need to work with this for custom rules, in which case it might end up in the UI for debugging purposes (although I don't think that's a hard requirement)

Purely on the python level, I'd expect users to interact with it through the context something like:

def my_eval_fn(context):
    subset_with_some_property = context.extras
    subset_with_some_property |= ...
    context.set_extras(subset_with_some_property)
    ...

This pattern is pretty similar to the existing cursor pattern we have for sensors, but this is more permissive in what's allowed to be stored in this field (i.e. instead of just a raw string, you can put any serializable object, which is a pretty nice QoL improvement)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the docstring for set_extras look like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something along the lines of:

# method on the user context object
def update_extra_value(self, value: PackableValue) -> None:
    """Updates the extra value for this asset condition, which will be provided on the context for the
        next condition evaluation.

        This can be used to keep track of progress and avoid duplicate work across sensor
        evaluations.
        """

I'm of two minds on if this should actually just be called "cursor" in user-facing land (due to its similarities with the existing sensor cursor concept, and the fact that they likely won't need to / be allowed to interact directly with the "real" cursor object)

We do have a fair amount of latitude in choosing whatever user-facing name we want for this (although it would be best if the names lined up of course), but in a world where we showed "cursor" to the user, I'd probably still want to call these values "extra_values" or something of the sort in our internal storage representation (as they really are extra values on top of some other important cursor information)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "cursor" definitely the right name for the non-extra data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "cursor" pretty accurately describes the purpose of the data that we're tracking here. We could consider calling it something like "previous evaluation state", but I don't think that really frees up the word "cursor" for the purely-extra data.

Maybe we could just call the "extra" data "extra_state"? The user can then grab "extra_state" off the context and update "extra_state" as desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/dagster-io/dagster/pull/19035/files this PR updates those names, so we can hash out the final naming there

Copy link
Contributor

@clairelin135 clairelin135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at the valid_subset function and dropped some thoughts

@OwenKephart OwenKephart force-pushed the AssetConditionCursor branch from 7561ebf to 05629cd Compare January 9, 2024 17:07
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch 3 times, most recently from 07423a0 to d2c5da6 Compare January 11, 2024 23:09
@OwenKephart OwenKephart requested a review from sryza January 11, 2024 23:10
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch from d2c5da6 to 2fe4ea1 Compare January 11, 2024 23:58
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch 3 times, most recently from 3216f40 to ad97b74 Compare January 12, 2024 18:01
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch from ad97b74 to 9f7fe6e Compare January 17, 2024 23:10
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch from 9f7fe6e to 8e405e1 Compare January 18, 2024 23:17
@OwenKephart
Copy link
Contributor Author

OwenKephart commented Jan 19, 2024

Merge activity

  • Jan 19, 4:20 PM: @OwenKephart started a stack merge that includes this pull request via Graphite.
  • Jan 19, 4:33 PM: Graphite rebased this pull request as part of a merge.
  • Jan 19, 4:34 PM: @OwenKephart merged this pull request with Graphite.

Base automatically changed from AssetConditionCursor to master January 19, 2024 21:32
@OwenKephart OwenKephart force-pushed the Whitelist_AssetDaemonCursor_for_serdes branch from 8e405e1 to ef07600 Compare January 19, 2024 21:32
@OwenKephart OwenKephart merged commit b9f618a into master Jan 19, 2024
1 check was pending
@OwenKephart OwenKephart deleted the Whitelist_AssetDaemonCursor_for_serdes branch January 19, 2024 21:34
PedramNavid pushed a commit that referenced this pull request Jan 26, 2024
## Summary & Motivation

Here, we finally (mostly) free ourselves from the tyranny of the bespoke serdes protocol that we were using for the AssetDaemonCursor.

At a high level, a few things need to happen in order for this to be possible:

1. Before, we didn't actually bother passing the "extras" value through to the new cursor, because we didn't actually serialize the new cursor once it was created (and instead relied on the legacy method of keeping track of the values we cared about between ticks). Now, we need to actually wire that up.
2. Before, the legacy serdes protocol handled cases where a serialized subset became invalid between ticks (because we were using the SerializedPartitionsSubset class). Now, we need to have some separate handling on the AssetSubset class to handle similar scenarios gracefully.

This PR also adds in a description field to each of the Condition objects. This was mostly to help with debugging when working on this PR (makes it easier to understand the snapshots when they're printed out), but it's something we know we'll want in the future, so I just went ahead and kept it.

We do still have to keep around a somewhat nasty backcompat path to handle the first tick after the conversion from legacy cursor to new cursor, but this should only be hit a single time per deployment.

In theory, it'd be possible to just start from an empty cursor if it was not possible to deserialize it into the new scheme, but the consequence would be that:

a) If anything was requested on the tick directly before the conversion, it would get re-requested (as we'd have no recollection of requesting that asset)
b) The tick would take much longer, as we'd need to rebuild everything from scratch.

This backcompat path can be removed at some point in the future, once we're confident that a large percentage of users will be operating with the new cursor structure (e.g. some number of months).

## How I Tested These Changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants