Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] feat(metadata): add table spec metadata #20477

Closed

Conversation

sryza
Copy link
Contributor

@sryza sryza commented Mar 14, 2024

Summary & Motivation

This explores an approach to addressing https://github.com/dagster-io/internal/discussions/8709. I'm not convinced it's the right approach, but I think it's a good base for discussion.

The idea: when producing a metadata dictionary from a NamespacedMetadataEntries object, we take values that correspond to pydantic models and automatically turn them into JSON-structured objects, which are then wrapped in JsonMetadataValues.

The main downside here is that, if someone encounters this metadata value in the wild, they won't know what schema/type it's expected to conform to.

A potential next step could be to add a something like a python_class field to JsonMetadataValue, to indicate what python class it corresponds to.

How I Tested These Changes

@sryza sryza changed the base branch from master to column-lineage-metadata-base March 14, 2024 15:47
Copy link

github-actions bot commented Mar 14, 2024

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-516na64jx-elementl.vercel.app
https://column-lineage-metadata.core-storybook.dagster-docs.io

Built with commit 7f7c500.
This pull request is being automatically deployed with vercel-action

@sryza sryza requested review from rexledesma and schrockn March 14, 2024 15:54
Comment on lines 1203 to 1204
if issubclass(value_class, BaseModel):
kwargs[key] = value_class.parse_obj(value.data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would want to use the serdes subsystem here, to ensure compatibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on in what contexts extract is used

Copy link
Contributor

@rexledesma rexledesma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another PR that this depends on? I don't see TableColumnLineages defined anywhere.

@@ -39,20 +41,18 @@
from .table import ( # re-exported
TableColumn as TableColumn,
TableColumnConstraints as TableColumnConstraints,
TableColumnLineages as TableColumnLineages,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan that this is in plural form. I agree that TableColumnLineage would be incorrect, since it can be misinterpreted as the lineage for a singular column.

Could we maybe do TableColumnLevelLineage? Looking up "column lineage", "column level lineage" seems like the way its canonically described.

Then, our metadata key would also be dagster/column_level_lineage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan that this is in plural form.

Agree very strongly here. Pluralized classes like this have been a perpetual source of confusion in our code base. Coming up with variable names for collections of them is extremely annoying. AssetsDefinition is still a plague in this in regard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit orthogonal, but another thought here is that "Lineage" usually refers to a full ancestry, not just parents. Since this class only contains information about the direct parents, "lineage" is probably not actually the right word.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backing up here in the hopes that it will help with naming, here's a description of the thing we need to name: "An object that, for each of the columns in a single table, tracks the parent columns in parent tables that the column is derived from"

Copy link
Contributor Author

@sryza sryza Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableColumnDepMap? TableColumnLevelDepMap? TableColumnParentMap?

"column level lineage" seems like the way its canonically described.

Agree that "column level lineage" is a good way to describe the overall feature. But I think it might be more general than this particular object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care if the user passes a single object here, or a list of objects?

If we are fine with the latter, I would like to go with TableColumnLevelLineageDep, and standardize our metadata key on dagster/column_level_lineage_deps.

It parallels AssetSpec's deps: Iterable[AssetDep]. We circumvent the need to put a specific Python class identifier to name the collection of objects. We instead just pluralize the singular object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - hadn't considered a list of objects. Here's a relevant comment on the other PR: #20407 (comment). Based on conversation with Nick, I think the most straightforward path is to introduce a new MetadataValue subclass, which I think requires us to use a single object instead of a list.

TableColumnLevelLineageDep

This is a bit of a mouthful. If we take out "Level", does that result in a loss of clarity? E.g. does it open up situations where it could be misinterprted what it means?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableColumnDep? If we agree that in XXXDep, XXX is the lineage of what we are talking about, so saying XXXLevelLineage is repetitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right to me

@rexledesma rexledesma force-pushed the column-lineage-metadata branch from b245543 to 9e44acd Compare March 19, 2024 16:19
@sryza sryza requested a review from erinkcochran87 as a code owner March 19, 2024 16:19
@rexledesma rexledesma changed the base branch from column-lineage-metadata-base to master March 19, 2024 16:19
Copy link
Contributor

rexledesma commented Mar 19, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @sryza and the rest of your teammates on Graphite Graphite

@rexledesma rexledesma removed the request for review from erinkcochran87 March 19, 2024 16:21
Copy link

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-dhgwsgzy6-elementl.vercel.app
https://column-lineage-metadata.dagster-university.dagster-docs.io

Built with commit 9e44acd.
This pull request is being automatically deployed with vercel-action

Copy link

Deploy preview for dagit-storybook ready!

✅ Preview
https://dagit-storybook-mw5qyhmag-elementl.vercel.app
https://column-lineage-metadata.components-storybook.dagster-docs.io

Built with commit 9e44acd.
This pull request is being automatically deployed with vercel-action

@rexledesma rexledesma changed the base branch from master to rl/make-column-dependencies-optional March 19, 2024 19:44
@rexledesma rexledesma force-pushed the column-lineage-metadata branch from 9e44acd to f5f6cdb Compare March 19, 2024 19:44
@rexledesma rexledesma force-pushed the column-lineage-metadata branch from f5f6cdb to 6564860 Compare March 19, 2024 19:47
@rexledesma rexledesma changed the title RFC: column lineage and pydantic metadata [RFC] feat(metadata): add column specs Mar 19, 2024


@whitelist_for_serdes
class AssetColumnDep(NamedTuple):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be named TableColumnDep. But calling it TableColumnDep felt weird since TableColumn is already taken and the existing TableColumn doesn't have an existing reference to an asset.

Open to other suggestions.

@rexledesma rexledesma force-pushed the rl/make-column-dependencies-optional branch from e8d6efb to de86c55 Compare March 19, 2024 20:45
@rexledesma rexledesma force-pushed the column-lineage-metadata branch from 6564860 to 995ac5b Compare March 19, 2024 20:45
Copy link
Contributor Author

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly, normalize_metadata will transform TableMetadataEntries.column_specs into a JsonMetadataValue. I think we want it to be some sort of typed metadata value class similar to TableSchemaMetadataValue.

rexledesma and others added 5 commits March 20, 2024 11:35
## Summary & Motivation

- Adds `AssetSelection.tag` (public, experimental), which accepts a key
and a value
- Adds `AssetSelection.tag_string` (internal for now), which accepts a
string interprets it as a key-value pair
- Adds support in the asset CLI for `--select tag:foo=fooval`

## How I Tested These Changes

added tests
@rexledesma rexledesma force-pushed the rl/make-column-dependencies-optional branch from de86c55 to 926c496 Compare March 20, 2024 15:54
@rexledesma rexledesma force-pushed the column-lineage-metadata branch from 995ac5b to 7f7c500 Compare March 20, 2024 15:54
@rexledesma rexledesma changed the title [RFC] feat(metadata): add column specs [RFC] feat(metadata): add table spec metadata Mar 20, 2024
Copy link
Contributor Author

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks solid in this PR except I have a couple reservations about the names "table spec" and "column spec":

  • With AssetSpec and AssetCheckSpec, we use the word "spec" for definitions. I think this could lead to confusion.
  • "Spec" is general enough that I think it could be interpreted to include schema, but that's in a separate metadata value. There's some argument for combining these, but I think sometimes it will be convenient to be able to attach schema and lineage metadata separately.

A couple ideas for names for the top-level object that avoid these issues and also avoid plurals:

  • TableColumnDepMap
  • TableColumnParentage
  • TableColumnLevelParentage

Sidenote, because I created this PR, I can't approve it, so might make sense to close this one and create a new one under your name.

@rexledesma
Copy link
Contributor

@sryza closing this PR in favor of #20597. Will address naming comments there.

@rexledesma rexledesma closed this Mar 20, 2024
@rexledesma rexledesma deleted the column-lineage-metadata branch March 20, 2024 17:55
rexledesma added a commit that referenced this pull request Mar 20, 2024
## Summary & Motivation
Reopening #20477 under my name for review/approval purposes.

- Under `NamespacedMetadataEntries` (which needs to be unpluralized), add a metadata object to store information about observed table columns and those columns' dependencies.
- Allow this metadata object to be structured when served over GraphQL

## How I Tested These Changes
pytest, stacked on #20569
rexledesma added a commit that referenced this pull request Mar 20, 2024
## Summary & Motivation
Apply #20477 on #20407 to see how the API feels.

## How I Tested These Changes
pytest
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation
Reopening #20477 under my name for review/approval purposes.

- Under `NamespacedMetadataEntries` (which needs to be unpluralized), add a metadata object to store information about observed table columns and those columns' dependencies.
- Allow this metadata object to be structured when served over GraphQL

## How I Tested These Changes
pytest, stacked on #20569
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation
Apply #20477 on #20407 to see how the API feels.

## How I Tested These Changes
pytest
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants