-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] feat(metadata): add table spec metadata #20477
[RFC] feat(metadata): add table spec metadata #20477
Conversation
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit 7f7c500. |
if issubclass(value_class, BaseModel): | ||
kwargs[key] = value_class.parse_obj(value.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would want to use the serdes
subsystem here, to ensure compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on in what contexts extract
is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another PR that this depends on? I don't see TableColumnLineages
defined anywhere.
@@ -39,20 +41,18 @@ | |||
from .table import ( # re-exported | |||
TableColumn as TableColumn, | |||
TableColumnConstraints as TableColumnConstraints, | |||
TableColumnLineages as TableColumnLineages, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan that this is in plural form. I agree that TableColumnLineage
would be incorrect, since it can be misinterpreted as the lineage for a singular column.
Could we maybe do TableColumnLevelLineage
? Looking up "column lineage", "column level lineage" seems like the way its canonically described.
Then, our metadata key would also be dagster/column_level_lineage
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan that this is in plural form.
Agree very strongly here. Pluralized classes like this have been a perpetual source of confusion in our code base. Coming up with variable names for collections of them is extremely annoying. AssetsDefinition
is still a plague in this in regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit orthogonal, but another thought here is that "Lineage" usually refers to a full ancestry, not just parents. Since this class only contains information about the direct parents, "lineage" is probably not actually the right word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backing up here in the hopes that it will help with naming, here's a description of the thing we need to name: "An object that, for each of the columns in a single table, tracks the parent columns in parent tables that the column is derived from"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TableColumnDepMap
? TableColumnLevelDepMap
? TableColumnParentMap
?
"column level lineage" seems like the way its canonically described.
Agree that "column level lineage" is a good way to describe the overall feature. But I think it might be more general than this particular object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care if the user passes a single object here, or a list of objects?
If we are fine with the latter, I would like to go with TableColumnLevelLineageDep
, and standardize our metadata key on dagster/column_level_lineage_deps
.
It parallels AssetSpec
's deps: Iterable[AssetDep]
. We circumvent the need to put a specific Python class identifier to name the collection of objects. We instead just pluralize the singular object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting - hadn't considered a list of objects. Here's a relevant comment on the other PR: #20407 (comment). Based on conversation with Nick, I think the most straightforward path is to introduce a new MetadataValue
subclass, which I think requires us to use a single object instead of a list.
TableColumnLevelLineageDep
This is a bit of a mouthful. If we take out "Level", does that result in a loss of clarity? E.g. does it open up situations where it could be misinterprted what it means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TableColumnDep
? If we agree that in XXXDep
, XXX
is the lineage of what we are talking about, so saying XXXLevelLineage
is repetitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right to me
b245543
to
9e44acd
Compare
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Deploy preview for dagster-university ready! ✅ Preview Built with commit 9e44acd. |
Deploy preview for dagit-storybook ready! ✅ Preview Built with commit 9e44acd. |
9e44acd
to
f5f6cdb
Compare
f5f6cdb
to
6564860
Compare
|
||
|
||
@whitelist_for_serdes | ||
class AssetColumnDep(NamedTuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be named TableColumnDep
. But calling it TableColumnDep
felt weird since TableColumn
is already taken and the existing TableColumn
doesn't have an existing reference to an asset.
Open to other suggestions.
e8d6efb
to
de86c55
Compare
6564860
to
995ac5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding correctly, normalize_metadata
will transform TableMetadataEntries.column_specs
into a JsonMetadataValue
. I think we want it to be some sort of typed metadata value class similar to TableSchemaMetadataValue
.
## Summary & Motivation - Adds `AssetSelection.tag` (public, experimental), which accepts a key and a value - Adds `AssetSelection.tag_string` (internal for now), which accepts a string interprets it as a key-value pair - Adds support in the asset CLI for `--select tag:foo=fooval` ## How I Tested These Changes added tests
de86c55
to
926c496
Compare
995ac5b
to
7f7c500
Compare
Deploy preview for dagster-docs ready! Preview available at https://dagster-docs-3vhmkda6l-elementl.vercel.app Direct link to changed pages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks solid in this PR except I have a couple reservations about the names "table spec" and "column spec":
- With
AssetSpec
andAssetCheckSpec
, we use the word "spec" for definitions. I think this could lead to confusion. - "Spec" is general enough that I think it could be interpreted to include schema, but that's in a separate metadata value. There's some argument for combining these, but I think sometimes it will be convenient to be able to attach schema and lineage metadata separately.
A couple ideas for names for the top-level object that avoid these issues and also avoid plurals:
TableColumnDepMap
TableColumnParentage
TableColumnLevelParentage
Sidenote, because I created this PR, I can't approve it, so might make sense to close this one and create a new one under your name.
926c496
to
a683924
Compare
## Summary & Motivation Reopening #20477 under my name for review/approval purposes. - Under `NamespacedMetadataEntries` (which needs to be unpluralized), add a metadata object to store information about observed table columns and those columns' dependencies. - Allow this metadata object to be structured when served over GraphQL ## How I Tested These Changes pytest, stacked on #20569
## Summary & Motivation Reopening #20477 under my name for review/approval purposes. - Under `NamespacedMetadataEntries` (which needs to be unpluralized), add a metadata object to store information about observed table columns and those columns' dependencies. - Allow this metadata object to be structured when served over GraphQL ## How I Tested These Changes pytest, stacked on #20569
Summary & Motivation
This explores an approach to addressing https://github.com/dagster-io/internal/discussions/8709. I'm not convinced it's the right approach, but I think it's a good base for discussion.
The idea: when producing a metadata dictionary from a
NamespacedMetadataEntries
object, we take values that correspond to pydantic models and automatically turn them into JSON-structured objects, which are then wrapped inJsonMetadataValue
s.The main downside here is that, if someone encounters this metadata value in the wild, they won't know what schema/type it's expected to conform to.
A potential next step could be to add a something like a
python_class
field toJsonMetadataValue
, to indicate what python class it corresponds to.How I Tested These Changes