feat(dbt): emit column dependencies using sqlglot
#20407
If I understand correctly, this won't get lineage in the cases where we don't know the columns on the root nodes - is that right? Assuming the idea is to do that in a followup?
Everything here looks great except for the MetadataValue stuff. I think life will be easier if we work that out before merging this. Otherwise, we'll need to deal with situations (annoying even if just internally) where we load data that's serialized in the old format.
Last thing is that I'm a little concerned about the perf impact. I don't think this is a blocker for merging, but I think it's worth seeing what the impact is in purina.
We always take the most recent materialization, so if serialization is a problem, we could just re-materialize to blow it away.
We will dogfood this with our setup to see if there are latency concerns.
Nice. I think this should either get squashed with #20569 or they should be merged in close succession.
## Summary & Motivation

Makes use of the great `sqlglot` library to build column lineage metadata when executing a dbt project. We do this in the following steps:

1. Retrieve the current dbt node's SQL file and its parents' column schemas.
2. Retrieve the column names from the current node.
3. For each column, retrieve its dependencies on upstream columns from direct parents (basically just invoke [`lineage`](https://sqlglot.com/sqlglot/lineage.html#lineage) from `sqlglot`).
4. Render the lineage as a JSON blob on the asset materialization for the dbt node.

To retrieve the dbt node's parents, and those corresponding nodes' column schemas, we augment our `dagster` dbt package implementation from #19623 to emit column schemas for the dbt node's parents. We make use of the dbt [`model`](https://docs.getdbt.com/reference/dbt-jinja-functions/model) variable to retrieve the dbt node's refs/sources as relation objects to pass to [`adapter.get_columns_in_relation`](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation).

## How I Tested These Changes

pytest

- assert expected column dependencies against jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop with ambiguous column selection (e.g. `select *`)