Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dbt): emit column dependencies using sqlglot #20407

Merged
merged 1 commit into from
Mar 20, 2024

Conversation

rexledesma
Copy link
Contributor

@rexledesma rexledesma commented Mar 11, 2024

Summary & Motivation

Makes use of the great sqlglot library to build column lineage metadata when executing a dbt project.

We do this in the following steps:

  1. Retrieve the current dbt node's SQL file and its parents' column schemas.
  2. Retrieve the column names from the current node.
  3. For each column, retrieve its dependencies on upstream columns from direct parents. Basically just invoke lineage from sqlglot)
  4. Render the lineage as a JSON blob on the asset materialization for the dbt node.

To retrieve the dbt node's parents, and those corresponding nodes' column schemas, we augment our dagster dbt package implementation from #19623 to emit column schemas for the dbt node's parents. We make use of the dbt model variable to retrieve dbt node's refs/sources as relation objects to pass to adapter.get_columns_in_relation.

How I Tested These Changes

pytest

  • assert expected column dependencies against jaffle shop
  • assert expected column dependencies against executing a subset of jaffle shop
  • assert expected column dependencies against executing a subset of jaffle shop with ambiguous column selection (e.g. select *)

Copy link
Contributor Author

rexledesma commented Mar 11, 2024

@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 6 times, most recently from ac721fb to 69142c1 Compare March 12, 2024 15:01
@rexledesma rexledesma marked this pull request as ready for review March 14, 2024 16:33
@rexledesma rexledesma requested a review from sryza March 14, 2024 16:33
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this won't get lineage in the cases where we don't know the columns on the root nodes - is that right? Assuming the idea is to do that in a followup?

@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 2 times, most recently from c7308a2 to 6e08330 Compare March 17, 2024 18:29
@rexledesma rexledesma requested review from sryza and bengotow March 17, 2024 18:32
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 2 times, most recently from 5ba9174 to 55a2f82 Compare March 18, 2024 01:47
@rexledesma rexledesma changed the base branch from master to rl/remove-tests March 18, 2024 11:51
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 4 times, most recently from ab841d7 to 9e61034 Compare March 18, 2024 17:21
Copy link

github-actions bot commented Mar 18, 2024

Deploy preview for dagster-docs ready!

Preview available at https://dagster-docs-901u68c52-elementl.vercel.app
https://rl-column-lineage-metadata.dagster.dagster-docs.io

Direct link to changed pages:

@rexledesma rexledesma changed the title feat(dbt): emit column lineage using sqlglot feat(dbt): emit column dependencies using sqlglot Mar 18, 2024
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch from 9c2119a to c33f5da Compare March 18, 2024 21:06
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here looks great except for the MetadataValue stuff. I think life will be easier if we work that out before merging this. Otherwise, we'll need to deal with situations (annoying even if just internally) where we load data that's serialized in the old format.

Last thing is that I'm a little concerned about the perf impact. I don't think this is a blocker for merging, but I think it's worth seeing what the impact is in purina.

@rexledesma
Copy link
Contributor Author

Everything here looks great except for the MetadataValue stuff. I think life will be easier if we work that out before merging this. Otherwise, we'll need to deal with situations (annoying even if just internally) where we load data that's serialized in the old format.

We always take the most recent materialization, so if serialization is a problem, we could just re-materialize to blow it away.

Last thing is that I'm a little concerned about the perf impact. I don't think this is a blocker for merging, but I think it's worth seeing what the impact is in purina.

We will dogfood this with our setup to see if there are latency concerns.

@rexledesma rexledesma requested a review from sryza March 18, 2024 22:05
Base automatically changed from rl/remove-tests to master March 18, 2024 22:06
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 2 times, most recently from 579cb49 to a953dac Compare March 19, 2024 19:43
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch from a953dac to 11709ae Compare March 19, 2024 20:45
@rexledesma rexledesma removed the request for review from erinkcochran87 March 19, 2024 21:40
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch from 11709ae to 4f0105d Compare March 20, 2024 15:54
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch 2 times, most recently from 0456137 to ba083cc Compare March 20, 2024 22:13
@rexledesma rexledesma force-pushed the rl/column-lineage-metadata branch from ba083cc to 0628a9c Compare March 20, 2024 22:39
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I think this should either get squashed with #20569 or they should be merged in close succession.

Copy link
Contributor Author

rexledesma commented Mar 20, 2024

Merge activity

@rexledesma rexledesma merged commit 5d0f8fd into master Mar 20, 2024
2 checks passed
@rexledesma rexledesma deleted the rl/column-lineage-metadata branch March 20, 2024 23:11
rexledesma added a commit that referenced this pull request Mar 20, 2024
## Summary & Motivation
Apply #20477 on #20407 to see how the API feels.

## How I Tested These Changes
pytest
benpankow pushed a commit that referenced this pull request Mar 21, 2024
## Summary & Motivation
Makes use of the great `sqlglot` library to build column lineage metadata when executing a dbt project.

We do this in the following steps:

1. Retrieve the current dbt node's SQL file and its parents' column schemas.
2. Retrieve the column names from the current node. 
3. For each column, retrieve its dependencies on upstream columns from direct parents. Basically just invoke [`lineage`](https://sqlglot.com/sqlglot/lineage.html#lineage) from `sqlglot`)
4. Render the lineage as a JSON blob on the asset materialization for the dbt node.

To retrieve the dbt node's parents, and those corresponding nodes' column schemas, we augment our `dagster` dbt package implementation from #19623 to emit column schemas for the dbt node's parents. We make use of the dbt [`model`](https://docs.getdbt.com/reference/dbt-jinja-functions/model) variable to retrieve dbt node's refs/sources as relation objects to pass to [`adapter.get_columns_in_relation`](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation).

## How I Tested These Changes
pytest
- assert expected column dependencies against jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop with ambiguous column selection (e.g. `select *`)
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation
Makes use of the great `sqlglot` library to build column lineage metadata when executing a dbt project.

We do this in the following steps:

1. Retrieve the current dbt node's SQL file and its parents' column schemas.
2. Retrieve the column names from the current node. 
3. For each column, retrieve its dependencies on upstream columns from direct parents. Basically just invoke [`lineage`](https://sqlglot.com/sqlglot/lineage.html#lineage) from `sqlglot`)
4. Render the lineage as a JSON blob on the asset materialization for the dbt node.

To retrieve the dbt node's parents, and those corresponding nodes' column schemas, we augment our `dagster` dbt package implementation from #19623 to emit column schemas for the dbt node's parents. We make use of the dbt [`model`](https://docs.getdbt.com/reference/dbt-jinja-functions/model) variable to retrieve dbt node's refs/sources as relation objects to pass to [`adapter.get_columns_in_relation`](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation).

## How I Tested These Changes
pytest
- assert expected column dependencies against jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop
- assert expected column dependencies against executing a subset of jaffle shop with ambiguous column selection (e.g. `select *`)
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation
Apply #20477 on #20407 to see how the API feels.

## How I Tested These Changes
pytest
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants