[Bug] Lineage Graph Ignore Sources Declared From graph.sources.values() #10665

Closed
2 tasks done
ghilman27 opened this issue Sep 5, 2024 · 5 comments
Labels: bug (Something isn't working), wontfix (Not a bug or out of scope for dbt-core)

Comments

ghilman27 commented Sep 5, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Given this sources.yml:

version: 2
sources:
  - name: load
    database: "{{ env_var('GCP_PROJECT_ID') }}"
    schema: "{{ env_var('BQ_DATASET_LOAD') }}"
    tables:
      - &masked-source-schema
        name: api_masked_1746
        tags: ["source.masked"]
        config:
          meta:
            id: 1746
            ads_is_used: false
            ...other_meta
      - <<: *masked-source-schema
        name: api_masked_2892
        config:
          meta:
            id: 2892
            ads_is_used: false
            ...other_meta
      - <<: *masked-source-schema
        name: api_masked_10002
        config:
          meta:
            id: 10002
            ads_is_used: false
            ...other_meta
      - <<: *masked-source-schema
        name: api_masked_10005
        config:
          meta:
            id: 10005
            ads_is_used: false
            ...other_meta
      - <<: *masked-source-schema
        name: .....other_source_names
        ....

And my model definition looks like this:

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
      "field": "partition_date",
      "data_type": "date",
      "granularity": "day",
      "time_ingestion_partitioning": true,
      "copy_partitions": true
    }
  )
}}

{# WITH graph.sources #}
{%- set graph_tables = ["api_masked_10002", "api_masked_10005"] -%}
{%- for node in graph.sources.values() -%}
  {% if node.name in graph_tables %}
    SELECT * FROM {{ source('load', node.name) }}
    UNION ALL
  {% endif %}
{%- endfor -%}

{# Using ordinary variables #}
{% set ordinary_tables = ['api_masked_1746', 'api_masked_2892'] %}
{% for ordinary_table in ordinary_tables %}
  SELECT * FROM {{ source('load', ordinary_table) }}
  {% if not loop.last %}
  UNION ALL
  {% endif %}
{% endfor %}

The compiled query then looks like this:

    SELECT * FROM `my_bq_project`.`my_bq_dataset`.`api_masked_10002`
    UNION ALL
  
    SELECT * FROM `my_bq_project`.`my_bq_dataset`.`api_masked_10005`
    UNION ALL
  


  SELECT * FROM `my_bq_project`.`my_bq_dataset`.`api_masked_1746`
  
  UNION ALL
  

  SELECT * FROM `my_bq_project`.`my_bq_dataset`.`api_masked_2892`
  

But the lineage graph only shows api_masked_1746 and api_masked_2892.
api_masked_10002 and api_masked_10005 are missing from it.

Expected Behavior

All sources, including api_masked_10002 and api_masked_10005, appear in the lineage graph.

Steps To Reproduce

Use the same configuration as above in the same environment.

Relevant log output

No screenshot is available because the content is confidential.

Environment

  • I am using the dockerized dbt-bigquery Meltano plugin. The Meltano Docker tag is meltano/meltano:v3.4.2-python3.10. The full output of dbt --version: [image]
  • OS: [image]
  • Output of python --version inside the Docker container: 3.10.14

Which database adapter are you using with dbt?

bigquery

Additional Context

I haven't tried plain dbt yet (that is, outside the dockerized Meltano setup). I'll update if I have time to check that.

ghilman27 added the bug and triage labels Sep 5, 2024
dbeatty10 (Contributor) commented

Thanks for reporting this @ghilman27 !

So are you saying it looks like this:

image

But you expect it to look like this?

image

ghilman27 (Author) commented Sep 6, 2024

The green rectangles for load.api_masked_10002 and load.api_masked_10005 weren't even there. I only saw two green rectangles (load.api_masked_1746 and load.api_masked_2892) with their arrows pointing to the blue rectangle (the model I tried to create).

And yes, I expect it to look like the one you pointed out.

dbeatty10 (Contributor) commented

Here's the key insight:

  • you can't use the graph variable to build your DAG because it is your DAG!

So what you are reporting is actually the behavior we expect, and I'm going to close this as "not planned".

More detail

dbt goes through a couple of main phases before running any SQL: parsing and compilation.

1. Parsing: During parsing, it builds a DAG of nodes (like sources, models, etc).

The result is the manifest. It is what populates the graph context variable, and it is also what powers the lineage visualization.

2. Compilation: Then dbt can use that manifest to fill in the fully qualified table names when compiling a model.
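
The compilation step can be pictured with a toy lookup. This is a minimal sketch, not dbt's implementation: `manifest_sources`, its keys, and `resolve_source` are hypothetical stand-ins, and the project and dataset names are the masked ones from the compiled query earlier in this thread.

```python
# Toy stand-in for the source entries that parsing writes into the manifest.
# The dictionary layout and the helper below are assumptions for illustration.
manifest_sources = {
    "source.my_project.load.api_masked_1746": {
        "source_name": "load",
        "name": "api_masked_1746",
        "database": "my_bq_project",
        "schema": "my_bq_dataset",
    },
}


def resolve_source(source_name: str, table_name: str) -> str:
    """Return a fully qualified, backtick-quoted name for a known source."""
    for node in manifest_sources.values():
        if node["source_name"] == source_name and node["name"] == table_name:
            return f"`{node['database']}`.`{node['schema']}`.`{node['name']}`"
    raise KeyError(f"unknown source {source_name}.{table_name}")


print("SELECT * FROM " + resolve_source("load", "api_masked_1746"))
```

The point of the sketch is only that name resolution reads from a structure that must already exist, which is why it belongs to the second phase.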

Example

You can see more tangibly why all those source nodes are available when you compile but not for the graph visualization if you compile the model code included below.

It prints out the value of the execute variable to make it easier to see which phase it is in.

You'll actually see that dbt runs your code not once, but TWICE -- one time when execute == False (parsing) and another when execute == True (compiling & running).


models/model_10665.sql

{{ log("execute: " ~ execute, True) }}
{{ log("Num sources: " ~ graph.sources | length, True) }}

{# WITH graph.sources #}
{%- set graph_tables = [] -%}
{%- set graph_tables = ["api_masked_10002", "api_masked_10005"] -%}
{%- for node in graph.sources.values() -%}
  {% if node.name in graph_tables %}
    {%- do log("SELECT * FROM " ~ source('load', node.name), True) %}
    SELECT * FROM {{ source('load', node.name) }}
    {%- do log("UNION ALL", True) %}
    UNION ALL
  {% endif %}
{%- endfor -%}

{# Using ordinary variables #}
{% set ordinary_tables = ['api_masked_1746', 'api_masked_2892'] %}
{% for ordinary_table in ordinary_tables %}
  {%- do log("SELECT * FROM " ~ source('load', ordinary_table), True) %}
  SELECT * FROM {{ source('load', ordinary_table) }}
  {% if not loop.last %}
  {%- do log("UNION ALL", True) %}
  UNION ALL
  {% endif %}
{% endfor %}

Compile it with:

dbt compile -s models/model_10665.sql --no-partial-parse

[Screenshot 2024-09-05 at 6 51 39 PM]
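
The two passes described above can be mimicked with a small toy model. This is a hedged sketch, not how dbt is implemented: `render_model`, `SOURCES`, and `record_dependency` are hypothetical names, and it assumes the graph holds no sources during the parsing pass, which is consistent with the behavior reported in this issue.

```python
# Toy model of the two render passes; NOT dbt's actual implementation.
# All names here are hypothetical and exist only for illustration.
SOURCES = [
    "api_masked_1746",
    "api_masked_2892",
    "api_masked_10002",
    "api_masked_10005",
]


def render_model(graph_sources, record_dependency):
    """Mimic the model from the issue: one graph-driven loop, one literal list."""
    selected = []
    # Loop 1: driven by the graph, like `for node in graph.sources.values()`.
    for name in graph_sources:
        if name in ("api_masked_10002", "api_masked_10005"):
            record_dependency(name)  # stands in for source('load', name)
            selected.append(name)
    # Loop 2: driven by a hard-coded list, so it runs in every pass.
    for name in ("api_masked_1746", "api_masked_2892"):
        record_dependency(name)
        selected.append(name)
    return selected


# Pass 1 (execute == False): the graph is assumed empty, so loop 1 never runs
# and only the hard-coded sources are recorded as lineage edges.
dag_edges = []
render_model(graph_sources=[], record_dependency=dag_edges.append)

# Pass 2 (execute == True): the graph is populated, so all four sources reach
# the compiled SQL, but the lineage edges were already fixed in pass 1.
compiled = render_model(graph_sources=SOURCES, record_dependency=lambda name: None)

print("lineage edges:", dag_edges)
print("compiled SQL selects from:", compiled)
```

In this sketch only the hard-coded list contributes lineage edges, while both loops contribute to the compiled SQL, which matches what the issue reports.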

Documentation

We already have a couple of call-outs about this here and here, but we'd be open to feedback if you think there are other ways we can enhance the documentation to explain that the graph can't be used to build the DAG.

image image

dbeatty10 closed this as not planned Sep 6, 2024
dbeatty10 added the wontfix label and removed the triage label Sep 6, 2024
ghilman27 (Author) commented Sep 6, 2024

@dbeatty10 I see! I had a bad feeling that it might be connected to the red "Heads up" call-out :(

My bad 🙏! To be honest, I didn't notice the "Do not use the graph variable to build your DAG" warning, maybe because I am more familiar with the terms "documentation" or "docs" (from the dbt docs generate CLI command) than with "DAG". It is also a bit "counter-intuitive" because I see the compiled query works well but not the generated docs.

A nice improvement would be to show my example here on either the graph or the execute documentation page.
Or maybe add a section that illustrates all the caveats of graph (by comparing execute == False and execute == True).

mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Sep 6, 2024
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-1-dbt-labs.vercel.app/reference/dbt-jinja-functions/execute)

## What are you changing in this pull request and why?

Noticed a couple places where we could add links while responding to
dbt-labs/dbt-core#10665.

## Checklist
- [x] I have reviewed the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.

---------

Co-authored-by: Mirna Wong
dbeatty10 (Contributor) commented

It is also a bit "counter-intuitive" because I see the compiled query works well but not the generated docs

Totally understandable!

Thanks for your suggestions for improving the documentation 🤩. I've included them in dbt-labs/docs.getdbt.com#6027.
