Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Incremental model logic fails when initial loaded table is empty #9997

Closed
2 tasks done
BenameurZyad opened this issue Apr 22, 2024 · 6 comments
Closed
2 tasks done
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@BenameurZyad
Copy link

BenameurZyad commented Apr 22, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently, the is_incremental() macro checks different conditions in a model before activating the where clause to increment the load.
There is a case where, for any given reason (the source table was empty for example), the first build of a model results in an empty table.

The macro will check correctly all the conditions :

1- The model must already exist in the database
2- The destination table already exists in the database
3- The full-refresh flag is not passed
4- The running model is configured with materialized='incremental'

and activate the where clause :
where event_time >= (select max(event_time) from {{ this }})

Unfortunately, the where clause will not work because the returned max(event_time) will be null, consequence of the destination table in the database being empty. This causes the incremental model to be blocked by the empty table, and never increment the table, even when the data becomes available in the source ref.

Expected Behavior

The is_incremental() macro should include an additional condition :
5- The destination table is not empty

This should correct the faulty behaviour in the where clause introduced by the empty destination table in the database.
Whenever the data becomes available (after a fix for example) the model will build correctly.

Steps To Reproduce

1- Create an incremental model
2- Make sure the first run results in an empty table in the destination
3- Fix the model and make the data available again
4- Run the incremental model

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@BenameurZyad BenameurZyad added bug Something isn't working triage labels Apr 22, 2024
@b-per
Copy link
Contributor

b-per commented Apr 22, 2024

If you expect this use case of an empty initial table, can't you do a:

where event_time >= (select coalesce(max(event_time),'1900-01-01') from {{ this }})

@BenameurZyad
Copy link
Author

BenameurZyad commented Apr 22, 2024

This was not an expected case on our side. We had to spend time to find out that the root cause was an empty table.
We thought something didn't work as expected with the incremental config, as in our pipeline, nothing showed that the table was empty.
The solution you proposed is exactly what we implemented to fix it.

I imagine this could happen to others. Maybe it is worth being known as a use case for incremental models.
I guess it is not exactly a bug. More like an unexpected behavior if you are not warned.

It could easily be taken care of by adding this check in the is_incremental() macro:
5- The destination table is not empty

It should act as a fix and also a highlight of this possible case.

@dbeatty10
Copy link
Contributor

Thanks for these insight @b-per and @BenameurZyad. I'm going to re-categorize this a feature request for further consideration:

  • add a check in the is_incremental() macro for an empty table.

@dbeatty10 dbeatty10 added enhancement New feature or request and removed bug Something isn't working labels Apr 22, 2024
@dbeatty10 dbeatty10 changed the title Incremental model logic fails when initial loaded table is empty [Feature] Incremental model logic fails when initial loaded table is empty Apr 22, 2024
@BenameurZyad
Copy link
Author

Thanks a bunch. You guys are awsome.

@dbeatty10
Copy link
Contributor

In the meantime, I opened this issue to improve our documentation:

dbeatty10 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Apr 24, 2024
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-4-dbt-labs.vercel.app//docs/build/incremental-models#filtering-rows-on-an-incremental-run)

## What are you changing in this pull request and why?

resolves #5321

To ensure that the updated code will work for a broad number of users
without issues, I tested the following example against these data
platforms:
- bigquery
- databricks
- duckdb
- postgres
- redshift
- snowflake

<img width="782" alt="image"
src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/0739892e-6f5d-45b8-ac1d-5bbd844cf096">

☝️ Notice the table is empty, like the edge case scenario described in
dbt-labs/dbt-core#9997

<img width="772" alt="image"
src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/87ad6438-082f-4d65-9a8b-f97d36497c8e">

☝️ Notice it successfully added new data when it arrived.

<details>
<summary>

### Reprex
</summary>

Create this file:

`models/my_incremental.sql`

```sql
{{ config(materialized="incremental") }}

with

non_empty_cte as (

    select 1 as id, cast('2024-01-01' as date) as event_time

),

empty_cte as (

    select 0 as id, cast('1999-12-31' as date) as event_time
    from non_empty_cte
    where 0=1

)

select *

{% if var("scenario", "empty") == "empty" %}

  from empty_cte

{% else %}

  from non_empty_cte

{% endif %}

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records whose timestamp occurred since the last run of this model)
  where event_time >= (select coalesce(max(event_time), cast('1900-01-01' as date)) from {{ this }})

{% endif %}
```

Assuming a `profiles.yml` with all the relevant profile names, run these
commands:

```shell
dbt run  --profile duckdb -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile duckdb -s my_incremental --vars '{scenario: empty}'
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile duckdb -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: empty}'
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile postgres -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: empty}'
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile redshift -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: empty}'
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile databricks -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: empty}'
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile snowflake -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: empty}' --full-refresh
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: empty}'
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
dbt run  --profile bigquery -s my_incremental --vars '{scenario: non_empty}'
dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}"
```

</details>

## Checklist
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
@dbeatty10
Copy link
Contributor

@BenameurZyad I've updated the documentation within dbt-labs/docs.getdbt.com#5326 so that it includes an example that will work when the incremental table is empty.

I'm going to close this issue in favor of that update to the documentation. If we get additional feedback that this is not sufficient, we can consider re-opening this.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Apr 24, 2024
@dbeatty10 dbeatty10 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

3 participants