-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Incremental model logic fails when initial loaded table is empty #9997
Comments
If you expect this use case of an empty initial table, can't you do a:
|
This was not an expected case on our side. We had to spend time to find out that the root cause was an empty table. I imagine this could happen to others. Maybe it is worth being known as a use case for incremental models. It could easily be taken care of by adding this check in the is_incremental() macro: It should act as a fix and also a highlight of this possible case. |
Thanks for these insight @b-per and @BenameurZyad. I'm going to re-categorize this a feature request for further consideration:
|
Thanks a bunch. You guys are awsome. |
In the meantime, I opened this issue to improve our documentation: |
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-4-dbt-labs.vercel.app//docs/build/incremental-models#filtering-rows-on-an-incremental-run) ## What are you changing in this pull request and why? resolves #5321 To ensure that the updated code will work for a broad number of users without issues, I tested the following example against these data platforms: - bigquery - databricks - duckdb - postgres - redshift - snowflake <img width="782" alt="image" src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/0739892e-6f5d-45b8-ac1d-5bbd844cf096"> ☝️ Notice the table is empty, like the edge case scenario described in dbt-labs/dbt-core#9997 <img width="772" alt="image" src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/87ad6438-082f-4d65-9a8b-f97d36497c8e"> ☝️ Notice it successfully added new data when it arrived. <details> <summary> ### Reprex </summary> Create this file: `models/my_incremental.sql` ```sql {{ config(materialized="incremental") }} with non_empty_cte as ( select 1 as id, cast('2024-01-01' as date) as event_time ), empty_cte as ( select 0 as id, cast('1999-12-31' as date) as event_time from non_empty_cte where 0=1 ) select * {% if var("scenario", "empty") == "empty" %} from empty_cte {% else %} from non_empty_cte {% endif %} {% if is_incremental() %} -- this filter will only be applied on an incremental run -- (uses >= to include records whose timestamp occurred since the last run of this model) where event_time >= (select coalesce(max(event_time), cast('1900-01-01' as date)) from {{ this }}) {% endif %} ``` Assuming a `profiles.yml` with all the relevant profile names, run these commands: ```shell dbt run --profile duckdb -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}" dbt run --profile duckdb -s my_incremental --vars '{scenario: empty}' dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}" dbt run --profile duckdb -s my_incremental --vars '{scenario: non_empty}' dbt show --profile duckdb --inline "select * from {{ ref('my_incremental') }}" dbt run --profile postgres -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}" dbt run --profile postgres -s my_incremental --vars '{scenario: empty}' dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}" dbt run --profile postgres -s my_incremental --vars '{scenario: non_empty}' dbt show --profile postgres --inline "select * from {{ ref('my_incremental') }}" dbt run --profile redshift -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}" dbt run --profile redshift -s my_incremental --vars '{scenario: empty}' dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}" dbt run --profile redshift -s my_incremental --vars '{scenario: non_empty}' dbt show --profile redshift --inline "select * from {{ ref('my_incremental') }}" dbt run --profile databricks -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}" dbt run --profile databricks -s my_incremental --vars '{scenario: empty}' dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}" dbt run --profile databricks -s my_incremental --vars '{scenario: non_empty}' dbt show --profile databricks --inline "select * from {{ ref('my_incremental') }}" dbt run --profile snowflake -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}" dbt run --profile snowflake -s my_incremental --vars '{scenario: empty}' dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}" dbt run --profile snowflake -s my_incremental --vars '{scenario: non_empty}' dbt show --profile snowflake --inline "select * from {{ ref('my_incremental') }}" dbt run --profile bigquery -s my_incremental --vars '{scenario: empty}' --full-refresh dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}" dbt run --profile bigquery -s my_incremental --vars '{scenario: empty}' dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}" dbt run --profile bigquery -s my_incremental --vars '{scenario: non_empty}' dbt show --profile bigquery --inline "select * from {{ ref('my_incremental') }}" ``` </details> ## Checklist - [x] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
@BenameurZyad I've updated the documentation within dbt-labs/docs.getdbt.com#5326 so that it includes an example that will work when the incremental table is empty. I'm going to close this issue in favor of that update to the documentation. If we get additional feedback that this is not sufficient, we can consider re-opening this. |
Is this a new bug in dbt-core?
Current Behavior
Currently, the is_incremental() macro checks different conditions in a model before activating the where clause to increment the load.
There is a case where, for any given reason (the source table was empty for example), the first build of a model results in an empty table.
The macro will check correctly all the conditions :
1- The model must already exist in the database
2- The destination table already exists in the database
3- The full-refresh flag is not passed
4- The running model is configured with materialized='incremental'
and activate the where clause :
where event_time >= (select max(event_time) from {{ this }})
Unfortunately, the where clause will not work because the returned max(event_time) will be null, consequence of the destination table in the database being empty. This causes the incremental model to be blocked by the empty table, and never increment the table, even when the data becomes available in the source ref.
Expected Behavior
The is_incremental() macro should include an additional condition :
5- The destination table is not empty
This should correct the faulty behaviour in the where clause introduced by the empty destination table in the database.
Whenever the data becomes available (after a fix for example) the model will build correctly.
Steps To Reproduce
1- Create an incremental model
2- Make sure the first run results in an empty table in the destination
3- Fix the model and make the data available again
4- Run the incremental model
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response
The text was updated successfully, but these errors were encountered: