-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase #150
Comments
Thanks for reporting this @nfarinha ! Based on your screenshots of the query plan, it looks like you are using dbt-snowflake. Let's dive into the details. Non-unique
|
Just tested now:
delete from DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory
using DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory__dbt_tmp
where (
DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory__dbt_tmp.version_code = DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory.version_code
and
DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory__dbt_tmp.level_code = DEV_CUSTOMER_SEG.dbt_z003kwbj_MARTS.dim_hier_versioned_territory.level_code
);
***@***.***
Cancelled it.... Not worth to wait.
***@***.***
Deleted directly on Snowsight with the changed query.
***@***.***
Executed in no time as expected.
***@***.***
It's quite simple to replicate but let me know if you need some more detail on this. I'm very happy to redeem all the effort and time you guys have already spared me.
Best regards,
Nuno Farinha
From: Doug Beatty ***@***.***>
Sent: Wednesday, December 13, 2023 20:23
To: dbt-labs/dbt-core ***@***.***>
Cc: Farinha Goncalves, Nuno (DI IT EH PT 2) ***@***.***>; Mention ***@***.***>
Subject: Re: [dbt-labs/dbt-core] [CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase (Issue dbt-labs/dbt-adapters#150)
Thanks for reporting this @nfarinha<https://github.com/nfarinha> !
Based on your screenshots of the query plan, it looks like you are using dbt-snowflake. Let's dive into the details.
Non-unique unique_key
The delete statement is assuming that the key is really unique. That probably is not the case if you are using "delete+insert".
The folks that have contributed to dbt-labs/docs.getdbt.com#4355<dbt-labs/docs.getdbt.com#4355> are saying something similar about use-cases for the delete+insert incremental strategy.
Our documentation currently states that unique_key is expected to be truly unique here<https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional>, but then it also gives guidance<https://docs.getdbt.com/reference/resource-configs/snowflake-configs#merge-behavior-incremental-models> on using delete+insert within dbt-snowflake when it isn't truly unique.
For the sake of this discussion, let's assume that we want to support the use case of non-unique keys with delete+insert specifically, but expecting truly unique keys with the other incremental strategies.
The delete+insert strategy in dbt-snowflake
The delete+insert strategy<https://github.com/dbt-labs/dbt-snowflake/blob/0374b4ec948982f2ac8ec0c95d53d672ad19e09c/dbt/include/snowflake/macros/materializations/merge.sql#L35-L38> in dbt-snowflake defers to the implementation of default__get_delete_insert_merge_sql<https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L59> within dbt-core).
The delete+insert strategy in dbt-core
It only does a delete portion if unique_key is defined, otherwise, it only does an insert.
When it does a delete, it has two different code paths introduced in dbt-labs/dbt-core#4858<dbt-labs/dbt-core#4858> depending on if unique_key is a list or not.
1. If unique_key is a list:
https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L65-L77
1. If unique_key is a single column:
https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L79-L89
Your suggestion
Your suggestion is to use:
delete from big_table
where (version_code ,bu_agg) in (
select version_code ,bu_agg from big_table__dbt_tmp
)
Your suggestion looks really close to that 2nd code path!
So it would essentially involve eliminating the 1st code path in favor of the 2nd (with some small tweaks, of course).
Summary
To the best I can tell, the current logic is expected to give correct results on tiny data sets but it quickly runs into severe performance issues with normal-sized data sets.
We'd need to try out your suggestion with a similar-sized data set (20K pre-existing x 20K updates) to see if it executes in a reasonable amount of time.
-
Reply to this email directly, view it on GitHub<#150>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BETND4CJEHVAHO5CTQLAPVTYJIFBXAVCNFSM6AAAAABATVI4T2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJUGY2DQNJZG4>.
You are receiving this because you were mentioned.Message ID: ***@***.******@***.***>>
[CT-3493]: https://dbtlabs.atlassian.net/browse/CT-3493?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
|
I can absolutely confirm that the current multi-column insert+delete query strategy is beyond inefficient. My tests never finish, even with small amounts of data (~100K rows). The proposed strategy appears like the best approach:
Some initial/quick testing appears to be faster with Any ideas on when this fix might be applied? |
I had to rewrite this macro for an urgent use-case, so here's the improved code (faster and cleaner):
|
resolves dbt-labs#150 Problem The delete query for the 'delete+insert' incremental_strategy with 2+ unique_key columns is VERY inefficient. In many cases, it will hang and never return for deleting small amounts of data (<100K rows). Solution Improve the query by switching to a much more efficient delete strategy: ``` delete from table1 where (col1, col2) in ( select distinct col1, col2 from table1_tmp ) ```
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Removing the stale label as this is still relevant |
The same issue presists in DBT cloud, but DBT support says that they are not planning to change delete + insert approach anytime in future. |
Is this a new bug in dbt-core?
Current Behavior
The scenario is as follows:
• A model that is loading 20k rows
• Is setup as incremental , delete+insert
• The unique_key is an array of 2 columns
• The 2 columns (unique_key) define a single “partition” with all columns (200k)
• The tmp table creation executes fast as expected
• The delete takes hours until we cancelled it
• The execution plan shows that the delete statement is combining 20k * 20k = 200 million rows
The configuration is
The delete statement is assuming that the key is really unique. That probably is not the case if you are using "delete+insert".
Expected Behavior
The delete statement should not assume the unique key.
My suggestion is to use :
Steps To Reproduce
• Setup the model as incremental , delete+insert
• The unique_key is an array of 2 columns
• Build the model
• Check the execution plan
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
No response
Additional Context
No response
The text was updated successfully, but these errors were encountered: