Implement microbatch incremental strategy #825
Conversation
Force-pushed from f5d145c to 23d7283
{%- if end_time -%}
{%- do incremental_predicates.append("cast(" ~ event_time ~ " as TIMESTAMP) < '" ~ end_time ~ "'") -%}
{%- endif -%}
{%- do arg_dict.update({'incremental_predicates': incremental_predicates}) -%}
Prepares for the replace_where strategy by adding
cast(<event_time> as TIMESTAMP) >= <start_time> and cast(<event_time> as TIMESTAMP) < <end_time>
as incremental predicates.
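The Jinja logic above can be sketched in plain Python for clarity. This is a hedged illustration, not the actual macro; the function name and signature are hypothetical, but the predicate strings match those built by the macro:

```python
def microbatch_predicates(event_time, start_time, end_time, predicates=None):
    """Build replace-where predicates for one microbatch time slice.

    Mirrors the Jinja macro above: the event_time column is cast to
    TIMESTAMP so Databricks does not coerce the bounds to DATE.
    (Hypothetical helper for illustration only.)
    """
    predicates = list(predicates or [])
    if start_time:
        predicates.append(
            "cast(" + event_time + " as TIMESTAMP) >= '" + str(start_time) + "'"
        )
    if end_time:
        predicates.append(
            "cast(" + event_time + " as TIMESTAMP) < '" + str(end_time) + "'"
        )
    return predicates
```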
and columns[name]["description"] != (column.comment or "")
):
return_columns[name] = columns[name]
if name in columns:
Not sure exactly what introduces this uncertainty, but I've experimentally observed that sometimes config_column is a dict and sometimes it's a ColumnInfo, and these types have different access methods for getting the description.
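Handling both shapes can be done defensively, for example like this. A minimal sketch, assuming only that the dict form keys the description under "description" and the object form exposes a description attribute (the helper name is hypothetical):

```python
def get_description(config_column):
    """Return the description from either a dict or a ColumnInfo-like object.

    config_column is sometimes a dict and sometimes an object with a
    .description attribute; handle both access patterns, defaulting to "".
    (Hypothetical helper for illustration only.)
    """
    if isinstance(config_column, dict):
        return config_column.get("description", "")
    return getattr(config_column, "description", "") or ""
```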
relation: True
columns: True
description: This is a microbatch model
columns:
Added my own schema with column comments to supplement the included tests, since handling comments originally broke my implementation even though the included functional tests passed.
@@ -1 +1 @@
version: str = "1.8.7"
version: str = "1.9.0b1"
Shall the version bump be done in a separate PR?
I can if you would prefer. My reasoning is that this is more of a miss (not having it from the start in the 1.9.latest branch), and keeping it here saves another round of integration tests just to merge the version bump (which is its own issue that I should address at some point). When we release 1.9.0, that will get its own version PR.
Resolves #824
Description
Implements the microbatch incremental strategy: https://docs.getdbt.com/docs/build/incremental-microbatch
The core idea is that dbt determines slices of time to break an insert into multiple statements; we run a replace-where with those slices so that any old data is replaced by the newest version of that data. This makes it much easier for users to backfill and, on failure, to rerun only the slices that failed.
I have to cast the column to TIMESTAMP, because if your event_time column is a date, Databricks casts the conditions to date, and the predicate ends up as
replace where date >= X and date < X
which is never true when both bounds collapse to the same date.
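To make the replace-where mechanics concrete, here is a sketch of the kind of statement issued per time slice. This is an illustration under my own assumptions, not the macro's actual SQL rendering; Databricks supports INSERT INTO ... REPLACE WHERE, and the helper and table names are hypothetical:

```python
def replace_where_sql(table, predicates, select_sql):
    """Assemble a Databricks INSERT ... REPLACE WHERE statement for one
    microbatch slice: rows matching the predicates are atomically replaced
    by the new select. (Hypothetical sketch, not the macro's output.)
    """
    where = " and ".join(predicates)
    return f"insert into {table} replace where {where}\n{select_sql}"
```

For a slice bounded by 2024-01-01 and 2024-01-02, this would produce a statement replacing exactly that day's rows with the freshly computed data.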
I also hit an issue with column comments, which I believe was introduced in dbt-core 1.9.0b2, and have fixed it here.
Checklist
I have updated CHANGELOG.md and added information about my change to the "dbt-databricks next" section.