DBT Marts DAG #274
lprzychodzien
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Airflow and DBT
Currently we use Airflow to orchestrate data ETL jobs for our Postgres Database. DBT is used as the business logic layer onto of the airflow end state data. Both work together to ensure that SageRx delivers reliable data to its desired end state.
Problem
Each Airflow DAG is scheduled to run on various predefined timeframes. As the data becomes more intertwined we have DBT marts that are reliant on data from multiple DAGs. This brings up the issue of aligning DAGs so that the marts are updated with the latest data while not needlessly rerunning things (cost issue).
Purpose
The goal of this discussion is to come to a consensus on which approach is best for where the product is right now. This approach will most likely change as SageRx grows and matures.
Approaches
Approach 1 - Logging Changes
DBT Mart is a stand alone DAG.
Each source DAG logs data changes which are then used to determine what downstream marts should be ran. We only run mart updates on data that has changed.
Example, if FDA exclusions DAG is run but no new data comes in then no need to rebuild the downstream marts.
Pros
Cons
Approach 2 - Airflow Sensors
DBT Mart is a stand alone DAG.
With this approach we utilize Airflow ExternalTaskSensor to make a task run only after another DAG is completed. This works by having Mart sensors check for a final completion task of the source DAG.
Example, if the FDA Excluded DAG had multiple source tables then that DAG would have a final Airflow EmptyOperator (such as FDA_Excluded_Complete) that the marts DAG would be checking for.
Pros
Cons
Approach 3 - DBT within each dependent DAG
DBT work is part of each DAG that relies on it.
This is a simple approach, run the DBT jobs that are dependent on a DAG at the end of said DAG.
This would look like the following: extract -> load -> transform -> DBT tasks.
Pros:
Cons:
Approach 4 - Full Refresh DAG
Separate DAG that runs a DBT ad-hoc full refresh every defined period of time, such as daily (overnight). This is the simplest approach since all DBT logic is self contained and not dependent on upstream Airflow work.
Pros
Cons
Approach 5 - Utilize DBT incremental models
DBT Mart is a standalone DAG.
DBT has built in functionality around DBT building/updating models incrementally (DBT documentation link). This allows for rows that have been updated to drive the downstream DBT table changes, making sure that only the required rows are impacted.
Pros
Cons
Conclusion and Personal Recommendation
Utilizing Approach 5 makes the most sense since the functionality is already built into DBT. However, with the lack of persistent change logging, combining it with approach 1 is the most comprehensive solution. It satisfies both logging from a quality and refresh perspective. It also can help keep operating costs low while SageRx scales.
Beta Was this translation helpful? Give feedback.
All reactions