
First sync for a large data set takes very long and can cause crashes #156

Open
njuguna-n opened this issue Sep 23, 2024 · 7 comments
Labels
Type: Improvement Make something better

Comments

@njuguna-n
Contributor

njuguna-n commented Sep 23, 2024

What feature do you want to improve?
When syncing data for the first time, couch2pg does a good job of copying a lot of data over quickly, but that means dbt struggles to keep up and can end up trying to load millions of rows incrementally, which has led to issues such as this one.

Describe the improvement you'd like
When the sync is running for the first time there should be a flag set so that either dbt or couch2pg can create the main base tables, i.e. document_metadata, contact, and data_record. These tables hold the majority of the data and take the most time during dbt runs, so pre-populating them would reduce the time required for a sync to complete.

Describe alternatives you've considered
We have manually updated the document_metadata table and deleted some documents, but these were temporary measures that should not be applied to all production deployments.

@njuguna-n njuguna-n added the Type: Improvement Make something better label Sep 23, 2024
@njuguna-n
Contributor Author

There are two possible ways to go about this:

  1. Have couch2pg create and populate the tables during that first sync, with dbt not running. Once the sync has fewer than a specified number of docs left to sync, the flag can be removed and dbt can run on the existing and incoming data. This would be the fastest option, but the schema for the tables would live in two places.

  2. Have the dbt container check the flag and create the tables in a non-incremental way (a rough sketch of this flag check is included below). To keep it simple and efficient, the sync would have to be nearly complete, with only a small number of docs left to sync, so that the tables are only created once; then, once the flag is off, dbt can update the tables incrementally.

My preference is for option 1 since the main tables are being populated incrementally. I will start by exploring if there is a way to get the schema from the CHT Pipeline base table package so that we have only one source of truth for the schemas.
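
For illustration, here is a minimal sketch of what the option 2 flag check could look like in a dbt model. The FIRST_SYNC variable, the couchdb source, and the column names are placeholders for the sake of the example, not the actual CHT Pipeline schema:

```sql
-- Hypothetical sketch only: rebuild the table from scratch while the
-- first-sync flag is set, then fall back to normal incremental updates.
-- FIRST_SYNC, the source name, and the columns are assumed names.
{{
  config(
    materialized='incremental',
    unique_key='uuid',
    full_refresh=(env_var('FIRST_SYNC', 'false') == 'true')
  )
}}

SELECT
  _id AS uuid,
  saved_timestamp
FROM {{ source('couchdb', 'couchdb') }}
{% if is_incremental() and env_var('FIRST_SYNC', 'false') != 'true' %}
  WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM {{ this }})
{% endif %}
```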

@njuguna-n
Contributor Author

I ran some tests using batched dbt runs, with the following results:

With a single Couch2pg instance replicating about 700k docs at a rate of 1,000 docs per batch, I set the dbt batch at 10,000 records, and dbt took on average 3,121 seconds to catch up and process all records. Without batching, the average time was 2,263 seconds.

With 7 Couch2pg instances each replicating ~700k docs in batches of 1,000, regular dbt runs were much faster than batched dbt runs. This varied depending on the value I set for the batch size, but regular dbt runs were still faster since there was no limit on the number of rows processed with each run.

I also tried with 20 Couch2pg instances and added a delay of about 5 minutes between dbt runs to simulate models taking a long time to complete, but even in that scenario regular dbt runs were often faster.

My conclusion from this is that in 99% of cases, just using incremental tables with dbt runs as they are will be sufficient and performant. In rare cases like MoH Kenya, where we have a lot of instances with millions of docs and constrained database resources, batched dbt runs would definitely help, but we would be trading off speed for resource conservation. In that case we can run dbt in batches by setting a flag as an environment variable. I have updated this PR to reflect this approach.
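
As a rough sketch of the batched approach (placeholder names, not necessarily what the PR implements): an incremental model can cap the rows it picks up per run with a LIMIT driven by an environment variable, so repeated dbt runs work through the backlog in chunks.

```sql
-- Hypothetical sketch only: process at most DBT_BATCH_SIZE new rows per
-- dbt run; repeated runs catch up on the backlog in chunks.
-- DBT_BATCH_SIZE, the source name, and the columns are assumed names.
{{ config(materialized='incremental', unique_key='uuid') }}

SELECT
  _id AS uuid,
  saved_timestamp
FROM {{ source('couchdb', 'couchdb') }}
{% if is_incremental() %}
  -- unique_key keeps rows re-selected at the boundary idempotent
  WHERE saved_timestamp >= (SELECT MAX(saved_timestamp) FROM {{ this }})
{% endif %}
ORDER BY saved_timestamp
{% if env_var('DBT_BATCH_SIZE', '') != '' %}
LIMIT {{ env_var('DBT_BATCH_SIZE') }}
{% endif %}
```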

@andrablaj
Member

andrablaj commented Oct 10, 2024

@njuguna-n dbt 1.9 has a new feature for microbatching incremental models. Could this new feature solve the problem this ticket describes?

Additionally, you can see more info in this Coalesce presentation, starting from minute 25.

@njuguna-n
Contributor Author

@andrablaj Yes, this feature can solve the problem because the incremental model is loaded in batches. The only drawback I see with this incremental strategy is that the batch size is currently only configurable as a day, so depending on the Couch2pg sync rate and the available Postgres resources, the number of documents to be processed in a single batch might still cause an issue.
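
For reference, a minimal sketch of what the microbatch config might look like for one of the base models. The event_time column, begin date, and source/column names here are placeholders, and the upstream source would also need an event_time config so dbt can filter it per batch:

```sql
-- Hypothetical sketch only: dbt 1.9 microbatch strategy with day-sized
-- batches; column and source names are assumed, not the actual schema.
{{
  config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='saved_timestamp',
    begin='2024-01-01',
    batch_size='day'
  )
}}

SELECT
  _id AS uuid,
  saved_timestamp
FROM {{ source('couchdb', 'couchdb') }}
```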

I will test this config today and assess whether the CHT Sync and CHT Pipeline PRs for this issue are still required.

@njuguna-n
Contributor Author

Testing microbatching with a beta release of dbt 1.9 and this branch of CHT Pipeline, I get the error below.

2024-10-14 18:36:57 15:36:57  Running with dbt=1.9.0-b1
2024-10-14 18:36:57 15:36:57  Registered adapter: postgres=1.9.0-b1
2024-10-14 18:36:57 15:36:57  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:36:57 15:36:57  The selection criterion 'state:modified' does not match any enabled nodes
2024-10-14 18:36:57 15:36:57  Nothing to do. Try checking your model configs and model specification args
2024-10-14 18:36:59 15:36:59  Running with dbt=1.9.0-b1
2024-10-14 18:37:00 15:37:00  Registered adapter: postgres=1.9.0-b1
2024-10-14 18:37:00 15:37:00  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:37:00 15:37:00  
2024-10-14 18:37:00 15:37:00  Concurrency: 1 threads (target='default')
2024-10-14 18:37:00 15:37:00  
2024-10-14 18:37:00 15:37:00  1 of 9 START sql incremental model v1.dbt_results .............................. [RUN]
2024-10-14 18:37:00 15:37:00  1 of 9 OK created sql incremental model v1.dbt_results ......................... [INSERT 0 0 in 0.11s]
2024-10-14 18:37:00 15:37:00  2 of 9 START sql microbatch model v1.document_metadata ......................... [RUN]
2024-10-14 18:37:00 15:37:00  1 of 1 START batch 2024-10-14 of v1.document_metadata .......................... [RUN]
2024-10-14 18:37:00 15:37:00  1 of 1 ERROR creating batch 2024-10-14 of v1.document_metadata ................. [ERROR in 0.00s]
2024-10-14 18:37:00 15:37:00  Encountered an error while running operation: Compilation Error
2024-10-14 18:37:00   'dbt.artifacts.resources.v1.components.ColumnInfo object' has no attribute 'get'
2024-10-14 18:37:00   
2024-10-14 18:37:00   > in macro get_empty_schema_sql (macros/adapters/columns.sql)
2024-10-14 18:37:00   > called by macro assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro default__get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro postgres__create_table_as (macros/adapters.sql)
2024-10-14 18:37:00   > called by macro create_table_as (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro default__get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro materialization_incremental_default (macros/materializations/models/incremental/incremental.sql)
2024-10-14 18:37:00   > called by <Unknown>

@andrablaj
Member

andrablaj commented Oct 14, 2024

@njuguna-n per this discussion, the error above might happen because the dbt-core version is ahead of the postgres adapter version (Registered adapter: postgres=1.9.0-b1 might not exist yet). Have you tried forcing a lower version of the postgres adapter?

@andrablaj
Member

Actually, there is a dbt-postgres pre-release 1.9.0-b1. So nevermind.
