
enhancement: profile all cohorts concurrently #491

Open
TomWhite-MedStar opened this issue Sep 29, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@TomWhite-MedStar
Contributor

Problem Statement

On some databases, running DQD against a single cohort can be slower than profiling the entire dataset. On our Databricks instance, for example, profiling one cohort takes longer than profiling the full dataset.

Enhancement Request

Add an option to profile all cohorts within cohortDatabaseSchema.cohortTableName simultaneously:

  1. Add a Boolean parameter -- perhaps groupByCohort
  2. When '@groupByCohort' == 'Yes', add a GROUP BY c.cohort_definition_id clause to the generated query
  3. When '@groupByCohort' == 'Yes', add c.cohort_definition_id to the list of fields returned by each query's SELECT statement
  4. When '@groupByCohort' == 'Yes', add cohort_definition_id INTEGER to the table DDL in result_table_ddl.sql
  5. Update the conditional SQL generation to make the AND c.cohort_definition_id = @cohortDefinitionId clauses optional, and change them to AND c.cohort_definition_id IN (@cohortDefinitionId). That way, if users do not pass a value for cohortDefinitionId, DQD will process all cohorts in cohortDatabaseSchema.cohortTableName; alternatively, users can pass either a single numerical value or a list of numerical values for cohortDefinitionId.
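To illustrate steps 1-5, here is a minimal sketch (in Python, not DQD's actual R/SqlRender implementation) of how the three SQL fragments could be derived from the proposed parameters. The function and parameter names (`build_cohort_clauses`, `group_by_cohort`, `cohort_definition_ids`) are assumptions for illustration only:

```python
# Hypothetical sketch of the conditional SQL-fragment generation
# described in steps 1-5 above. Not DQD's real code; names are invented.

def build_cohort_clauses(group_by_cohort: bool, cohort_definition_ids=None):
    """Return (select_extra, where_clause, group_by_clause) fragments.

    - group_by_cohort: the proposed Boolean parameter (step 1)
    - cohort_definition_ids: None, or a list of cohort IDs (step 5)
    """
    # Steps 2-3: emit the extra SELECT column and GROUP BY clause together
    select_extra = ", c.cohort_definition_id" if group_by_cohort else ""
    group_by = "GROUP BY c.cohort_definition_id" if group_by_cohort else ""

    # Step 5: IN (...) filter only when the user supplied cohort IDs;
    # with no IDs, all cohorts in the cohort table are processed
    if cohort_definition_ids:
        ids = ", ".join(str(i) for i in cohort_definition_ids)
        where = f"AND c.cohort_definition_id IN ({ids})"
    else:
        where = ""
    return select_extra, where, group_by
```

The key design point is that the cohort filter and the cohort grouping are independent: omitting the filter widens the run to all cohorts, while the grouping flag controls whether results are broken out per cohort.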

Benchmarking Data

Our OMOP instance has 5M patients and 8 years of EHR data, totaling 8.5B records across the non-vocabulary CDM tables (excluding the notes tables).

We use the following DQD configuration to pre-generate SQL code in Spark format:

sqlOnly <- TRUE
sqlOnlyUnionCount <- 100
sqlOnlyIncrementalInsert <- TRUE

When we run the generated SQL code against a Databricks Serverless XL warehouse, it takes:

  • 19.5 minutes to process all DQD queries against the full dataset
  • 21.25 minutes to run the same DQD queries against a single cohort

I suspect that it would take only a small percentage more time to process all cohorts concurrently.

@katy-sadowski
Collaborator

Thanks for sharing this idea @TomWhite-MedStar ! This is on theme with the higher level ideas we've been discussing around minimizing unnecessary repetition within & across DQD runs, and performance in general.

We'll probably prioritize the broader overhaul of DQD workflow we were discussing in the Hackathon yesterday before considering any major changes to the check SQL & results/reporting like this one. It's possible we'll address these same requirements differently once we've changed the check execution workflow. And rethinking cohort-level execution will fit in well with subsequent work we hope to do on study/cohort-level data quality. I'll certainly keep this in the backlog to revisit later on!

@katy-sadowski katy-sadowski added the enhancement New feature or request label Sep 30, 2023