
enhancement: profile all cohorts concurrently #491

Open
TomWhite-MedStar opened this issue Sep 29, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@TomWhite-MedStar
Contributor

Problem Statement

On some databases, running DQD against a single cohort can be slower than profiling the entire dataset. On our Databricks instance, for example, profiling one cohort takes longer than profiling the full dataset.

Enhancement Request

Add an option to profile all cohorts within cohortDatabaseSchema.cohortTableName simultaneously:

  1. Add a Boolean parameter -- perhaps groupByCohort
  2. When '@groupByCohort' == 'Yes', add a GROUP BY c.cohort_definition_id clause to the generated query
  3. When '@groupByCohort' == 'Yes', add c.cohort_definition_id to the list of fields returned by each query's SELECT statement
  4. When '@groupByCohort' == 'Yes', add cohort_definition_id INTEGER to the table DDL in result_table_ddl.sql
  5. Update the conditional SQL generation to make the AND c.cohort_definition_id = @cohortDefinitionId clauses optional, and change them to AND c.cohort_definition_id IN (@cohortDefinitionId). That way, if users do not pass a value for cohortDefinitionId, DQD will process all cohorts in cohortDatabaseSchema.cohortTableName; alternatively, users can pass either a single numerical value or a list of numerical values for cohortDefinitionId.
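To illustrate steps 1-5, here is a minimal sketch (in Python, not DQD's actual R/SqlRender implementation) of how the three SQL fragments could be derived from the proposed parameters. The function and parameter names (`build_cohort_clauses`, `group_by_cohort`, `cohort_definition_ids`) are assumptions for illustration only:

```python
# Hypothetical sketch of the conditional SQL-fragment generation
# described in steps 1-5 above. Not DQD's real code; names are invented.

def build_cohort_clauses(group_by_cohort: bool, cohort_definition_ids=None):
    """Return (select_extra, where_clause, group_by_clause) fragments.

    - group_by_cohort: the proposed Boolean parameter (step 1)
    - cohort_definition_ids: None, or a list of cohort IDs (step 5)
    """
    # Steps 2-3: emit the extra SELECT column and GROUP BY clause together
    select_extra = ", c.cohort_definition_id" if group_by_cohort else ""
    group_by = "GROUP BY c.cohort_definition_id" if group_by_cohort else ""

    # Step 5: IN (...) filter only when the user supplied cohort IDs;
    # with no IDs, all cohorts in the cohort table are processed
    if cohort_definition_ids:
        ids = ", ".join(str(i) for i in cohort_definition_ids)
        where = f"AND c.cohort_definition_id IN ({ids})"
    else:
        where = ""
    return select_extra, where, group_by
```

The key design point is that the cohort filter and the cohort grouping are independent: omitting the filter widens the run to all cohorts, while the grouping flag controls whether results are broken out per cohort.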

Benchmarking Data

Our OMOP instance has 5M patients and 8 years of EHR data, totaling 8.5B records across the non-vocabulary CDM tables (excluding the notes tables).

We use the following DQD configuration to pre-generate SQL code in Spark format:

sqlOnly <- TRUE
sqlOnlyUnionCount <- 100
sqlOnlyIncrementalInsert <- TRUE

When we run the generated SQL code against a Databricks Serverless XL warehouse, it takes:

  • 19.5 minutes to process all DQD queries against the full dataset
  • 21.25 minutes to run the same DQD queries against a single cohort

I suspect that it would take only a small percentage more time to process all cohorts concurrently.

@katy-sadowski
Collaborator

Thanks for sharing this idea @TomWhite-MedStar ! This is on theme with the higher level ideas we've been discussing around minimizing unnecessary repetition within & across DQD runs, and performance in general.

We'll probably prioritize the broader overhaul of DQD workflow we were discussing in the Hackathon yesterday before considering any major changes to the check SQL & results/reporting like this one. It's possible we'll address these same requirements differently once we've changed the check execution workflow. And rethinking cohort-level execution will fit in well with subsequent work we hope to do on study/cohort-level data quality. I'll certainly keep this in the backlog to revisit later on!

@katy-sadowski katy-sadowski added the enhancement New feature or request label Sep 30, 2023