Cluster utilization on Databricks #6290
brendan-bull
started this conversation in
General
Hi,
I asked this question on the support channel and they encouraged me to start a discussion (https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1667397899097119)
Background:
Let me start by providing some background. We have a fairly typical data validation scenario: we need to run various kinds of table-, column-, and row-level validation over large CSVs before sending them to a downstream ETL process. The largest of the CSVs will be a few hundred million rows.

When we started exploring GX, we were using the PandasExecutionEngine because that's what we saw in the tutorial. That quickly ran out of memory on larger test files. We asked a question on the forum and someone pointed us at the Spark backend. We tried that, and it worked both locally and on Databricks, which is where our production workloads would run.

The problem we hit is that jobs take a long time to complete on Spark. It appears that each expectation roughly corresponds to a single Spark job, and there's never more than one active job at a time. So we get ~25% CPU utilization on the Spark cluster, and a sample CSV with 100 million rows and 140 expectations (no approximate-quantile operations either, as those are particularly slow) takes almost 50 minutes to complete.
Workaround Attempt
To improve throughput, I tried to programmatically split the expectation suite into multiple suites and run multiple validations as part of the same checkpoint. I'd then glue the validation results back together into a unified result and regenerate the data docs. The idea here was to give Spark more simultaneous jobs to chew on. This improved CPU utilization to ~60% and jobs finished about 2.5x faster. However, there were underlying Spark query errors, as noted in the Slack link above. It was confirmed in that thread that concurrent validations on the same data source are not supported. Breaking things into separate checkpoints and re-merging would probably be the next step, but before I go there, I wanted to ask if there's a better way.
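For concreteness, the splitting step could look roughly like the following round-robin partition. This is a minimal sketch, not the actual code I ran: it assumes the suite's expectations are available as a flat list of configs, and the function name is made up.

```python
def split_suite(expectations, n_parts):
    """Round-robin a flat list of expectation configs into n_parts
    sub-lists, so each sub-list can become its own suite/validation.

    Round-robin (rather than contiguous slices) spreads slow
    expectation types evenly across the resulting suites.
    """
    parts = [[] for _ in range(n_parts)]
    for i, exp in enumerate(expectations):
        parts[i % n_parts].append(exp)
    # Drop empty partitions when n_parts exceeds the number of expectations.
    return [p for p in parts if p]
```

With 140 expectations and `n_parts=4`, this yields four sub-suites of 35 expectations each, which can then be listed as separate validations in one checkpoint.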
Spark really isn't a necessity for us; it's just a way to get guarantees on memory consumption so that we can process a large CSV. It is a means to an end, not an end unto itself.
Final Questions/Comments
Are there other mechanisms that I'm overlooking to improve job speed? Perhaps a simpler way to get multiple jobs to run concurrently?
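To illustrate the "separate checkpoints" direction mentioned above: since concurrent validations on the *same* data source aren't supported, one option might be to run fully independent validation callables (each owning its own data source / checkpoint) from a thread pool and merge the results afterwards. This is a hedged sketch with a stand-in callable, not a confirmed GX pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def run_validations_concurrently(validation_fns, max_workers=4):
    """Run independent validation callables in parallel and collect
    their results in submission order.

    Assumption: each callable wraps its own checkpoint run against
    its own data source, since sharing one source across concurrent
    validations produced Spark query errors.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn) for fn in validation_fns]
        return [f.result() for f in futures]
```

Whether the driver-side threads actually keep more Spark jobs active would need testing; the point is only that the fan-out/fan-in plumbing is small.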
One can imagine a mode of execution where expectations run over batches, and each batch computes minimal statistics/validation examples that are aggregated in a memory-bounded way (a bit like Spark does, actually, but without all the Spark overhead). This would require yet another execution engine, which is probably not what anyone wants, but I'm bringing this up more as a thought experiment to help illustrate where our thoughts are at: just a simple for loop over the data, computing minimal statistics along the way. (I know I'm oversimplifying and probably not thinking of more complex validation scenarios like cross-column and cross-row checks.)
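To make the thought experiment concrete, here is a toy single-pass validator over one numeric column, keeping only O(1) state (row count, null count, running min/max). The column name, thresholds, and function are all hypothetical; real checks (uniqueness, cross-row relations) would need more than constant state.

```python
import csv

def validate_streaming(path, column, max_null_fraction=0.0):
    """One pass over a CSV with bounded memory: only counters and
    running min/max are kept, never the rows themselves."""
    rows = nulls = 0
    lo = hi = None
    with open(path, newline="") as f:
        for rec in csv.DictReader(f):
            rows += 1
            val = rec.get(column)
            if val in (None, ""):
                nulls += 1
                continue
            x = float(val)
            lo = x if lo is None else min(lo, x)
            hi = x if hi is None else max(hi, x)
    null_fraction = nulls / rows if rows else 0.0
    return {
        "rows": rows,
        "null_fraction": null_fraction,
        "min": lo,
        "max": hi,
        "success": rows > 0 and null_fraction <= max_null_fraction,
    }
```

The appeal is that memory use is independent of file size, so a few hundred million rows is just wall-clock time, not an OOM risk.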
Thanks.