Replies: 4 comments
---
I wanted to jot down some further notes on using Observation year to partition. It's possible! But it is not a small amount of effort / caution. Some issues: A) Difficulty of grabbing a year. There are several… B) It wouldn't help the… C) Handling updates. If a year gets corrected, it's possible we'd inject a duplicate row. We'd need to add some Library-side process to detect those cases. This is also possible, but new code & practices to implement.
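To make issue C concrete, here is a toy model of why a corrected year injects a duplicate. This is plain Python standing in for a Delta Lake merge (the dict-of-lists "table" and field names are made up for illustration); the point is just that a partition-pruned merge never sees the old copy:

```python
# Toy model of issue C: if a merge only scans the partition matching the
# *new* row's year, a corrected year slips past the old copy and we end up
# with two rows for the same id.

def pruned_merge(table, row):
    """Upsert `row` into `table`, but only scan the partition for row['year'].

    `table` is a hypothetical dict: year -> list of row dicts.
    """
    partition = table.setdefault(row["year"], [])
    for existing in partition:
        if existing["id"] == row["id"]:
            existing.update(row)  # matched: update in place
            return
    partition.append(row)  # not matched (in this partition): insert

table = {}
pruned_merge(table, {"id": "123", "year": 2020, "value": 1})
# Later, the source corrects the year to 2021. Pruning misses the 2020 copy:
pruned_merge(table, {"id": "123", "year": 2021, "value": 2})

copies = sum(1 for rows in table.values() for r in rows if r["id"] == "123")
print(copies)  # -> 2: a duplicate row was injected
```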
---
The considerations for generating a year field are basically: …
---
Per a side conversation: one option, though it's not clear it is a :good: option, is to move the "we will produce FHIR that others can consume" boundary from the ETL output to an export job that happens after the data is available in Athena.
---
I just updated Cumulus ETL to use Delta Lake 3.0, which promises (and seems to deliver) faster merges: about 40% faster in my quick testing. So that helps! Liquid clustering is currently targeted at Delta Lake 3.1.
---
BCH has a very large table (`observation` is over 2TB; the next-largest, `encounter`, is just 500GB) and merges into it are quite slow (like 40-50min). Increasing our ETL batch size has helped make that less painful (merge time seems mostly tied to table size, not batch size, since the merge has to scan most of the table).

I'm just going to jot down some ideas I've investigated, for posterity and in case folks have suggestions for improvement.
### Partitioning
First is partitioning. If I could cut down the amount of the table to scan when merging, I could save a lot of time. But to save time on merges, the partition field would ideally be something that never changes, because if it did change and we didn't look at its old partition, the ETL might insert two copies of the same `id`. (Example: partitioning on `status`; merge A puts in `id: 123, status: draft` and merge B puts in `id: 123, status: final`. If the merge is partitioned to only look at `status: final`, we would insert a second `123` row.)

We could solve this with a post-merge scan, probably on the Library side. So "rarely changing" instead of "never changing" could work, i.e. maybe an observation creation timestamp or observation category could work.
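The post-merge scan boils down to grouping by `id` and flagging anything with more than one copy. In practice this would be a Spark or Athena `GROUP BY ... HAVING count(*) > 1` over the real table; this pure-Python sketch (with made-up row shapes) just shows the shape of the check:

```python
# Sketch of the "Library-side" post-merge scan: count rows per id and
# report ids that appear more than once.
from collections import Counter

def find_duplicate_ids(rows):
    """Return the sorted ids that appear in more than one row."""
    counts = Counter(row["id"] for row in rows)
    return sorted(rid for rid, n in counts.items() if n > 1)

rows = [
    {"id": "123", "status": "draft"},
    {"id": "123", "status": "final"},  # second copy injected by a bad merge
    {"id": "456", "status": "final"},
]
print(find_duplicate_ids(rows))  # -> ['123']
```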
But categories can be specified multiple times. And the `effectiveX` fields are all optional and conflicting. So it's not easy to pick just one field (though we could make a Delta Lake generated field for this and try to have it be pretty smart).

But partitions are not looking super simple anymore.
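One way that "pretty smart" field could work: pick a single year from whichever `effectiveX` variant is present, in a fixed priority order. (A real Delta Lake generated column has to be a deterministic SQL expression, e.g. a `COALESCE` over the variants; this plain-Python sketch just shows the fallback logic, with field names from FHIR R4 `Observation`.)

```python
# Hypothetical year-extraction logic for a generated partition field.
def observation_year(obs: dict):
    """Return a partition year for a FHIR Observation, or None if undated."""
    for value in (
        obs.get("effectiveDateTime"),
        obs.get("effectiveInstant"),
        (obs.get("effectivePeriod") or {}).get("start"),
    ):
        if value:  # FHIR date strings start with YYYY
            return int(value[:4])
    return None

print(observation_year({"effectiveDateTime": "2021-03-04"}))           # -> 2021
print(observation_year({"effectivePeriod": {"start": "2019-12-31"}}))  # -> 2019
print(observation_year({}))                                            # -> None
```

Note that rows with no `effectiveX` at all would still need a catch-all partition, and a corrected date still moves a row between partitions (the duplicate-injection issue above).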
### ZORDER
Next is ZORDER, which sorts every parquet file by the listed fields. Delta Lake records the min/max values of a bunch of fields in each parquet file, and sorting a file makes that min/max range smaller. This lets Delta Lake's row-skipping algorithm work better, by skipping more files, and it can work even for fields with high cardinality (like `id`, which is the field we merge on).

So I tried this, and it didn't meaningfully change merge times. ☹️
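For intuition on why ZORDER *should* have helped, here is a toy illustration of min/max file skipping. A file can be skipped when the merge key falls outside its recorded [min, max] range; sorted files have tight ranges, while unsorted files each span nearly the whole key space. The file layouts and numbers here are invented, not measured:

```python
# Toy model of Delta Lake's min/max file skipping.
def files_to_scan(file_ranges, key):
    """Count files whose [min, max] id range could contain `key`."""
    return sum(1 for lo, hi in file_ranges if lo <= key <= hi)

# Four files covering ids 0..99, written in sorted order: tight, disjoint ranges.
sorted_files = [(0, 24), (25, 49), (50, 74), (75, 99)]
# The same ids written unsorted: every file's min/max spans most of the space.
unsorted_files = [(0, 97), (1, 99), (0, 98), (2, 99)]

print(files_to_scan(sorted_files, 60))    # -> 1 (three files skipped)
print(files_to_scan(unsorted_files, 60))  # -> 4 (nothing skipped)
```

One plausible reason it didn't help here: a large merge batch touches ids spread across the whole key space, so even tight per-file ranges each match *some* incoming id and few files get skipped.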
### Liquid Clustering

The Delta Lake folks are tech-previewing something called Liquid Clustering, which aims to replace both partitioning and zordering with an even fancier algorithm. Something to try! But if zordering didn't improve things much, I don't know how hopeful to be.
### Thoughts?

If folks have any experience with huge Delta Lakes, please drop ideas.