Replies: 4 comments
---
I wanted to jot down some further notes on using Observation year to partition. It's possible! But it is not a small amount of effort / caution. Some issues: A) Difficulty of grabbing a year. There are several… B) It wouldn't help the… C) Handling updates. If a year gets corrected, it's possible we'd inject a duplicate row. We'd need to add some Library-side process to detect those cases. This is also possible, but new code & practices to implement.
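To make issue C concrete, here is a toy model of why a corrected year injects a duplicate. This is plain Python standing in for a Delta Lake merge (the dict-of-lists "table" and field names are made up for illustration); the point is just that a partition-pruned merge never sees the old copy:

```python
# Toy model of issue C: if a merge only scans the partition matching the
# *new* row's year, a corrected year slips past the old copy and we end up
# with two rows for the same id.

def pruned_merge(table, row):
    """Upsert `row` into `table`, but only scan the partition for row['year'].

    `table` is a hypothetical dict: year -> list of row dicts.
    """
    partition = table.setdefault(row["year"], [])
    for existing in partition:
        if existing["id"] == row["id"]:
            existing.update(row)  # matched: update in place
            return
    partition.append(row)  # not matched (in this partition): insert

table = {}
pruned_merge(table, {"id": "123", "year": 2020, "value": 1})
# Later, the source corrects the year to 2021. Pruning misses the 2020 copy:
pruned_merge(table, {"id": "123", "year": 2021, "value": 2})

copies = sum(1 for rows in table.values() for r in rows if r["id"] == "123")
print(copies)  # -> 2: a duplicate row was injected
```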
---
The considerations for generating a year field are basically: …
---
Per a side conversation: one option, though it's not clear it is a :good: option, is to move the "we will produce FHIR that others can consume" boundary from the ETL output to an export job that happens after the data is available in Athena.
---
I just updated Cumulus ETL to use Delta Lake 3.0, which promises (and seems to deliver) faster merges: about 40% faster in my quick testing. So that helps! Liquid clustering is currently targeted at Delta Lake 3.1.
---
BCH has a very large table (`observation` is over 2TB; the next-largest, `encounter`, is just 500GB) and merges into it are quite slow (like 40-50min). Increasing our ETL batch size has helped make that less painful (merge time seems mostly tied to table size, not batch size, since the merge has to scan most of the table).

I'm just going to jot down some ideas I've investigated, for posterity and in case folks have suggestions for improvement.
### Partitioning
First is partitioning. If I could cut down the amount of the table to scan when merging, I could save a lot of time. But to save time on merges, the partition field would ideally be something that never changes, because if it did change and we didn't look at its old partition, the ETL might insert two copies of the same `id`. (Example: partitioning on `status`; merge A puts in `id: 123, status: draft` and merge B puts in `id: 123, status: final`. If the merge is partitioned to only look at `status: final`, we would insert a second `123` row.)

We could solve this with a post-merge scan, probably on the Library side. So "rarely changing" instead of "never changing" could work, i.e. maybe an observation creation timestamp or observation category could work.
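The post-merge scan boils down to grouping by `id` and flagging anything with more than one copy. In practice this would be a Spark or Athena `GROUP BY ... HAVING count(*) > 1` over the real table; this pure-Python sketch (with made-up row shapes) just shows the shape of the check:

```python
# Sketch of the "Library-side" post-merge scan: count rows per id and
# report ids that appear more than once.
from collections import Counter

def find_duplicate_ids(rows):
    """Return the sorted ids that appear in more than one row."""
    counts = Counter(row["id"] for row in rows)
    return sorted(rid for rid, n in counts.items() if n > 1)

rows = [
    {"id": "123", "status": "draft"},
    {"id": "123", "status": "final"},  # second copy injected by a bad merge
    {"id": "456", "status": "final"},
]
print(find_duplicate_ids(rows))  # -> ['123']
```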
But categories can be specified multiple times. And the `effectiveX` fields are all optional and conflicting. So it's not easy to pick just one field (though we could make a Delta Lake generated field for this and try to have it be pretty smart).

But partitions are not looking super simple anymore.
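One way that "pretty smart" field could work: pick a single year from whichever `effectiveX` variant is present, in a fixed priority order. (A real Delta Lake generated column has to be a deterministic SQL expression, e.g. a `COALESCE` over the variants; this plain-Python sketch just shows the fallback logic, with field names from FHIR R4 `Observation`.)

```python
# Hypothetical year-extraction logic for a generated partition field.
def observation_year(obs: dict):
    """Return a partition year for a FHIR Observation, or None if undated."""
    for value in (
        obs.get("effectiveDateTime"),
        obs.get("effectiveInstant"),
        (obs.get("effectivePeriod") or {}).get("start"),
    ):
        if value:  # FHIR date strings start with YYYY
            return int(value[:4])
    return None

print(observation_year({"effectiveDateTime": "2021-03-04"}))           # -> 2021
print(observation_year({"effectivePeriod": {"start": "2019-12-31"}}))  # -> 2019
print(observation_year({}))                                            # -> None
```

Note that rows with no `effectiveX` at all would still need a catch-all partition, and a corrected date still moves a row between partitions (the duplicate-injection issue above).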
### ZORDER
Next is ZORDER, which sorts every parquet file by the listed fields. Delta Lake records the min/max values of a bunch of fields in each parquet file, and sorting a file makes that min/max range smaller. This lets Delta Lake's row-skipping algorithm work better, by skipping more files, and it can work even for fields with high cardinality (like `id`, which is the field we merge on).

So I tried this, and it didn't meaningfully change merge times. ☹️
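For intuition on why ZORDER *should* have helped, here is a toy illustration of min/max file skipping. A file can be skipped when the merge key falls outside its recorded [min, max] range; sorted files have tight ranges, while unsorted files each span nearly the whole key space. The file layouts and numbers here are invented, not measured:

```python
# Toy model of Delta Lake's min/max file skipping.
def files_to_scan(file_ranges, key):
    """Count files whose [min, max] id range could contain `key`."""
    return sum(1 for lo, hi in file_ranges if lo <= key <= hi)

# Four files covering ids 0..99, written in sorted order: tight, disjoint ranges.
sorted_files = [(0, 24), (25, 49), (50, 74), (75, 99)]
# The same ids written unsorted: every file's min/max spans most of the space.
unsorted_files = [(0, 97), (1, 99), (0, 98), (2, 99)]

print(files_to_scan(sorted_files, 60))    # -> 1 (three files skipped)
print(files_to_scan(unsorted_files, 60))  # -> 4 (nothing skipped)
```

One plausible reason it didn't help here: a large merge batch touches ids spread across the whole key space, so even tight per-file ranges each match *some* incoming id and few files get skipped.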
### Liquid Clustering

The Delta Lake folks are tech-previewing something called Liquid Clustering, which aims to replace both partitioning and zordering with an even fancier algorithm. Something to try! But if zordering didn't improve things much, I don't know how hopeful to be.
### Thoughts?

If folks have any experience with huge Delta Lakes, please drop ideas.