Cumulus ETL 2.0 released #39

mikix (Maintainer) · 2024-10-29T15:05:50Z

Cumulus ETL 2.0 is now available from Docker.

The big change is that we now require some information about where the input data came from.
This is so that (a) debugging information is available and queryable and (b) we can support "completion tracking" (see below).

Specifically, you now need to either keep your log.ndjson bulk export log alongside your input data (recommended) or manually pass --export-group and --export-timestamp to the ETL to provide that information.
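
If you drive the ETL from a script, the choice above can be handled with a small helper. This is only a rough sketch: the function name export_metadata_args is made up for illustration, and the timestamp value in the usage example assumes an ISO 8601 format. Only log.ndjson, --export-group and --export-timestamp come from this announcement.

```python
from pathlib import Path


def export_metadata_args(input_dir, group=None, timestamp=None):
    """Return the extra ETL flags needed for a given input folder.

    Hypothetical helper, not part of Cumulus ETL itself.
    """
    # If the bulk export log sits next to the data, no extra flags are needed.
    if (Path(input_dir) / "log.ndjson").is_file():
        return []
    # Otherwise the export details have to be supplied by hand.
    if group is None or timestamp is None:
        raise ValueError(
            "no log.ndjson found; provide both an export group and an export timestamp"
        )
    return ["--export-group", group, "--export-timestamp", timestamp]
```

For example, export_metadata_args("/data/export", "my-group", "2024-10-01T00:00:00Z") would give you the two flags to append to your existing ETL command line, unless /data/export already contains a log.ndjson (the path and group name here are placeholders).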

And because of the way completion tracking works, you may need to export your data in a specific order. See the bulk export docs for more details.

(While this is flagged as the 2.0 release, the ETL uses rolling releases, where usually you grab the latest code from the tip of development, so the 2.0 designation is more of a marker that we changed the CLI contract a bit. If you need to keep using the 1.x release a little longer to adapt your scripting, you can pull down the smartonfhir/cumulus-etl:1 release from Docker.)

What is Completion Tracking?

A problem that can occur with Cumulus is that long-running ETL operations disrupt researchers who are performing studies or running SQL queries.

Imagine that you haven't updated your Cumulus healthcare data in a few months. You want to go to your EHR and export all the updates in the past few months using an export with the _since parameter. Great. And you start importing the data into Athena via the ETL as you get it. So you might end up importing Conditions and Encounters before you import Observations. And any researcher running SQL queries against the database will see those inconsistent/incomplete results until you finish.

So the ETL now attaches metadata to each import of data that tags which Encounters have all their associated bits of data in Athena (and the Library looks for that metadata, ignoring any Encounters that are not yet completely ingested).

This is called completion tracking, and helps avoid the problem above. But it does require knowing a little more about where data came from and when it was exported.
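
To make the idea concrete, here is a toy sketch of how a completeness check could use the export group and timestamp. It is an illustration under assumptions, not the ETL's or the Library's actual schema or logic: the REQUIRED_TYPES set, the dictionary shapes, and the is_complete function are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical: resource types that must be loaded before an Encounter counts as complete.
REQUIRED_TYPES = {"Condition", "Observation"}


def is_complete(encounter, loaded):
    """Check whether every required resource type has caught up with an Encounter.

    encounter: dict with the export "group" and "exported" time of the Encounter.
    loaded: dict mapping (group, resource_type) to the newest export time
            that has been loaded for that type.
    """
    for resource_type in REQUIRED_TYPES:
        loaded_time = loaded.get((encounter["group"], resource_type))
        # Missing or older data means the Encounter is not fully ingested yet.
        if loaded_time is None or loaded_time < encounter["exported"]:
            return False
    return True


export_time = datetime(2024, 10, 1, tzinfo=timezone.utc)
encounter = {"group": "my-group", "exported": export_time}

# Conditions are in, Observations are not: the Encounter would be ignored for now.
loaded = {("my-group", "Condition"): export_time}
partially_loaded = is_complete(encounter, loaded)

# Once Observations from the same export arrive, the Encounter becomes visible.
loaded[("my-group", "Observation")] = export_time
fully_loaded = is_complete(encounter, loaded)
```

In this sketch, partially_loaded is False and fully_loaded is True, which mirrors the scenario above where Conditions and Encounters arrive before Observations.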
