Cumulus ETL 2.0 released #39
mikix announced in Announcements
Cumulus ETL 2.0 is now available from Docker.
The big change is that we now require some information about where the input data came from.
This is so that (a) debugging information is available and queryable and (b) we can support "completion tracking" (see below).
Specifically, you now need to either keep your `log.ndjson` bulk export log alongside your input data (recommended) or manually pass `--export-group` and `--export-timestamp` to the ETL to provide that data.
And because of the way completion tracking works, you may need to export your data in a specific order. See the bulk export docs for more details.
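As a rough illustration of the rule above, here is a minimal Python sketch of how a wrapper script might decide which extra arguments to pass. The function name `export_metadata_args` is hypothetical, and the sketch assumes that "alongside your input data" means `log.ndjson` sits at the top level of the input folder. It also makes no claim about what format the ETL expects for the timestamp value; check the bulk export docs for that.

```python
from pathlib import Path
from typing import Optional


def export_metadata_args(
    input_dir: str,
    export_group: Optional[str] = None,
    export_timestamp: Optional[str] = None,
) -> list[str]:
    """Return extra ETL arguments describing where the input data came from.

    Hypothetical helper for a wrapper script; it is not part of Cumulus ETL.
    """
    # Assumption: the bulk export log lives at the top level of the input folder.
    if (Path(input_dir) / "log.ndjson").is_file():
        # The log already carries the export information, so no flags are needed.
        return []

    # Without the log, both values have to be supplied by hand.
    if not export_group or not export_timestamp:
        raise ValueError(
            "No log.ndjson found: pass both --export-group and --export-timestamp"
        )
    return [
        "--export-group", export_group,
        "--export-timestamp", export_timestamp,
    ]
```

A script could append the returned list to whatever ETL command line it already builds, which keeps the "log file or explicit flags" decision in one place.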
(While this is flagged as the 2.0 release, the ETL uses rolling releases, where you usually grab the latest code from the tip of development, so the 2.0 designation is more of a marker that we changed the CLI contract a bit. If you need to keep using the 1.x release a little longer to adapt your scripting, you can pull down the `smartonfhir/cumulus-etl:1` release from Docker.)
What is Completion Tracking?
A problem that occurs with Cumulus is that long-running ETL operations can disrupt researchers who are performing studies or running SQL queries.
Imagine that you haven't updated your Cumulus healthcare data in a few months. You want to go to your EHR and export all the updates from the past few months using an export with the `_since` parameter. Great. And you start importing the data into Athena via the ETL as you get it. So you might end up importing Conditions and Encounters before you import Observations. And any researcher running SQL queries against the database will see those inconsistent/incomplete results until you finish.
So the ETL now attaches metadata to each import of data that tags which Encounters have all their associated bits of data in Athena (and the Library looks for that metadata, ignoring any Encounters that are not yet completely ingested).
This is called completion tracking, and helps avoid the problem above. But it does require knowing a little more about where data came from and when it was exported.
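To make the idea concrete, here is a toy Python model of the filtering step. It is only a sketch: the `Encounter` class, the `REQUIRED_RESOURCES` set, and the per-group bookkeeping are all assumptions made for illustration, not the ETL's or the Library's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encounter:
    id: str
    export_group: str  # which export this Encounter arrived in (assumed field)


# Assumption: the resource types that must be loaded before an Encounter
# counts as completely ingested. The real list may differ.
REQUIRED_RESOURCES = frozenset({"Condition", "Encounter", "Observation"})


def complete_encounters(
    encounters: list[Encounter],
    loaded_by_group: dict[str, set[str]],
    required: frozenset[str] = REQUIRED_RESOURCES,
) -> list[Encounter]:
    """Keep only Encounters whose export group has every required resource loaded."""
    return [
        enc
        for enc in encounters
        # A group with no recorded loads is treated as having loaded nothing.
        if required <= loaded_by_group.get(enc.export_group, set())
    ]


# Example: the "spring" export is fully loaded, "summer" still lacks Observations.
encounters = [Encounter("enc-1", "spring"), Encounter("enc-2", "summer")]
loaded = {
    "spring": {"Condition", "Encounter", "Observation"},
    "summer": {"Condition", "Encounter"},
}
visible = complete_encounters(encounters, loaded)
```

In this model, a query layer that only reads `visible` never sees `enc-2` until the Observations for its export group have been imported, which is the behavior the post describes.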