Some aspects of this question may not be entirely specific to `targets`. How do people handle versioning and distribution of intermediate (and final) results across different platforms?

In particular, anything downstream of the top-level inputs (functions and raw data) is potentially problematic for version control because (1) the objects may be large and (2) the objects are often binary, making version control systems less useful/granular (although not useless). [These problems interact, because large binary objects are usually not subject to differential versioning ...]

This is not a problem if the workflow is reasonably fast on all platforms, or if all collaborators are using a shared space for their files. If not (in particular, if there are specific results that one would like to cache because they are slow to recompute), is there a recommended way to put them into version control with ...? (By "version control" here I really mean "synchronizable shared file space" ...)

If there is a better forum for discussing this issue, please feel free to let me know.
Replies: 1 comment
For syncing an entire `_targets/` directory after the pipeline is finished, here are some options that come to mind:

- `aws.s3::s3sync(path = "_targets")`
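For illustration, a minimal sketch of the `s3sync()` approach (not part of the original reply): the bucket name is a placeholder, and credentials/region are assumed to come from the usual AWS environment variables.

```r
# Minimal sketch, assuming the {aws.s3} package and an existing S3 bucket.
# "my-project-bucket" is a placeholder.
library(aws.s3)

# After tar_make() on the machine that built the pipeline:
s3sync(path = "_targets", bucket = "my-project-bucket", direction = "upload")

# On a collaborator's machine, before running the pipeline:
s3sync(path = "_targets", bucket = "my-project-bucket", direction = "download")
```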
Another option is to push only `_targets/meta/meta` to GitHub and use the target-by-target AWS S3 integration described at https://books.ropensci.org/targets/cloud.html for the data in `_targets/objects/`, but the API/bandwidth costs could add up if you have a lot of targets.
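A minimal sketch of how that target-by-target integration might be configured in `_targets.R` (an assumption based on the linked chapter, not part of the original reply; the bucket, prefix, and example targets are placeholders, and the exact options depend on the `targets` version):

```r
# _targets.R -- minimal sketch, assuming a recent {targets} release
# that supports the `repository` option. Bucket/prefix are placeholders.
library(targets)

tar_option_set(
  repository = "aws",  # store each target's object data in S3
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-project-bucket", prefix = "_targets")
  )
)

list(
  tar_target(raw_data, read.csv("data/raw.csv")),  # hypothetical input
  tar_target(fit, lm(y ~ x, data = raw_data))      # hypothetical analysis
)
```

With a setup like this, the small text file `_targets/meta/meta` is the only part of the data store that needs to be committed to Git; each target's object data lives in the bucket.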
Data versioning as a feature is outside the scope of `targets`, but I did redesign the data store to make it more amenable…