-
Fair enough. My use case is that I might generate a huge target that takes 2 hours to create. Now, I want to re-use that across multiple projects, maybe even across multiple users. I could just save it, move it to a separate location, and then load it as a separate file in a different project, since the target file is just a serialized version of the object (right?).

So, what I'm proposing is simpler than what you're suggesting above -- it would be super convenient to have this ability built in to `targets`. You wouldn't allow the ability to generate targets in a separate location. So, what about a simplified version of "remote targets" that isn't a full-blown customizable store? If you allowed `targets` to 1) read targets from a parent location, and 2) copy them to an external folder from local, that would do it, I think, while avoiding many of the issues above. Say, something like a setting that points at a parent store for reading; then a new function could handle the copying out.

Point 2 is solved, I think, in this scenario. It's not full-blown sharing, it's... deliberate sharing only. Would that work?
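For concreteness, a minimal sketch of the manual save-and-move workaround described above (the shared path is made up):

```r
library(targets)

# In project A, after tar_make(): export the expensive target's value.
saveRDS(tar_read(big_target), "/shared/cache/big_target.rds")

# In project B: load it back as an ordinary file, with no recomputation.
big_target <- readRDS("/shared/cache/big_target.rds")
```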
-
Thanks @wlandau and @nsheff, that has been a very helpful discussion. (@softloud you should take a look at this.)

I have a use case where I want to run some number of separate computational experiments and also publish a synthesis of the set of experiments. Think of it as being like a thesis: there are multiple non-trivial experiments which may be written up separately (say, for conference papers), then a separate, non-trivial exercise to write up the experiments as a connected whole (the thesis). Each experiment would be implemented as an independent `targets` project. I envisage the synthesis publication being implemented as a separate `targets` project that reads results from the experiment projects.

Unlike @nsheff's use case, I am not particularly focussed on sharing large data objects; rather, I am focussed on enabling read-only, cross-project data flows to support carving a large super-project into manageable, loosely connected sub-projects. I had initially been thinking I might have to dynamically symlink to the other projects' data stores.

I will look at unitar for inspiration. For my use case I won't need unitar's priority list of projects to search, because I would always be explicitly referring to a specific project.
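In hindsight, the `store` argument that eventually shipped (see the update further down and #407) supports exactly this kind of read-only flow. A sketch, with made-up project paths and a made-up target name:

```r
library(targets)

# In the synthesis project: read finished results straight out of each
# experiment's data store, without rerunning the experiments.
fit_a <- tar_read(model_fit, store = "../experiment_a/_targets")
fit_b <- tar_read(model_fit, store = "../experiment_b/_targets")
```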
-
Great work, @nsheff! With some minor extensions, it might also allow some projects to take artifacts from other projects as input in the pipeline. If project B pulls from project A, it would be ideal if project B automatically reruns some targets when the upstream files from project A change. Thinking out loud: you could have one file-tracking target with `format = "file"` that watches project A's output, so project B's downstream targets invalidate whenever that output changes. See the sketch below.
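A sketch of what project B's pipeline could look like under this idea (the path and `analyze()` are placeholders):

```r
# _targets.R in project B
library(targets)
list(
  # Track the file project A exports; targets hashes it on every run.
  tar_target(upstream_file, "../project_a/export/result.rds", format = "file"),
  # Reruns automatically whenever the hash of upstream_file changes.
  tar_target(analysis, analyze(readRDS(upstream_file)))
)
```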
-
Update: I actually ended up implementing the ability to set the data store to paths other than `_targets/` (see #407). The use cases just kept piling up, and RStudio Connect was a big one. I still do not like this feature because we lose one of the guardrails protecting reproducibility, but it is no longer possible to avoid. And I think the `_targets.yaml` / `tar_config_set()` / `tar_config_get()` interface handles this as safely as we can hope.
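For anyone landing here later, the interface in question looks like this (the store path is illustrative):

```r
library(targets)
tar_config_set(store = "custom_store")  # persisted in _targets.yaml
tar_config_get("store")
#> [1] "custom_store"
```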
-
FYI: Target Markdown makes it super easy to have different sub-pipelines with different file systems: https://books.ropensci.org/targets/markdown.html. Just set a different target script and data store for each sub-pipeline, e.g. with `tar_config_set()`.
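A sketch of per-sub-pipeline configuration along these lines (the script and store names are made up):

```r
library(targets)

# Sub-pipeline A gets its own target script and data store.
tar_config_set(script = "pipeline_a.R", store = "store_a")
tar_make()

# Sub-pipeline B is configured and run independently.
tar_config_set(script = "pipeline_b.R", store = "store_b")
tar_make()
```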
-
With Target Markdown, …
-
**Update 2021-04-08**

I actually ended up implementing the capability to set the data store to paths other than `_targets/` (see #407). The use cases just kept piling up, and RStudio Connect was a big one. I still do not like this feature because we lose one of the guardrails protecting reproducibility, but it is no longer possible to avoid. And I think the `_targets.yaml` / `tar_config_set()` / `tar_config_get()` interface handles this as safely as we can hope.

**Initial thoughts**
A pipeline's data store is always a folder named `_targets/` at the project root. Understandably, some users want to set the path to something other than `_targets/`. However, the perils and limitations would be too egregious, and the benefits would not go far enough.

1. The `_targets/` convention enforces standardization, and thus readability, reproducibility, transparency, and maintainability.
2. A custom path makes it tempting to share one data store across projects or users, but then multiple `tar_make()`'s could write to it simultaneously, and race conditions would constantly corrupt the output. Sharing requires serious version control, e.g. Git/GitHub for small projects and Git LFS for medium ones. (Maybe DVC versioning for large projects, as mentioned here.)
3. `tar_read()` runs in the current R session, whereas `tar_make()` does its work in a reproducible external `callr` process. Both R sessions need access to the same data store, and ensuring agreement would be awkward and brittle if the store path were custom. I see no good way to do this. If the custom path were an argument to `tar_make()` and `tar_read()`, then the user would need to manually set it every time, and this is too easy to forget. An `.Renviron` file could send the same path to both processes, which gets us closer to a solution. However, there is still room for confusion and careless errors from ad hoc calls to `Sys.setenv()`. In addition, the `.Renviron` approach would not cover Compatibility with Shiny (#291), the most compelling use case. Shiny apps would need to overwrite their own `.Renviron` files at runtime (depending on user-specific project storage), which will never be possible in production.
4. `targets` tracks the files in `_targets/objects/` just like the dynamic files you declare with `tar_target(format = "file")`. In other words, internally, all files are dynamic files relative to the project root. If the store path were custom and you changed this path mid-project, it would invalidate most of your targets. So either (1) custom paths would be disappointing, or (2) the store would need a non-back-compatible redesign and would be more difficult for me to maintain in the end.
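The `tar_read()`/`tar_make()` agreement problem above is also why the eventual solution went through `_targets.yaml`: both the interactive session and the external `callr` process resolve the store path from the same config file. A sketch (`some_target` is a placeholder):

```r
library(targets)
tar_config_set(store = "custom_store")  # both processes read _targets.yaml
tar_make()              # the callr process writes to custom_store/
tar_read(some_target)   # the interactive session reads from custom_store/
```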