use_data_repository #15

Closed
benmarwick opened this issue Jul 24, 2017 · 27 comments

@benmarwick
Owner

There might be a place for a `use_zenodo()`: https://github.com/ropensci/zenodo/blob/master/README.md

@MartinHinz
Contributor

Great idea!
The workflow would then be:

  • Download the Docker file from Docker Hub
  • Push that file to Zenodo

This way we would not need any additional dependencies on the local machine, but we would be dependent on Docker Hub to process the files for us!?

@dakni
Contributor

dakni commented Jul 25, 2017

Indeed!

You can log in to Zenodo using GitHub; an access token is easily generated under "Applications".

@MartinHinz why use the file from Docker Hub? When the Docker container is successfully built on Travis, one can use that one directly. Or did I misunderstand some part of the workflow? This way one does not need to store further environment variables on Travis.

@MartinHinz
Contributor

Right, damn, correct. So it is accessed from Travis, not from Docker Hub. My mistake.

@dakni
Contributor

dakni commented Jul 25, 2017

Here is a nice blog post on Zenodo and GitHub [though without the API]: http://computationalproteomic.blogspot.de/2014/08/making-your-code-citable.html

--> Basically there is more to archiving than just uploading. Since we want a DOI etc., perhaps one should write up detailed instructions on how to connect to Zenodo, and create a function that builds a container ready for upload to Zenodo? (A sketch of what such a function might do is below.)
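
A minimal sketch of that idea, assuming the jsonlite package and the `.zenodo.json` metadata convention that Zenodo's GitHub integration uses; the function name and the metadata fields are illustrative only:

```r
# Hypothetical helper (not in rrtools): write Zenodo metadata and zip the
# compendium so it is ready for a manual or scripted upload.
bundle_for_zenodo <- function(path = ".", title, creators) {
  # Zenodo's GitHub integration reads metadata from a .zenodo.json file,
  # so writing one makes the archive self-describing
  metadata <- list(
    title = title,
    creators = lapply(creators, function(x) list(name = x)),
    upload_type = "software"
  )
  jsonlite::write_json(metadata, file.path(path, ".zenodo.json"),
                       auto_unbox = TRUE, pretty = TRUE)

  # zip the whole compendium into a single archive
  archive <- file.path(tempdir(),
                       paste0(basename(normalizePath(path)), ".zip"))
  old <- setwd(path)
  on.exit(setwd(old))
  utils::zip(zipfile = archive, files = ".")
  archive
}
```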

btw: the rOpenSci package throws an error when trying to create a repo [my Zenodo token is picked up by default]:

```r
zen_create("test")
Error in handle_url(handle, url, ...) : 
  Must specify at least one of url or handle
```

@MartinHinz
Contributor

Ignore my last post; I'm still not fully familiar with the Docker concept, I suppose.

@benmarwick
Owner Author

I was imagining this being an infrequent, deliberate, action, not part of the continuous integration cycle.

For example, when you submit your article for peer review, you run use_zenodo() to create a JSON metadata file, create a repo on https://www.zenodo.org/, and push the whole project to that repo. Then you get a snapshot of the project at that moment, with a DOI to put in the text of the paper.

Then, after peer review, when your paper is accepted 🎉, you run use_zenodo() again to update the repo with the final file set. I think Zenodo has versioned repos, so you can have the same DOI for the repo, but different hashes for each version.
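
A rough sketch of what use_zenodo() could do under the hood, using httr against Zenodo's documented REST deposit API; the function itself is hypothetical, and the creator metadata is a placeholder:

```r
library(httr)

# Hypothetical use_zenodo(): deposit a zipped compendium and mint a DOI.
use_zenodo <- function(archive, title, token = Sys.getenv("ZENODO_TOKEN")) {
  base <- "https://zenodo.org/api/deposit/depositions"

  # 1. create an empty deposition and read back its id
  dep <- POST(base, query = list(access_token = token),
              body = "{}", content_type_json())
  stop_for_status(dep)
  dep_id <- content(dep)$id

  # 2. attach the zipped compendium to the deposition
  up <- POST(paste0(base, "/", dep_id, "/files"),
             query = list(access_token = token),
             body = list(file = upload_file(archive)))
  stop_for_status(up)

  # 3. add the minimal metadata Zenodo requires before publishing
  #    (the creator name is a placeholder)
  meta <- list(metadata = list(
    title = title, upload_type = "software", description = title,
    creators = list(list(name = "Doe, Jane"))
  ))
  PUT(paste0(base, "/", dep_id), query = list(access_token = token),
      body = jsonlite::toJSON(meta, auto_unbox = TRUE), content_type_json())

  # 4. publish the deposition, which mints the DOI
  pub <- POST(paste0(base, "/", dep_id, "/actions/publish"),
              query = list(access_token = token))
  stop_for_status(pub)
  content(pub)$doi
}
```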

@MartinHinz
Contributor

MartinHinz commented Jul 25, 2017

From what I read, Zenodo archiving something from GitHub is tied to making a release. And if the connection exists, Zenodo then makes a snapshot of every released version. So is this the way for us to go?
See e.g. ropensci/RNeXML#96

@benmarwick
Owner Author

I think we can do it directly from our console to Zenodo. But it looks like the zenodo package does not actually have any functions we can use yet (karthik/zenodo#14). So let's put this on hold until that pkg gets a bit more love.

@nevrome
Collaborator

nevrome commented Jul 25, 2017

Just a minor comment, independent of the actual implementation: in my opinion, functions like use_zenodo() should have a certain threshold of control questions, like devtools::release(). It's a pretty big thing to release a paper on Zenodo. A set of questions can prevent accidental and immature releases. (A sketch of such a gate is below.)
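
A minimal sketch of that kind of confirmation gate, modeled loosely on devtools::release(); the questions and the function name are illustrative:

```r
# Hypothetical pre-release gate: abort unless every question is answered "y"
confirm_release <- function() {
  questions <- c(
    "Have you knitted the manuscript from a clean session?",
    "Is the metadata (title, authors, licence) up to date?",
    "Are you ready to mint a DOI for this version?"
  )
  for (q in questions) {
    if (tolower(readline(paste0(q, " (y/n) "))) != "y") {
      message("Release aborted.")
      return(invisible(FALSE))
    }
  }
  invisible(TRUE)
}

# use_zenodo() could then begin with:
# if (!confirm_release()) return(invisible(NULL))
```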

@benmarwick
Owner Author

Yes, that is an excellent suggestion, I agree. I guess that @karthik already has something like that in mind for `zenodo::zen_file_publish()`.

@MartinHinz
Contributor

Wouldn't the most convenient way in our case be a two-step process, with

  • one function (e.g. rrtools::use_zenodo()) to establish a connection between the GitHub repo and Zenodo,
  • and another function (e.g. rrtools::make_github_release()) to trigger a release on GitHub (I guess), which then
    • [automatically triggers a Zenodo publish] if the connection is established,
    • [does nothing] otherwise? (See the sketch after this list.)
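
A sketch of that second step using the gh package to hit GitHub's releases endpoint; the function name and arguments are illustrative. If the repo is linked to Zenodo, the release webhook triggers the archive automatically:

```r
# Hypothetical make_github_release(): create a GitHub release via the REST
# API. A Zenodo-linked repo gets archived by Zenodo's webhook on release.
make_github_release <- function(owner, repo, tag = "v1.0.0",
                                title = tag, notes = "") {
  gh::gh(
    "POST /repos/{owner}/{repo}/releases",
    owner = owner,
    repo = repo,
    tag_name = tag,   # the tag Zenodo will snapshot
    name = title,
    body = notes
  )
}
```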

@benmarwick
Owner Author

Yes, that could work. It seems more natural to me to connect from the local repo on my computer directly to Zenodo, without counting on GitHub in the middle. That would be simpler and more flexible, to me at least. But let's see what direction they take with the zenodo pkg as it develops further.

@MartinHinz
Contributor

I'll give you that. The whole thing is centered around GitHub, so it seemed natural to me to use the existing Zenodo <-> GitHub link to make it happen. But thinking about it, you can at least use use_compendium, use_mit_license, use_readme_rmd and use_analysis, and also use_testthat, without any GitHub integration.

So you are surely right that we should not make this dependent on a GitHub repo being in existence.

@karthik

karthik commented Jul 26, 2017

> The whole thing is centered around Github

And the plan is to take advantage of all of that. The most recent project I worked on with Kirill Muller is travis and tic (both of which are available as beta releases on ropenscilabs). In short, you can set up a recipe for Zenodo (leveraging the zenodo package) to create a release for software and data at whatever interval you like (with versioning support). Once you've set it up and authorized a token, it should just run for any project. (A sketch of what such a recipe could look like is below.)
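
A sketch of what a `tic.R` recipe for this might look like, assuming tic's get_stage()/add_code_step() DSL; the deposit helper is hypothetical (none of the recipes linked below ship a Zenodo step):

```r
# tic.R — hypothetical Zenodo deploy recipe
get_stage("deploy") %>%
  add_code_step(
    # only deposit on tagged builds, so archiving stays a deliberate act
    if (nzchar(Sys.getenv("TRAVIS_TAG"))) {
      deposit_to_zenodo(          # hypothetical helper, cf. sketches above
        archive = "compendium.zip",
        token   = Sys.getenv("ZENODO_TOKEN")
      )
    }
  )
```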

@benmarwick
Owner Author

benmarwick commented Jul 27, 2017

Thanks Karthik! Are there any of these recipes around for us to take a look at?

We should also consider here:

@karthik

karthik commented Jul 27, 2017

Hi @benmarwick!
There are several recipes to look at now, though none relate to data in particular. But it would be the same logic: apart from S3 class support in a common package (to support data deposition), all the functionality will come from individual packages.

A few recipes to consider:

  • A tic file for automatic pkgdown docs: https://github.com/krlmlr/tic.package
  • Automatic deployment to drat: https://github.com/krlmlr/tic.drat
  • An R Markdown site: https://github.com/krlmlr/tic.website
  • An automatic bookdown book: https://github.com/krlmlr/tic.bookdown

Great idea to add figshare and dataverse. figshare might be a challenge because they have never prioritized their API and are solely focused on enterprise customers. But we can try.

@benmarwick changed the title from use_zenodo to use_data_repository on Jul 27, 2017
@benmarwick
Owner Author

Thanks again, those are useful to see.

Currently I think we want to be pushing our compendium to a data repo independently of our actions on GitHub and Travis.

As @nevrome notes above, making a deposit to a data repo should be a deliberate, infrequent action in the life of a project, so we want it to be separate from the push-to-github-trigger-travis process. I'm imagining this just happens 1-3 times in the life of a project.

My guess is that we could have something like a use_data_repository(what = ".", repo = c("figshare", "dataverse", "osf", "zenodo")) function, which triggers some follow-through steps in the console for the user to confirm some details and supply information to create metadata, before pushing to the data repository. Depending on the repo chosen by the user, we then get the other pkgs to do the heavy lifting with the repo API. A rough skeleton is sketched below.
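
A rough skeleton of that idea (entirely hypothetical; the deposit calls are placeholders for the real client packages):

```r
# Hypothetical dispatcher: confirm details interactively, then hand off to
# the package that wraps the chosen repository's API.
use_data_repository <- function(what = ".",
                                repo = c("figshare", "dataverse",
                                         "osf", "zenodo")) {
  repo <- match.arg(repo)

  # follow-through step in the console to collect metadata
  title <- readline("Title for this deposit: ")

  # the heavy lifting would be done by the existing client packages
  switch(repo,
    figshare  = message("would deposit '", title, "' via the rfigshare pkg"),
    dataverse = message("would deposit '", title, "' via the dataverse pkg"),
    osf       = message("would deposit '", title, "' via the osfr pkg"),
    zenodo    = message("would deposit '", title, "' via the zenodo pkg")
  )
}
```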

What does everyone think?

@benmarwick
Owner Author

I was just reminded by @steko about https://frictionlessdata.io/ and pkgs https://github.com/ropenscilabs/datapkg and https://github.com/christophergandrud/dpmr

These look neat, though I've not seen them in use in the wild anywhere, and their download stats are modest.

Has anyone else come across these in the research literature? Worth mentioning in the readme?

@nevrome
Collaborator

nevrome commented Oct 11, 2017

Never seen this. But it seems to be a thing. So why not?

@benmarwick
Owner Author

benmarwick commented Jan 3, 2018

Bookmarking this rOpenSci discussion that just appeared: ropensci-archive/doidata#1. Seems like they might be about to develop a pkg that will answer many of our needs here. Some discussion on Twitter at https://twitter.com/noamross/status/948340525492555776

Hopefully that pkg will contain a function to deposit data and obtain a DOI (to a variety of repositories), although I guess that task might be much more complex than getting data using a DOI.

@noamross

noamross commented Jan 3, 2018

@benmarwick Right now we're just thinking about downloading the data given a DOI.

@januz

januz commented Oct 16, 2018

Has there been progress on this front? What are your current recommendations for getting a DOI associated with the state of a compendium when the corresponding manuscript was submitted/revised/published? Thanks!

@benmarwick
Owner Author

We haven't seen any recent developments that have made automating this step simpler or more obvious to implement. There is so much variation in current practice that it's hard to know which defaults make the most sense.

My current recommendation is to use a hook provided by the data repository service (e.g. Zenodo, OSF and Figshare have this) to connect to the GitHub repo containing the compendium, and then snapshot a version of the GH repo on the data repo at key points (in OSF this is called 'registering', or freezing a version of the repo). I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then I snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily. The repo versions can be tagged with keywords to indicate what part of the process they relate to.

This all happens outside of R. And for me, at least, it's something I do infrequently, just a few times per year. So it's not urgent for me to automate or highly streamline these steps at the moment. But I'm keen to hear how others imagine these steps could be incorporated into a function! (One possible starting point is sketched below.)
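
For the part that can be scripted, a sketch with the osfr package; the project GUID is a placeholder, and registering (freezing) a version still happens in the OSF web interface:

```r
library(osfr)

# upload the current state of the compendium to an existing OSF project;
# registration/freezing is then done manually on osf.io
osf_auth(Sys.getenv("OSF_PAT"))
project <- osf_retrieve_node("abcde")   # placeholder project GUID
osf_upload(project, path = ".", recurse = TRUE, conflicts = "overwrite")
```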

@januz

januz commented Oct 16, 2018

@benmarwick Thank you so much for your detailed explanations. I agree that this process is probably something that can/should be done deliberately and "manually".

@benmarwick
Owner Author

Yes, for now manual handling seems like the best option for this step, at least as far as I can see. I'm curious to see what might pop up in the future to change my mind!

@januz

januz commented Dec 3, 2018

> I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily

@benmarwick Sorry to follow up so late, but I just tested out registration/freezing of an OSF project with an associated GitHub repository. From what I can see, the DOIs of the different registrations are not the same. Instead, the project has a fixed DOI and each registration has a different one.

I understood you to mean that you publish the DOI of the first registration you create. Did you instead mean that you share the project's DOI (which stays the same), and people can then navigate to the "Registrations" tab to see the different registrations/snapshots that exist for the project?

Thanks!

@benmarwick
Owner Author

Let's include the approaches discussed here as an informational final step in the README, to suggest how the user can archive their compendium on a data repo of their choosing, cf. #56
