use_data_repository #15

Closed
benmarwick opened this issue Jul 24, 2017 · 27 comments

@benmarwick
Owner

There might be a place for a `use_zenodo()`: https://github.com/ropensci/zenodo/blob/master/README.md

@MartinHinz
Contributor

Great idea!
The workflow would then be:

  • Download the Docker file from Docker Hub
  • Push that file to Zenodo

This way we would not need any additional dependencies on the local machine, but we would be dependent on Docker Hub to process the files for us!?

@dakni
Contributor

dakni commented Jul 25, 2017

Indeed!

You can log in to Zenodo using GitHub; an access token is easily generated under "Applications".

@MartinHinz why use the file from Docker Hub? When the Docker container is successfully built on Travis, one can use that one directly. Or did I misunderstand some part of the workflow? This way one does not need to store further environment variables on Travis.

@MartinHinz
Contributor

Right, damn, correct. So it is accessed from Travis, not from Docker Hub. My mistake.

@dakni
Contributor

dakni commented Jul 25, 2017

Here is a nice blog post on Zenodo and GitHub [though without the API]: http://computationalproteomic.blogspot.de/2014/08/making-your-code-citable.html

--> Basically there is more to archiving than just uploading. Since we want a DOI etc., perhaps one should write up detailed instructions on how to connect to Zenodo, and create a function that builds a container ready for upload to Zenodo? (A sketch of what such a function might do is below.)
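
A minimal sketch of that idea, assuming the jsonlite package and the `.zenodo.json` metadata convention that Zenodo's GitHub integration uses; the function name and the metadata fields are illustrative only:

```r
# Hypothetical helper (not in rrtools): write Zenodo metadata and zip the
# compendium so it is ready for a manual or scripted upload.
bundle_for_zenodo <- function(path = ".", title, creators) {
  # Zenodo's GitHub integration reads metadata from a .zenodo.json file,
  # so writing one makes the archive self-describing
  metadata <- list(
    title = title,
    creators = lapply(creators, function(x) list(name = x)),
    upload_type = "software"
  )
  jsonlite::write_json(metadata, file.path(path, ".zenodo.json"),
                       auto_unbox = TRUE, pretty = TRUE)

  # zip the whole compendium into a single archive
  archive <- file.path(tempdir(),
                       paste0(basename(normalizePath(path)), ".zip"))
  old <- setwd(path)
  on.exit(setwd(old))
  utils::zip(zipfile = archive, files = ".")
  archive
}
```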

btw: the rOpenSci package throws an error when trying to create a repo [my Zenodo token is picked up by default]:

```r
zen_create("test")
Error in handle_url(handle, url, ...) : 
  Must specify at least one of url or handle
```

@MartinHinz
Contributor

Ignore my last post; I'm still not fully familiar with the Docker concept, I suppose.

@benmarwick
Owner Author

I was imagining this being an infrequent, deliberate, action, not part of the continuous integration cycle.

For example, when you submit your article for peer review, you run use_zenodo() to create a JSON metadata file, create a repo on https://www.zenodo.org/, and push the whole project to that repo. Then you get a snapshot of the project at that moment, with a DOI to put in the text of the paper.

Then, after peer review, when your paper is accepted 🎉, you run use_zenodo() again to update the repo with the final file set. I think Zenodo has versioned repos, so you can have the same DOI for the repo, but different hashes for each version.
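
A rough sketch of what use_zenodo() could do under the hood, using httr against Zenodo's documented REST deposit API; the function itself is hypothetical, and the creator metadata is a placeholder:

```r
library(httr)

# Hypothetical use_zenodo(): deposit a zipped compendium and mint a DOI.
use_zenodo <- function(archive, title, token = Sys.getenv("ZENODO_TOKEN")) {
  base <- "https://zenodo.org/api/deposit/depositions"

  # 1. create an empty deposition and read back its id
  dep <- POST(base, query = list(access_token = token),
              body = "{}", content_type_json())
  stop_for_status(dep)
  dep_id <- content(dep)$id

  # 2. attach the zipped compendium to the deposition
  up <- POST(paste0(base, "/", dep_id, "/files"),
             query = list(access_token = token),
             body = list(file = upload_file(archive)))
  stop_for_status(up)

  # 3. add the minimal metadata Zenodo requires before publishing
  #    (the creator name is a placeholder)
  meta <- list(metadata = list(
    title = title, upload_type = "software", description = title,
    creators = list(list(name = "Doe, Jane"))
  ))
  PUT(paste0(base, "/", dep_id), query = list(access_token = token),
      body = jsonlite::toJSON(meta, auto_unbox = TRUE), content_type_json())

  # 4. publish the deposition, which mints the DOI
  pub <- POST(paste0(base, "/", dep_id, "/actions/publish"),
              query = list(access_token = token))
  stop_for_status(pub)
  content(pub)$doi
}
```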

@MartinHinz
Contributor

MartinHinz commented Jul 25, 2017

From what I read, Zenodo archiving something from GitHub is tied to making a release. And if the connection exists, Zenodo then makes a snapshot of every released version. So is this the way for us to go?
See e.g. ropensci/RNeXML#96

@benmarwick
Owner Author

I think we can do it directly from our console to Zenodo. But it looks like the zenodo package does not actually have any functions we can use yet (karthik/zenodo#14). So let's put this on hold until that pkg gets a bit more love.

@nevrome
Collaborator

nevrome commented Jul 25, 2017

Just a minor comment, independent of the actual implementation: in my opinion, functions like use_zenodo() should have a certain threshold of control questions, like devtools::release(). It's a pretty big thing to release a paper on Zenodo. A set of questions can prevent accidental and immature releases. (A sketch of such a gate is below.)
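
A minimal sketch of that kind of confirmation gate, modeled loosely on devtools::release(); the questions and the function name are illustrative:

```r
# Hypothetical pre-release gate: abort unless every question is answered "y"
confirm_release <- function() {
  questions <- c(
    "Have you knitted the manuscript from a clean session?",
    "Is the metadata (title, authors, licence) up to date?",
    "Are you ready to mint a DOI for this version?"
  )
  for (q in questions) {
    if (tolower(readline(paste0(q, " (y/n) "))) != "y") {
      message("Release aborted.")
      return(invisible(FALSE))
    }
  }
  invisible(TRUE)
}

# use_zenodo() could then begin with:
# if (!confirm_release()) return(invisible(NULL))
```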

@benmarwick
Owner Author

Yes, that is an excellent suggestion, I agree. I guess that @karthik already has something like that in mind for `zenodo::zen_file_publish()`.

@MartinHinz
Contributor

Wouldn't the most convenient way in our case be a two-step process, with

  • one function (e.g. rrtools::use_zenodo()) to establish a connection between the GitHub repo and Zenodo,
  • and another function (e.g. rrtools::make_github_release()) to trigger a release on GitHub (I guess), which then
    • [automatically triggers a Zenodo publish] if the connection is established,
    • [does nothing] otherwise? (See the sketch after this list.)
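
A sketch of that second step using the gh package to hit GitHub's releases endpoint; the function name and arguments are illustrative. If the repo is linked to Zenodo, the release webhook triggers the archive automatically:

```r
# Hypothetical make_github_release(): create a GitHub release via the REST
# API. A Zenodo-linked repo gets archived by Zenodo's webhook on release.
make_github_release <- function(owner, repo, tag = "v1.0.0",
                                title = tag, notes = "") {
  gh::gh(
    "POST /repos/{owner}/{repo}/releases",
    owner = owner,
    repo = repo,
    tag_name = tag,   # the tag Zenodo will snapshot
    name = title,
    body = notes
  )
}
```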

@benmarwick
Owner Author

Yes, that could work. It seems more natural to me to connect from the local repo on my computer directly to Zenodo, without counting on GitHub in the middle. That would be simpler and more flexible, to me at least. But let's see what direction they take with the zenodo pkg as it develops further.

@MartinHinz
Contributor

I'll give you that. The whole thing is centered around GitHub, so it seemed natural to me to use the existing Zenodo <-> GitHub link to make it happen. But thinking about it, you can at least use use_compendium, use_mit_license, use_readme_rmd and use_analysis, and also use_testthat, without any GitHub integration.

So you are surely right that we should not make this dependent on a GitHub repo being in existence.

@karthik

karthik commented Jul 26, 2017

> The whole thing is centered around Github

And the plan is to take advantage of all of that. The most recent project I worked on with Kirill Muller is travis and tic (both of which are available as beta releases on ropenscilabs). In short, you can set up a recipe for Zenodo (leveraging the zenodo package) to create a release for software and data at whatever interval you like (with versioning support). Once you've set it up and authorized a token, it should just run for any project. (A sketch of what such a recipe could look like is below.)
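
A sketch of what a `tic.R` recipe for this might look like, assuming tic's get_stage()/add_code_step() DSL; the deposit helper is hypothetical (none of the recipes linked below ship a Zenodo step):

```r
# tic.R — hypothetical Zenodo deploy recipe
get_stage("deploy") %>%
  add_code_step(
    # only deposit on tagged builds, so archiving stays a deliberate act
    if (nzchar(Sys.getenv("TRAVIS_TAG"))) {
      deposit_to_zenodo(          # hypothetical helper, cf. sketches above
        archive = "compendium.zip",
        token   = Sys.getenv("ZENODO_TOKEN")
      )
    }
  )
```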

@benmarwick
Owner Author

benmarwick commented Jul 27, 2017

Thanks Karthik! Are there any of these recipes around for us to take a look at?

We should also consider here:

@karthik

karthik commented Jul 27, 2017

Hi @benmarwick!
There are several recipes to look at now, though none relate to data in particular. But it would be the same logic: apart from S3 class support in a common package (to support data deposition), all the functionality will come from individual packages.

A few recipes to consider:

  • A tic file for automatic pkgdown docs: https://github.com/krlmlr/tic.package
  • Automatic deployment to drat: https://github.com/krlmlr/tic.drat
  • An R Markdown site: https://github.com/krlmlr/tic.website
  • An automatic bookdown book: https://github.com/krlmlr/tic.bookdown

Great idea to add figshare and dataverse. figshare might be a challenge because they have never prioritized their API and are solely focused on enterprise customers. But we can try.

@benmarwick changed the title from use_zenodo to use_data_repository on Jul 27, 2017
@benmarwick
Owner Author

Thanks again, those are useful to see.

Currently I think we want to be pushing our compendium to a data repo independently of our actions on GitHub and Travis.

As @nevrome notes above, making a deposit to a data repo should be a deliberate, infrequent action in the life of a project, so we want it to be separate from the push-to-github-trigger-travis process. I'm imagining this just happens 1-3 times in the life of a project.

My guess is that we could have something like a use_data_repository(what = ".", repo = c("figshare", "dataverse", "osf", "zenodo")) function, which triggers some follow-through steps in the console for the user to confirm some details and supply information to create metadata, before pushing to the data repository. Depending on the repo chosen by the user, we then get the other pkgs to do the heavy lifting with the repo API. A rough skeleton is sketched below.
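
A rough skeleton of that idea (entirely hypothetical; the deposit calls are placeholders for the real client packages):

```r
# Hypothetical dispatcher: confirm details interactively, then hand off to
# the package that wraps the chosen repository's API.
use_data_repository <- function(what = ".",
                                repo = c("figshare", "dataverse",
                                         "osf", "zenodo")) {
  repo <- match.arg(repo)

  # follow-through step in the console to collect metadata
  title <- readline("Title for this deposit: ")

  # the heavy lifting would be done by the existing client packages
  switch(repo,
    figshare  = message("would deposit '", title, "' via the rfigshare pkg"),
    dataverse = message("would deposit '", title, "' via the dataverse pkg"),
    osf       = message("would deposit '", title, "' via the osfr pkg"),
    zenodo    = message("would deposit '", title, "' via the zenodo pkg")
  )
}
```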

What does everyone think?

@benmarwick
Owner Author

I was just reminded by @steko about https://frictionlessdata.io/ and pkgs https://github.com/ropenscilabs/datapkg and https://github.com/christophergandrud/dpmr

These look neat, though I've not seen them in use in the wild anywhere, and their download stats are modest.

Has anyone else come across these in the research literature? Worth mentioning in the readme?

@nevrome
Collaborator

nevrome commented Oct 11, 2017

Never seen this. But it seems to be a thing. So why not?

@benmarwick
Owner Author

benmarwick commented Jan 3, 2018

Bookmarking this rOpenSci discussion that just appeared: ropensci-archive/doidata#1. Seems like they might be about to develop a pkg that will answer many of our needs here. Some discussion on Twitter at https://twitter.com/noamross/status/948340525492555776

Hopefully that pkg will contain a function to deposit data and obtain a DOI (to a variety of repositories), although I guess that task might be much more complex than getting data using a DOI.

@noamross

noamross commented Jan 3, 2018

@benmarwick Right now we're just thinking about downloading the data given a DOI.

@januz

januz commented Oct 16, 2018

Has there been progress on this front? What are your current recommendations for getting a DOI associated with the state of a compendium when the corresponding manuscript was submitted/revised/published? Thanks!

@benmarwick
Owner Author

We haven't seen any recent developments that have made automating this step simpler or more obvious to implement. There is so much variation in current practice that it's hard to know which defaults make the most sense.

My current recommendation is to use a hook provided by the data repository service (e.g. Zenodo, OSF and Figshare have this) to connect to the GitHub repo containing the compendium, and then snapshot a version of the GH repo on the data repo at key points (in OSF this is called 'registering', or freezing a version of the repo). I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then I snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily. The repo versions can be tagged with keywords to indicate what part of the process they relate to.

This all happens outside of R. And for me, at least, it's something I do infrequently, just a few times per year. So it's not urgent for me to automate or highly streamline these steps at the moment. But I'm keen to hear how others imagine these steps could be incorporated into a function! (One possible starting point is sketched below.)
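
For the part that can be scripted, a sketch with the osfr package; the project GUID is a placeholder, and registering (freezing) a version still happens in the OSF web interface:

```r
library(osfr)

# upload the current state of the compendium to an existing OSF project;
# registration/freezing is then done manually on osf.io
osf_auth(Sys.getenv("OSF_PAT"))
project <- osf_retrieve_node("abcde")   # placeholder project GUID
osf_upload(project, path = ".", recurse = TRUE, conflicts = "overwrite")
```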

@januz

januz commented Oct 16, 2018

@benmarwick Thank you so much for your detailed explanations. I agree that this process is probably something that can/should be done deliberately and "manually".

@benmarwick
Owner Author

Yes, for now manual handling seems like the best option for this step, at least as far as I can see. I'm curious to see what might pop up in the future to change my mind!

@januz

januz commented Dec 3, 2018

> I usually snapshot the repo at the point of submission to the journal, and get the DOI of the repo to include in the text of the manuscript. Then snapshot it again after peer review, and again after final acceptance. The DOI stays the same throughout this process, on OSF at least, and any user can see that the data repository has multiple versions and can browse them easily

@benmarwick Sorry to follow up so late, but I just tested out registration/freezing of an OSF project with an associated GitHub repository. From what I can see, the DOIs of the different registrations are not the same. Instead, the project has a fixed DOI and each registration has a different one.

I understood you to mean that you publish the DOI of the first registration you create. Did you instead mean that you share the project's DOI (which stays the same), and people can then navigate to the "Registrations" tab to see the different registrations/snapshots that exist for the project?

Thanks!

@benmarwick
Owner Author

Let's include the approaches discussed here as an informational final step in the README, to suggest how the user can archive their compendium on a data repo of their choosing, cf. #56
