Dorothy - Making Scientific Data Transparent, Accessible, and Reproducible

39alpha/dorothy

Introduction

Dorothy is a unified solution for data management, versioning, hosting, and distribution. It aims to be accessible to researchers in any field, working from anywhere, and managing any kind of data, from initial curation through publication and long-term archiving.

While Dorothy is still a work in progress, we have four ambitious objectives:

  1. Make data more transparent. Researchers will be able to easily track versions of their data over time, linking specific versions to particular analyses.

  2. Increase data accessibility. Anyone with an internet connection will be able to quickly download and contribute data products, either via centralized repositories or from the peer-to-peer network, enabling collaboration and access that would otherwise be impossible.

  3. Improve reproducibility. Datasets are referenced by their content, not their names, so subsequent efforts built on such data can be certain that the data assets are identical to those used previously.

  4. Further inclusive practices. Dorothy will provide both the tools and venue for diverse and inclusive communities of researchers around the world, analogous to GitHub for software developers. Dorothy will also provide data storage and dissemination resources to those without the means to run their own Dorothy node.
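The idea behind objective 3, referencing data by content rather than by name, can be illustrated with a minimal sketch. It assumes a plain SHA-256 digest as a stand-in identifier; Dorothy relies on IPFS for content-based hashing, so its real identifiers will look different. The function `content_id` and the file names are hypothetical and exist only for illustration.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a hex SHA-256 digest, used here as a stand-in content identifier."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical datasets: two share identical bytes under different names,
# and one differs by a single value.
datasets = {
    "results_v1.csv": b"sample,value\na,1\nb,2\n",
    "copy_of_results.csv": b"sample,value\na,1\nb,2\n",
    "results_v2.csv": b"sample,value\na,1\nb,3\n",
}

ids = {name: content_id(data) for name, data in datasets.items()}

# Identical content gives the same identifier regardless of file name.
same = ids["results_v1.csv"] == ids["copy_of_results.csv"]
# Any change to the content gives a different identifier.
changed = ids["results_v1.csv"] != ids["results_v2.csv"]
print(same, changed)  # prints: True True
```

Because the identifier depends only on the bytes, an analysis that records it can later confirm it is reading exactly the same data, whatever the file has been renamed to in the meantime.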

Ideally, updating a dataset should be as simple as:

# Clone an existing dataset to your machine
$ dorothy clone https://dorothy.39alpharesearch.org/team/dataset
$ cd dataset

# View the history
$ dorothy log

# Checkout a version
$ dorothy checkout Qm123 data

# Edit the data

# Commit a new version
$ dorothy commit data

# Push the changes back to the remote host
$ dorothy push

Dorothy comes with a "dataforge" analogous to GitLab/GitHub, but specifically for managing datasets.

$ dorothy serve

Anyone can host a Dorothy dataforge if they choose, or use a

Getting Started

Installation from Source

$ git clone https://github.com/39alpha/dorothy
$ cd dorothy
$ make
$ make install
$ sudo mv dorothy /usr/bin/dorothy # not ideal, but it's what we've got ATM

Build Dependencies

Go >= 1.22 and Node.js

Binary Releases

At the moment, we don’t have binary releases set up.

Intellectual Relatives

Foundations and Inspiration

  • git - Dorothy’s interface is designed to mirror git

  • darcs - The way Dorothy manages history mirrors darcs in many ways

  • IPFS - Dorothy uses IPFS for content-based hashing, deduplication and peer-to-peer networking.
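To show why content-based hashing leads naturally to deduplication, here is a minimal sketch of a content-addressed chunk store. It is an assumption-laden toy: it uses fixed-size chunks and SHA-256 keys, whereas IPFS has its own chunking and identifier scheme, and nothing here reflects Dorothy's actual storage code. The class `ChunkStore` and its methods are hypothetical names.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept only once."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}

    def put(self, data: bytes) -> list[str]:
        """Split data into fixed-size chunks and return the list of chunk keys."""
        keys = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is not stored a second time.
            self.chunks.setdefault(key, chunk)
            keys.append(key)
        return keys

    def get(self, keys: list[str]) -> bytes:
        """Reassemble the original bytes from an ordered list of chunk keys."""
        return b"".join(self.chunks[key] for key in keys)


store = ChunkStore(chunk_size=4)
v1 = store.put(b"AAAABBBBCCCC")  # first version of a dataset
v2 = store.put(b"AAAABBBBDDDD")  # second version; only the last chunk differs

# Six chunk references across both versions, but only four chunks stored.
print(len(v1) + len(v2), len(store.chunks))  # prints: 6 4
```

The two versions share their first two chunks, so storing the second version adds only the one chunk that changed. The same property lets peers exchange just the chunks they are missing.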

Alternatives

  • Qri - An abandoned attempt at data management via IPFS

  • Dolt - "Git for Data" based on a database

  • Quilt - "A data mesh for connecting people with actionable data"

  • DVC - "ML Experiments and Data Management with Git"

The Dorothy Community

Public Dataforges

No public dataforges exist quite yet.

Copyright © 2023-2024 39 Alpha Research. Free use of this software is granted under the terms of the MIT License.

Support

This project was supported by the National Aeronautics and Space Administration (NASA) under Grant Number 22-HPOSS22-0021, through Research Opportunities in Space and Earth Science (ROSES-2022), Program Element F.15 High Priority Open-Source Science.

If you wish to further support this project, or 39 Alpha Research in general, please visit https://39alpharesearch.org/donate.
