Materials for the Reproducible Data Science in Python tutorial at SciPy 2019.
Presenters: Chandrasekhar Ramakrishnan (Swiss Data Science Center) and Xu Fei (Code Ocean)
Date | Change |
---|---|
2019-06-18 | Initial version |
2019-06-25 | Added instructions for Windows |
2019-07-01 | Updated environment.yml to work on conda 4.7 and 4.6 |
2019-07-02 | Updated environment.yml to work cross-platform |
2019-07-04 | Added dot as a dependency |
The expectation of reproducibility in scientific work is long established, and communities and funding sources increasingly demand it. Within the Python ecosystem, a variety of tools support reproducible data science, but choosing and using one is not always straightforward. In this tutorial, we take a closer look at the concept of reproducibility, examine the technologies that provide building blocks, and survey the landscape of tools. We spend the majority of the time on two solutions in particular, Renku and Code Ocean, and work through end-to-end scenarios in both.
To avoid conflicts in dependencies, we recommend creating a dedicated environment for this tutorial. You can do this using any tool you like, for example pipenv or conda.
We provide instructions based on conda below. If you use docker, we also provide a Dockerfile with instructions for setup and use. If you prefer to use something else, you will need to ensure that git, git-lfs, curl, and node are installed in your environment, but you should be able to pip install the requirements.txt file for the rest.
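If you manage the environment yourself, a quick check that the four required tools are on your PATH can save time later. The following is a minimal sketch and not part of the tutorial materials; the helper name `missing_tools` is hypothetical, and the executable names are assumed to match the tool names given above.

```python
import shutil

# Tools the text above says must be installed when not using conda.
# Assumption: each is exposed on the PATH under exactly this name.
REQUIRED_TOOLS = ["git", "git-lfs", "curl", "node"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Finding an executable only shows that it is installed, not that its version is recent enough.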
If you do not wish to set up an environment on your computer, you can follow these instructions to use Renkulab, or you can run the tutorial on Code Ocean or MyBinder.
Create environment using conda
- If you do not yet have conda, you should first install miniconda for your platform
- Download the conda environment
- In the directory where environment.yml is located, execute
conda env create
Verifying the setup
- Activate the environment with
conda activate r10eds
- Run
git --version
(the result should be "git version 2.21.0" or newer)
- Run
git lfs --version
(the result should be "git-lfs/2.7.1" or newer)
- Run
renku --version
(the result should be "0.5.0" or newer)
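The three version checks above can also be run in one go. The sketch below is an illustration, not part of the tutorial; the helpers `parse_version`, `is_at_least`, and `check` are hypothetical. It assumes each command prints its version on standard output and that the first dotted number in that output is the version.

```python
import re
import shutil
import subprocess

# Minimum versions quoted in the verification steps above.
MINIMUM_VERSIONS = {
    ("git", "--version"): (2, 21, 0),
    ("git", "lfs", "--version"): (2, 7, 1),
    ("renku", "--version"): (0, 5, 0),
}


def parse_version(output):
    """Return the first dotted version in `output` as a 3-tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)+", output)
    if match is None:
        return None
    parts = [int(p) for p in match.group(0).split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad "1.12" to (1, 12, 0)
    return tuple(parts)


def is_at_least(output, minimum):
    """True if the version found in `output` is `minimum` or newer."""
    version = parse_version(output)
    return version is not None and version >= minimum


def check(command, minimum):
    """Run `command` and compare its reported version against `minimum`."""
    if shutil.which(command[0]) is None:
        return False
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return is_at_least(result.stdout, minimum)


if __name__ == "__main__":
    for command, minimum in MINIMUM_VERSIONS.items():
        status = "ok" if check(command, minimum) else "missing or too old"
        print(" ".join(command), "->", status)
```

Run it with the r10eds environment activated, so that the executables found are the ones installed by conda.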
Additional setup on Windows
On Windows, an additional step is necessary. Renku creates symbolic links, and on Windows a user needs specific privileges to create them. Follow these instructions from StackExchange/Super User to give your user these privileges.
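To confirm that your user can create symbolic links, you can try creating one in a temporary directory. This is a minimal sketch, not part of the tutorial materials; the function name `can_create_symlinks` is hypothetical. It only tests the Python process it runs in, so run it from the same shell you will use for Renku.

```python
import os
import tempfile


def can_create_symlinks():
    """Try to create a symbolic link in a temporary directory.

    Returns True if the link was created, False if the operating system
    refused (for example, because of missing privileges on Windows).
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    if can_create_symlinks():
        print("Symbolic links can be created.")
    else:
        print("Symbolic links cannot be created; see the instructions above.")
```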
- Activate the environment with
conda activate r10eds
- Clone the repository
git clone https://github.com/SwissDataScienceCenter/r10e-ds-py.git
Once you have set up the environment and cloned the repository, you can start working with the tutorial.
- cd into the tutorial repository
cd r10e-ds-py
- Activate the environment with
conda activate r10eds
- Start jupyter lab
jupyter lab
(you can also use plain jupyter)
If you wish, you can install Docker Desktop. It is not a requirement, but it will make it possible to dig deeper into certain areas in the tutorial.
Introduction (1h) | ||
---|---|---|
15 min | Background & Theory | Terminology, history, and philosophy of reproducibility |
30 min | Building Blocks | Building blocks for achieving reproducibility |
15 min | Tools | Survey of the current tool landscape |
Break (10 min) | ||
Hands-on with Renku (1h 30m) | ||
30 min | Starting | Starting a project, importing data, building a workflow |
30 min | Iterating | Updating code and data to improve analysis |
30 min | Details and Reflection | What is the benefit? How much effort was it? How do we view, share, and reuse artifacts? How do things work under the covers? |
Break (20 min) | ||
Hands-on with Code Ocean (1h) | ||
10 min | Demo of a Compute Capsule | Intro to Code Ocean and its design philosophy |
30 min | Creating a Compute Capsule | Create a reproducible compute capsule using code and data from the existing Renku project. We will explore options to publish, collaborate, import from GitHub, export to a local server, etc. |
15 min | Q&A, Wrap up | Any questions that you want to ask |
Many thanks to Erica Moreira, Laura Levin-Gleba, and Maja Garbulinksa from the Harvard School of Public Health for their helpful comments and suggestions!
The icons used are from Icons8.