SCIWS12: Pangeo: Hands on with JupyterHub and Open-source Python Tools for Scalable Analysis of Big Data in the Geosciences
Date and Time: Sunday, December 8, 2019, 1:40 PM - 6:00 PM
Location: Grand Hyatt Union Square, San Francisco, CA. Conference Theatre (Theatre Level)
Instructors: Scott Henderson (@scottyhq), Joe Hamman (@jhamman), Amanda Tan Lehr (@amanda-tan), and Jessica Scheick (@JessicaS11)
Abstract
Bring your laptop to this hands-on workshop sponsored by the Pangeo project! Pangeo is first and foremost a community promoting open, reproducible, and scalable science. Participants will learn about Python and several open-source software packages for analytic workflows with big data in Earth Science. Particular emphasis will be given to the Jupyter, Xarray, and Dask software libraries for analysis of multidimensional modeling and remote sensing data sets. In this half-day workshop, participants will practice writing code in Jupyter Notebooks that runs on scalable computing clusters in the Cloud, bypassing a common bottleneck: downloading ever-increasing volumes of remote sensing or modeling data. We will have both introductory tutorials and advanced examples used in peer-reviewed research in the fields of oceanography, hydrology, and solid earth geophysics. For those interested in building JupyterHub infrastructure for research groups, we will also spend time explaining costs and how these computational resources can be deployed on local servers, HPC systems, or commercial Clouds.
Workshop Agenda
Time | Topic |
---|---|
1:40 - 2:10 | Introductions, Pangeo and Jupyter communities and software ecosystems |
2:10 - 2:30 | Geopandas tutorial for geospatial vector data |
2:30 - 2:50 | 20-minute coffee break |
2:50 - 3:10 | Xarray tutorial for geospatial raster data |
3:10 - 3:30 | Dask for distributed computing |
3:30 - 3:50 | Intake for data management |
3:50 - 4:20 | 30-minute break / work on exercises |
4:20 - 5:00 | Scaling up with AWS Landsat 8 Public Data |
5:00 - 5:30 | Scaling up with CESM-LENS climate models |
5:30 - 6:00 | Survey + Feedback, Q&A + next steps |
Learning Objectives:
- Recognize the software packages that comprise the Pangeo platform and explain how they work together
- Build familiarity with key geospatial Python software: Geopandas for vector data and Xarray for raster data
- Learn how to work with larger-than-memory datasets using Dask
- Understand how to efficiently work with data on Cloud infrastructure
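
The idea behind the larger-than-memory objective can be sketched in plain Python. This is not Dask's API and not workshop material. It is a minimal illustration, under assumed names, of the chunked-reduction pattern that libraries such as Dask automate: process one block at a time, keep only small partial results, and combine them at the end. The functions `read_chunks` and `chunked_mean`, and the synthetic data they use, are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_chunks(n_values: int, chunk_size: int) -> Iterator[List[float]]:
    """Yield a synthetic dataset (0, 1, 2, ...) one chunk at a time.

    Hypothetical stand-in for reading blocks of a large file or cloud
    object; only one chunk exists in memory at any moment.
    """
    for start in range(0, n_values, chunk_size):
        stop = min(start + chunk_size, n_values)
        yield [float(i) for i in range(start, stop)]


def chunked_mean(chunks: Iterable[List[float]]) -> float:
    """Compute a mean from per-chunk partial sums and counts."""
    total = 0.0
    count = 0
    for chunk in chunks:
        # Reduce each chunk to two small numbers, then discard the chunk.
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total / count


# One million values processed in chunks of 100,000.
result = chunked_mean(read_chunks(n_values=1_000_000, chunk_size=100_000))
print(result)
```

In this sketch the chunks are processed one after another. A parallel framework applies the same decomposition but can evaluate the per-chunk reductions on many workers at once.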