Skip to content

Latest commit

 

History

History
39 lines (20 loc) · 2.78 KB

overview.md

File metadata and controls

39 lines (20 loc) · 2.78 KB

Python access patterns for NASA datasets behind Earth Data Login

The issue

xarray has become the de facto library to access multidimensional data in Python. Used along with the Pangeo stack, (Dask in particular) unlocks the potential to more efficient and scalable workflows for geospatial data analysis. xarray and Dask really shine when our data is gridded (processing level 3 and above), in a cloud optimized format and publicly available (not auth wall). Examples of this can be found in the Pangeo gallery where TB of data can be processed with simpler code using horizontally scalable infrastructure.

When earth scientists see the examples in the Pangeo gallery they often think about using them with NASA datasets. However, most of NASA's Earth datasets are not in a cloud optimized format and are behind NASA's EDL (Earth Data Login).

The main technical issue with EDL is that xarray and Dask the core components of the Pangeo stack use a package called fsspec to access files over HTTP and fsspec does not support OAuth2 out of the box. With local files this is not an issue but when we want to open files behind EDL over the network we run into problems, especially if we want to use a distributed Dask cluster.

Another aspect of data usability is that NASA supports different patterns for cloud-based datasets that introduce some complexity as users have to think about tokens, temporary keys and expiration times. This white paper by Patrick Quinn and others expands on some of these issues and potential solutions. Making cloud hosted data more accessible is very important as NASA is migrating their whole catalog to AWS and eventually will be the only supported distribution.

This diagram tries to put the different cloud hosted dataset access patterns into context:

Code examples:

import xarray as xr

file = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
# will this work?
ds = xr.open_dataset(file, engine="h5netcdf")
ds

Potential solutions

Pre-signed aiohttp sessions and leveraging the Thing Egress App (TEA)

TODO: expand on this and start developing a new back-end for fsspec