
Proposed Recipes for eVolv2k_v3 #164

Open · jordanplanders opened this issue Aug 2, 2022 · 7 comments

@jordanplanders (Contributor)
Dataset Name

eVolv2k_v3

Dataset URL

https://www.wdc-climate.de/ui/entry?acronym=eVolv2k_v3_ds

Description

The eVolv2k database includes estimates of the magnitudes and approximate source latitudes of major volcanic stratospheric sulfur injection (VSSI) events from 500 BCE to 1900 CE.

License

https://www.wdc-climate.de/ui/info?site=termsofuse

Data Format

NetCDF

Data Format (other)

No response

Access protocol

Other

Source File Organization

There is only one file with data variables corresponding to year, yearCE, month, day, latitude, hemi, vssi, and vssi sigma. The file does not have any declared coordinates.

Example URLs

No response

Authorization

Username / Password

Transformation / Processing

No response

Target Format

Zarr

Comments

This dataset is available from WDC-Climate. Part of the website indicates the data are only available with credentials via their JBLOB interface or the web UI (the source code is fairly dense JavaScript to my relatively untrained eye), but perhaps there is another access point via Swift (swift.dkrz.de) (https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/index.html#dkrz-data-pool). I made a brief (and unsuccessful) attempt at using fsspec_open_kwargs to pass credentials, though based on what I have seen, I'm not surprised that it didn't work.
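For reference, the kind of thing I attempted looked roughly like the sketch below. It assumes the download endpoint accepts HTTP basic auth and that fsspec_open_kwargs is forwarded to fsspec's HTTP filesystem (where fsspec_open_kwargs lives has moved between pangeo-forge-recipes releases); the URL path and credentials are placeholders, not the real ones.

import aiohttp
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

# Placeholder path for the WDC-Climate download endpoint (not a real URL)
def make_auth_path(mock_concat):
    return 'https://www.wdc-climate.de/placeholder/path/eVolv2k_v3_ds_1.nc'

# Single file, so use a dummy concat dimension with one empty key
mock_concat_dim = ConcatDim("mock_concat", [""], nitems_per_file=1)

# fsspec's HTTP filesystem forwards client_kwargs to aiohttp.ClientSession,
# so basic-auth credentials can in principle be supplied this way
auth_pattern = FilePattern(
    make_auth_path,
    mock_concat_dim,
    fsspec_open_kwargs={
        "client_kwargs": {"auth": aiohttp.BasicAuth("USERNAME", "PASSWORD")}
    },
)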

Using GitHub as a temporary location for the data, I got this to work based on examples:

import xarray as xr
import cftime
import numpy as np
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe
from pangeo_forge_recipes.recipes import setup_logging

# Location of the data for testing purposes (temporary GitHub mirror)
stem = 'https://github.com/jordanplanders/public_facing_data/raw/main/pyleoclim_tutorials/eVolv2k_v3_ds_1.nc'
# Single-file dataset, so use a dummy concat dimension with one empty key
url = "{mock_concat}" + stem

time_concat_dim = ConcatDim("mock_concat", [""], nitems_per_file=1)

def make_full_path(mock_concat):
    return url.format(mock_concat=mock_concat)

filepattern = FilePattern(make_full_path, time_concat_dim)

# Create recipe object
recipe = XarrayZarrRecipe(filepattern, inputs_per_chunk=1, xarray_open_kwargs={ 'use_cftime':True, 'decode_times':True}, copy_input_to_local_file=True)

# Set up logging
setup_logging()

# Prune the recipe
recipe_pruned = recipe.copy_pruned()
print(recipe_pruned)

# Run the pruned recipe
run_function = recipe_pruned.to_function()
run_function()

# Check the output
vol_zarr = xr.open_zarr(recipe_pruned.target_mapper, consolidated=False, use_cftime=True, decode_times=True)

# Add time as a coordinate
vol_zarr['time'] = np.array([cftime.DatetimeProlepticGregorian(vol_zarr['year'].values[ik],
                                                               vol_zarr['month'].values[ik], 
                                                               vol_zarr['day'].values[ik] , 
                                                               has_year_zero=True) for ik in range(len(vol_zarr['year']))] )
vol_zarr = vol_zarr.set_coords(['time'])
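If it helps, the time-coordinate step could probably be folded into the recipe itself rather than applied after the fact, via XarrayZarrRecipe's process_input hook. This is an untested sketch that reuses filepattern from above and assumes the (ds, fname) call signature for process_input and that year/month/day decode as plain integer variables.

import cftime
import numpy as np
import xarray as xr
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe

def add_time_coord(ds: xr.Dataset, fname: str) -> xr.Dataset:
    # Build a proleptic-Gregorian time axis from the year/month/day variables
    times = [
        cftime.DatetimeProlepticGregorian(int(y), int(m), int(d), has_year_zero=True)
        for y, m, d in zip(ds["year"].values, ds["month"].values, ds["day"].values)
    ]
    ds["time"] = (ds["year"].dims, np.array(times))
    return ds.set_coords("time")

recipe = XarrayZarrRecipe(
    filepattern,
    inputs_per_chunk=1,
    xarray_open_kwargs={"use_cftime": True, "decode_times": True},
    copy_input_to_local_file=True,
    process_input=add_time_coord,  # applied to each input dataset before writing
)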
@jordanplanders (Contributor, Author)

@cisaacstern (as I see it) URLs will inevitably change, and recipes will need to be updated accordingly, which raises a question: what is the Pangeo Forge perspective on small datasets stored in public GitHub repos?

I exchanged emails with the author of this dataset (Matt Toohey) and he suggested making it available from his own GitHub repo to sidestep this particular authentication issue. Is that an accepted approach? (I can imagine this is not relevant in most cases because the size of most datasets is prohibitive, but this is probably not the only instance of its kind.)

@cisaacstern (Member) commented Sep 15, 2022

@jordanplanders, if the data is small enough to host on GitHub, perhaps hosting on Zenodo is preferable?

Re: side-stepping auth this way, let's just make sure that the data provider's authentication requirement doesn't mean that hosting a publicly accessible mirror would run afoul of the license.

Generally, I'm thrilled to see any use of Pangeo Forge, though if the data is small enough to host on GitHub, perhaps it's worth asking what value, if any, Pangeo Forge + Zarr adds here?

@jordanplanders (Contributor, Author) commented Sep 15, 2022

@cisaacstern Yep! Very valid! This is one of those moments when I either need a law degree or more practice deciphering licensing (I guess those are sort of the same). By default the license is CC4, and the FAQs suggest the authentication is a catch-all because some hosted datasets require users to be granted permission to download. So if this dataset isn't actually in that protected category, it seems like it wouldn't be problematic to mirror it elsewhere.

As far as the question about "why PF + Zarr for a small dataset?", my instinct was that a "one-stop shop" approach might be the most effective way to keep the friction involved in moving to a Python-based, transparent analysis culture spanning multiple and varied datasets below the surrender threshold. For folks who aren't natively data-science-style data wranglers (probably particularly true among those who work with small datasets), my hunch is that navigating multiple sources and protocols might result in some attrition. Does any of that ring true?

I'm not sure about this, but it seems like the slickest way to access and work with data in a cloud-hub working environment, which is becoming more common, I think.

@cisaacstern (Member)

Yep, that makes sense to me. Thanks for thinking through it out loud. Among other things, hopefully these conversations may be useful to others contemplating similar things down the line.

In terms of where to host some mirror of the data outside the auth wall (as a stopgap until Pangeo Forge supports user-supplied credentials), I think Zenodo may be the more appropriate choice, but if GitHub is easier and you want to experiment with that, I don't see any reason not to.

@jordanplanders (Contributor, Author)

@cisaacstern Great! I'll talk to Matt about Zenodo and revisit the recipe with an eye toward the various things I've learned recently.

@cisaacstern (Member)

Looking forward to the PR! Please let me know if/how I can help.

@jordanplanders (Contributor, Author)

@cisaacstern I'm still waiting to hear from Matt about whether he wants to use Zenodo, but in case others want to point to files stored on GitHub in the future, it's worth knowing that URLs that point to the blob page will throw the following error:

File ~/opt/miniconda3/envs/pyleo_tutorials/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py:150, in H5NetCDFStore.open(cls, filename, mode, format, group, lock, autoclose, invalid_netcdf, phony_dims, decode_vlen_strings)
    148     magic_number = read_magic_number_from_file(filename)
    149     if not magic_number.startswith(b"\211HDF\r\n\032\n"):
--> 150         raise ValueError(
    151             f"{magic_number} is not the signature of a valid netCDF4 file"
    152         )
    154 if format not in [None, "NETCDF4"]:
    155     raise ValueError("invalid format for h5netcdf backend")

ValueError: b'\n\n\n\n\n\n\n<' is not the signature of a valid netCDF4 file

I chased it around with file_type briefly before realizing I had seen this before.

A URL that points to the raw file (like the /raw/ URL used in the recipe above) will work, but blob-style URLs will not.
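For anyone who hits this later, the distinction is whether the URL returns the raw netCDF bytes or GitHub's HTML file-viewer page (the b'\n\n\n...<' in the traceback is the start of an HTML document, not an HDF5/netCDF4 magic number). Roughly, with placeholder paths:

# Works: URLs that serve the file contents directly
'https://github.com/<user>/<repo>/raw/main/<path>/eVolv2k_v3_ds_1.nc'
'https://raw.githubusercontent.com/<user>/<repo>/main/<path>/eVolv2k_v3_ds_1.nc'

# Fails with the error above: "blob" URLs, which return the HTML viewer page
'https://github.com/<user>/<repo>/blob/main/<path>/eVolv2k_v3_ds_1.nc'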
