
Proposed Recipes for eVolv2k_v3 #164

Open · jordanplanders opened this issue Aug 2, 2022 · 7 comments

@jordanplanders (Contributor)
Dataset Name

eVolv2k_v3

Dataset URL

https://www.wdc-climate.de/ui/entry?acronym=eVolv2k_v3_ds

Description

The eVolv2k database includes estimates of the magnitudes and approximate source latitudes of major volcanic stratospheric sulfur injection (VSSI) events from 500 BCE to 1900 CE.

License

https://www.wdc-climate.de/ui/info?site=termsofuse

Data Format

NetCDF

Data Format (other)

No response

Access protocol

Other

Source File Organization

There is only one file with data variables corresponding to year, yearCE, month, day, latitude, hemi, vssi, and vssi sigma. The file does not have any declared coordinates.

Example URLs

No response

Authorization

Username / Password

Transformation / Processing

No response

Target Format

Zarr

Comments

This dataset is available from WDC-Climate. Part of the website indicates the data are only available with credentials via their JBLOB interface or the web UI (the source code is fairly dense JavaScript to my relatively untrained eye), but perhaps there is another access point via Swift (swift.dkrz.de) (https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/index.html#dkrz-data-pool). I made a brief (and unsuccessful) attempt at using fsspec_open_kwargs to pass credentials, though based on what I have seen, I'm not surprised that it didn't work.
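For reference, the kind of thing I attempted looked roughly like the sketch below. It assumes the download endpoint accepts HTTP basic auth and that fsspec_open_kwargs is forwarded to fsspec's HTTP filesystem (where fsspec_open_kwargs lives has moved between pangeo-forge-recipes releases); the URL path and credentials are placeholders, not the real ones.

import aiohttp
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

# Placeholder path for the WDC-Climate download endpoint (not a real URL)
def make_auth_path(mock_concat):
    return 'https://www.wdc-climate.de/placeholder/path/eVolv2k_v3_ds_1.nc'

# Single file, so use a dummy concat dimension with one empty key
mock_concat_dim = ConcatDim("mock_concat", [""], nitems_per_file=1)

# fsspec's HTTP filesystem forwards client_kwargs to aiohttp.ClientSession,
# so basic-auth credentials can in principle be supplied this way
auth_pattern = FilePattern(
    make_auth_path,
    mock_concat_dim,
    fsspec_open_kwargs={
        "client_kwargs": {"auth": aiohttp.BasicAuth("USERNAME", "PASSWORD")}
    },
)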

Using GitHub as a temporary location for the data, I got this to work based on examples:

import xarray as xr
import cftime
import numpy as np
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe
from pangeo_forge_recipes.recipes import setup_logging

# Location of the data for testing purposes (temporary GitHub mirror)
stem = 'https://github.com/jordanplanders/public_facing_data/raw/main/pyleoclim_tutorials/eVolv2k_v3_ds_1.nc'
# Single-file dataset, so use a dummy concat dimension with one empty key
url = "{mock_concat}" + stem

time_concat_dim = ConcatDim("mock_concat", [""], nitems_per_file=1)

def make_full_path(mock_concat):
    return url.format(mock_concat=mock_concat)

filepattern = FilePattern(make_full_path, time_concat_dim)

# Create recipe object
recipe = XarrayZarrRecipe(filepattern, inputs_per_chunk=1, xarray_open_kwargs={ 'use_cftime':True, 'decode_times':True}, copy_input_to_local_file=True)

# Set up logging
setup_logging()

# Prune the recipe
recipe_pruned = recipe.copy_pruned()
print(recipe_pruned)

# Run the pruned recipe
run_function = recipe_pruned.to_function()
run_function()

# Check the output
vol_zarr = xr.open_zarr(recipe_pruned.target_mapper, consolidated=False, use_cftime=True, decode_times=True)

# Add time as a coordinate
vol_zarr['time'] = np.array([cftime.DatetimeProlepticGregorian(vol_zarr['year'].values[ik],
                                                               vol_zarr['month'].values[ik], 
                                                               vol_zarr['day'].values[ik] , 
                                                               has_year_zero=True) for ik in range(len(vol_zarr['year']))] )
vol_zarr = vol_zarr.set_coords(['time'])
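If it helps, the time-coordinate step could probably be folded into the recipe itself rather than applied after the fact, via XarrayZarrRecipe's process_input hook. This is an untested sketch that reuses filepattern from above and assumes the (ds, fname) call signature for process_input and that year/month/day decode as plain integer variables.

import cftime
import numpy as np
import xarray as xr
from pangeo_forge_recipes.recipes.xarray_zarr import XarrayZarrRecipe

def add_time_coord(ds: xr.Dataset, fname: str) -> xr.Dataset:
    # Build a proleptic-Gregorian time axis from the year/month/day variables
    times = [
        cftime.DatetimeProlepticGregorian(int(y), int(m), int(d), has_year_zero=True)
        for y, m, d in zip(ds["year"].values, ds["month"].values, ds["day"].values)
    ]
    ds["time"] = (ds["year"].dims, np.array(times))
    return ds.set_coords("time")

recipe = XarrayZarrRecipe(
    filepattern,
    inputs_per_chunk=1,
    xarray_open_kwargs={"use_cftime": True, "decode_times": True},
    copy_input_to_local_file=True,
    process_input=add_time_coord,  # applied to each input dataset before writing
)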
@jordanplanders (Contributor, Author)

@cisaacstern (as I see it) URLs will inevitably change, and recipes will need to be updated accordingly, which raises a question: what is the Pangeo Forge perspective on small datasets stored in public GitHub repos?

I exchanged emails with the author of this dataset (Matt Toohey) and he suggested making it available from his own GitHub repo to sidestep this particular authentication issue. Is that an accepted approach? (I can imagine this is not relevant in most cases because the size of most datasets is prohibitive, but this is probably not the only instance of its kind.)

@cisaacstern (Member) commented Sep 15, 2022

@jordanplanders, if the data is small enough to host on GitHub, perhaps hosting on Zenodo is preferable?

Re: side-stepping auth this way, let's just make sure that the data provider's authentication requirement doesn't mean that hosting a publicly accessible mirror would run afoul of the license.

Generally, I'm thrilled to see any use of Pangeo Forge, though if the data is small enough to host on GitHub, perhaps it's worth asking what value, if any, Pangeo Forge + Zarr adds here?

@jordanplanders (Contributor, Author) commented Sep 15, 2022

@cisaacstern Yep! Very valid! This is one of those moments when I either need a law degree or more practice deciphering licensing (I guess those are sort of the same). By default the license is CC4, and the FAQs suggest the authentication is a catch-all because some hosted datasets require users to be granted permission to download. So if this dataset isn't actually in that protected category, it seems like it wouldn't be problematic to mirror it elsewhere.

As far as the question about "why PF + Zarr for a small dataset?", my instinct was that a "one-stop shop" approach might be the most effective way to keep the friction involved in moving to a Python-based, transparent analysis culture spanning multiple and varied datasets below the surrender threshold. For folks who aren't natively data-science-style data wranglers (probably particularly true among those who work with small datasets), my hunch is that navigating multiple sources and protocols might result in some attrition. Does any of that ring true?

I'm not sure about this, but it seems like the slickest way to access and work with data in a cloud-hub working environment, which is becoming more common, I think.

@cisaacstern (Member)

Yep, that makes sense to me. Thanks for thinking through it out loud. Among other things, hopefully these conversations may be useful to others contemplating similar things down the line.

In terms of where to host some mirror of the data outside the auth wall (as a stopgap until Pangeo Forge supports user-supplied credentials), I think Zenodo may be the more appropriate choice, but if GitHub is easier and you want to experiment with that, I don't see any reason not to.

@jordanplanders (Contributor, Author)

@cisaacstern Great! I'll talk to Matt about Zenodo and revisit the recipe with an eye toward the various things I've learned recently.

@cisaacstern (Member)

Looking forward to the PR! Please let me know if/how I can help.

@jordanplanders (Contributor, Author)

@cisaacstern I'm still waiting to hear from Matt about whether he wants to use Zenodo, but in case others want to point to files stored on GitHub in the future, it's worth knowing that URLs that point to the blob page will throw the following error:

File ~/opt/miniconda3/envs/pyleo_tutorials/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py:150, in H5NetCDFStore.open(cls, filename, mode, format, group, lock, autoclose, invalid_netcdf, phony_dims, decode_vlen_strings)
    148     magic_number = read_magic_number_from_file(filename)
    149     if not magic_number.startswith(b"\211HDF\r\n\032\n"):
--> 150         raise ValueError(
    151             f"{magic_number} is not the signature of a valid netCDF4 file"
    152         )
    154 if format not in [None, "NETCDF4"]:
    155     raise ValueError("invalid format for h5netcdf backend")

ValueError: b'\n\n\n\n\n\n\n<' is not the signature of a valid netCDF4 file

I chased it around with file_type briefly before realizing I had seen this before.

A URL that points to the raw file (like the /raw/ URL used in the recipe above) will work, but blob-style URLs will not.
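For anyone who hits this later, the distinction is whether the URL returns the raw netCDF bytes or GitHub's HTML file-viewer page (the b'\n\n\n...<' in the traceback is the start of an HTML document, not an HDF5/netCDF4 magic number). Roughly, with placeholder paths:

# Works: URLs that serve the file contents directly
'https://github.com/<user>/<repo>/raw/main/<path>/eVolv2k_v3_ds_1.nc'
'https://raw.githubusercontent.com/<user>/<repo>/main/<path>/eVolv2k_v3_ds_1.nc'

# Fails with the error above: "blob" URLs, which return the HTML viewer page
'https://github.com/<user>/<repo>/blob/main/<path>/eVolv2k_v3_ds_1.nc'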
