Add a notebook on opening a remote zarr dataset #13

Merged
merged 4 commits into from
Sep 23, 2024

Contributor

@juntyr juntyr commented Sep 19, 2024

After many weeks of experiments in the background and a flurry of ideas and patches (climet-eu/lab@7846b57...086f285), aiohttp and fsspec now work sufficiently well in the lab to support opening remote zarr datasets :D

This PR adds a short notebook showcasing this functionality with a ~32TB example dataset

Contributor Author

juntyr commented Sep 19, 2024

@SF-N Opening remote Zarr datasets (here hosted on S3, but accessed via HTTP) now works!
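
A minimal sketch of what "hosted on S3 but accessed via HTTP" can look like, assuming a public bucket and the standard virtual-hosted-style S3 endpoint. Both `s3_to_https` and `open_remote_zarr` are hypothetical helper names used for illustration, not code from this PR's notebook, and the bucket name in the usage note is a placeholder.

```python
from typing import Optional
from urllib.parse import urlparse


def s3_to_https(s3_url: str, region: Optional[str] = None) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is publicly readable and its name contains no
    dots (dotted bucket names do not match the wildcard TLS certificate).
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    if region is None:
        host = f"{parsed.netloc}.s3.amazonaws.com"
    else:
        host = f"{parsed.netloc}.s3.{region}.amazonaws.com"
    return f"https://{host}{parsed.path}"


def open_remote_zarr(url: str):
    """Lazily open a Zarr store served over HTTP(S).

    xarray hands the URL to fsspec, which needs aiohttp for HTTP access.
    Only metadata is read up front; chunks are fetched when values are used.
    """
    # Imported here so that s3_to_https stays usable without xarray installed.
    import xarray as xr

    return xr.open_dataset(url, engine="zarr")
```

For example, `open_remote_zarr(s3_to_https("s3://example-bucket/data/store.zarr"))` would request `https://example-bucket.s3.amazonaws.com/data/store.zarr`. Going through plain HTTPS avoids needing S3 credentials handling or s3fs for public data.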

Do you have any suggestions as to what else should go into this notebook, aside from the standard titles and short explanations?

Contributor

SF-N commented Sep 19, 2024

Thanks a lot, this is great.
For now, I don't think there is a lot to add to the notebook.
Depending on the data, it might be necessary to use earthkit.data.from_source to load the data instead of xarray.open_dataset (see the hplp_experiment branch for an example).
But I would postpone this to when we are working on the target datasets from IFS, EC-earth and ICON.

Contributor Author

juntyr commented Sep 19, 2024

So far I’m a bit sceptical of earthkit’s from_source, as it seems to mostly rely on downloading the datasets. Obviously we cannot do this in the online lab (we only have a few MB of cache there, so any GB- or TB-scale dataset must be accessed remotely). There are good existing solutions, some format-specific (e.g. for NetCDF), some more general (using fsspec with zarr or h5netcdf). I think we should support those general options first. Perhaps you could suggest to the earthkit team that they add fsspec as another source (which would bring in additional sources like remote HTTP or S3 for free) and/or accept an already-opened xarray dataset as a source.

Contributor Author

juntyr commented Sep 19, 2024

I've also made progress on loading the dataset with s3fs (see https://gist.github.com/juntyr/8b73265b2eeb766eba7075295d3cafbf), which I'll add to this notebook once it fully works (though it is much slower than going through HTTP directly right now)

Contributor Author

juntyr commented Sep 20, 2024

@SF-N I've also experimented with supporting GRIB at https://gist.github.com/juntyr/14b3f80c58a39624641f9021450e5f28, but it seems like earthkit.data only supports streams (and not random-access file-like objects) for now.

I've opened ecmwf/earthkit-data#467 for this.
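
To illustrate why the stream versus random-access distinction matters for large remote files, here is a small standard-library sketch. `ForwardOnlyStream` and `read_at` are hypothetical names invented for this example and have nothing to do with earthkit's actual API. The point is only that a forward-only stream has to consume every byte before the one you want, while a seekable file-like object can jump straight to it (which a remote filesystem can translate into a byte-range request).

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Bytes that can only be read front to back, like a network stream."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.bytes_read = 0  # how much had to be consumed so far

    def readable(self) -> bool:
        return True

    # seekable() is inherited from io.IOBase and returns False.

    def readinto(self, buffer) -> int:
        n = self._inner.readinto(buffer)
        self.bytes_read += n
        return n


def read_at(f, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset` from a stream or a file."""
    if f.seekable():
        f.seek(offset)  # random access: jump directly to the data
    else:
        remaining = offset  # stream: read and discard everything before it
        while remaining > 0:
            skipped = f.read(min(remaining, 64 * 1024))
            if not skipped:
                raise EOFError("stream ended before reaching offset")
            remaining -= len(skipped)
    return f.read(size)


data = bytes(range(256)) * 4  # 1024 bytes standing in for a large file
stream = ForwardOnlyStream(data)
random_access = io.BytesIO(data)

from_stream = read_at(stream, 1000, 8)
from_file = read_at(random_access, 1000, 8)
```

Both calls return the same eight bytes, but the stream had to consume all 1008 bytes up to the end of the requested range, which is what makes stream-only access impractical for multi-GB files when only a small part is needed.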

Contributor Author

juntyr commented Sep 23, 2024

I've given up a bit on native support for NetCDF through netCDF4 (trying to make the libcurl dependency work has eaten far too much of my brain) or h5netcdf (which seems to not support large datasets).

However, I've managed to get everything to work using kerchunk! We can now load a ~19GB NetCDF file in the online lab:

https://gist.github.com/juntyr/23c2df3b3e20ac351591b99d70e19ca8
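
For readers unfamiliar with kerchunk, the general pattern is roughly as follows. This is a sketch based on kerchunk's documented usage, not the contents of the gist above. It assumes the file is NetCDF4/HDF5 and reachable over HTTPS, and the helper names `reference_open_kwargs` and `open_netcdf_via_kerchunk` are made up for this example.

```python
def reference_open_kwargs(refs: dict, remote_protocol: str = "https") -> dict:
    """Build xarray.open_dataset keyword arguments for a kerchunk reference set.

    `refs` maps Zarr chunk keys to byte ranges in the original file, so the
    zarr engine can read it through fsspec's "reference" filesystem.
    """
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {"fo": refs, "remote_protocol": remote_protocol},
        },
    }


def open_netcdf_via_kerchunk(url: str):
    """Open a remote NetCDF4/HDF5 file without downloading it in full."""
    # Imported lazily: these are optional, heavyweight dependencies.
    import fsspec
    import xarray as xr
    from kerchunk.hdf import SingleHdf5ToZarr

    # Scan only the HDF5 metadata and record where each chunk lives.
    with fsspec.open(url, "rb") as f:
        refs = SingleHdf5ToZarr(f, url).translate()

    # Data chunks are fetched later, on demand, via byte-range requests.
    return xr.open_dataset("reference://", **reference_open_kwargs(refs))
```

The scan step reads only the file's metadata, so the bulk of the data is never transferred until a computation actually touches it.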

Contributor Author

juntyr commented Sep 23, 2024

I've also tried to get it to work with GRIB files, but unfortunately merging different messages doesn't fully work yet in kerchunk (tracked in fsspec/kerchunk#358 and fsspec/kerchunk#150).

@SF-N perhaps you could reach out to the cfgrib or eccodes-python teams after your holiday to look at these kerchunk issues and see if they could help with a fix - being able to load larger remote GRIB files would be very valuable for the community.

My not fully working example (the hybrid dimension is 1 instead of 10) is here:
https://gist.github.com/juntyr/9ebda19683691956700adacd12c4f806

Contributor Author

juntyr commented Sep 23, 2024

The combined and documented notebook is now at https://gist.github.com/juntyr/a89175eb60a80150dc17bf553cd2e2d7.

I'll wait for sympy to be available in the lab before updating and merging this PR.

@juntyr juntyr marked this pull request as ready for review September 23, 2024 11:01
@juntyr juntyr merged commit b7bb484 into main Sep 23, 2024
@juntyr juntyr deleted the zarr-remote-demo branch September 23, 2024 11:01
2 participants