
fsspec source #467

Open
juntyr opened this issue Sep 20, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@juntyr

juntyr commented Sep 20, 2024

Is your feature request related to a problem? Please describe.

I haven't yet found a good way to open large (exceeds RAM) remote (not on my local file system) GRIB files in xarray.

Describe the solution you'd like

A new source would be added, e.g.

earthkit.data.from_source(
    "fsspec", uri_or_file, fs=None, **storage_args,
)

that would be similar to the "file" source in making use of random access, but via Python's file-like interface (so "file-like" might be another fitting name), thereby adding support for fsspec's numerous backends to earthkit for free.

This new source should also support loading large GRIB datasets without reading the entire file. Ideally, loading the GRIB file into xarray would read as little data as possible and defer any data reads until the user explicitly asks for the data (similar to how NetCDF and Zarr support lazy loading).
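For illustration, the deferred reads described above could look something like the following stdlib-only sketch, where a hypothetical `LazyRange` records a byte range but touches the underlying file-like object only on first access (`io.BytesIO` stands in for a remote fsspec file; all names here are made up for this example, not earthkit API):

```python
import io


class LazyRange:
    """Hypothetical deferred reader: stores a byte range within a
    file-like object and reads it only when .values is first accessed."""

    def __init__(self, f, offset, length):
        self._f = f          # any object with seek()/read()
        self._offset = offset
        self._length = length
        self._cache = None

    @property
    def values(self):
        if self._cache is None:  # first access triggers the only read
            self._f.seek(self._offset)
            self._cache = self._f.read(self._length)
        return self._cache


# io.BytesIO stands in for a remote fsspec file here
f = io.BytesIO(b"headerPAYLOADtrailer")
field = LazyRange(f, 6, 7)   # no bytes read yet
print(field.values)          # -> b'PAYLOAD'
```

Because `LazyRange` only relies on `seek()`/`read()`, any fsspec file-like object could back it.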

Describe alternatives you've considered

ds = earthkit.data.from_source(
    "stream", fsspec.open("<uri>").open(), batch_size=0,
).to_xarray()

This (inspired by ecmwf/cfgrib#326 (comment)) is the closest current solution, but it pessimistically treats the file as a pure stream rather than a random-access file, which results in excessive reads.
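The cost difference can be shown with a small stdlib-only sketch (the `CountingFile` wrapper is hypothetical; this is not earthkit or fsspec code): a stream source has to consume every byte up to and beyond the data it wants, while a random-access source can seek straight to it:

```python
import io


class CountingFile:
    """Wraps a file-like object and counts the bytes actually read,
    to compare a pure stream against random access (illustrative only)."""

    def __init__(self, f):
        self._f = f
        self.bytes_read = 0

    def read(self, n=-1):
        data = self._f.read(n)
        self.bytes_read += len(data)
        return data

    def seek(self, pos, whence=0):
        return self._f.seek(pos, whence)


# 2006-byte toy file with one interesting 6-byte field at offset 1000
data = b"\x00" * 1000 + b"WANTED" + b"\x00" * 1000

stream = CountingFile(io.BytesIO(data))
stream.read()                    # a stream source must consume everything
print(stream.bytes_read)         # -> 2006

random_access = CountingFile(io.BytesIO(data))
random_access.seek(1000)         # a random-access source seeks to the field
random_access.read(6)
print(random_access.bytes_read)  # -> 6
```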

Additional context

I am working in an extremely memory-constrained environment and would like to support opening remote GRIB files (in addition to NetCDF and Zarr datasets which already work).

Organisation

University of Helsinki, ESiWACE3 project

@sandorkertesz
Collaborator

Thank you for the suggestion.

Just a remark: if you want to convert GRIB to xarray, you first need to scan the whole file/files (all the messages) for metadata. So this is very different from the use case of NetCDF and Zarr, where this information is available "instantly".
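To make the remark concrete, here is a simplified stdlib-only sketch of why the whole file must be scanned: each message's length is only discoverable from its own header, so message offsets have to be found sequentially. (This assumes edition-2-style messages, whose 16-byte indicator section carries the total message length in octets 9-16; `fake_msg` builds toy stand-ins, not real GRIB data.)

```python
import io
import struct


def scan_grib2_offsets(f):
    """Sequentially discover (offset, length) for each message.
    Simplified sketch for GRIB edition 2, where the indicator
    section stores the total message length in octets 9-16."""
    offsets = []
    pos = 0
    while True:
        f.seek(pos)
        header = f.read(16)
        if len(header) < 16 or header[:4] != b"GRIB":
            break
        (length,) = struct.unpack(">Q", header[8:16])
        offsets.append((pos, length))
        pos += length  # the next message starts only after this one ends
    return offsets


def fake_msg(payload):
    # toy edition-2 message: 16-byte indicator section + payload
    length = 16 + len(payload)
    return b"GRIB" + b"\x00\x00\x00\x02" + struct.pack(">Q", length) + payload


f = io.BytesIO(fake_msg(b"aaa") + fake_msg(b"bbbbb"))
print(scan_grib2_offsets(f))  # -> [(0, 19), (19, 21)]
```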

@juntyr
Author

juntyr commented Sep 20, 2024

> Thank you for the suggestion.
>
> Just a remark: if you want to convert GRIB to xarray, you first need to scan the whole file/files (all the messages) for metadata. So this is very different from the use case of NetCDF and Zarr, where this information is available "instantly".

I didn’t know that; is it related to GRIB’s format? In that case, would the index files help? If so, would it be possible to check whether a pre-generated index file is available as well (e.g. a file-like object passed alongside, or a relative fsspec URI) and to use that to skip the initial full-file scan?
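As a sketch of how a pre-generated index could skip the scan (the index format and the function are hypothetical; `io.BytesIO` stands in for a remote file), a mapping of message keys to (offset, length) pairs lets the reader seek directly and read only what is requested:

```python
import io


def read_indexed_messages(f, index, wanted):
    """Hypothetical index-based access: 'index' maps a message key to
    its (offset, length) in the file, so no full scan is needed and
    only the requested messages are read."""
    out = {}
    for key in wanted:
        offset, length = index[key]
        f.seek(offset)           # jump straight to the message
        out[key] = f.read(length)
    return out


f = io.BytesIO(b"AAAABBBBBBCC")
index = {"t2m": (0, 4), "msl": (4, 6), "tp": (10, 2)}
print(read_indexed_messages(f, index, ["msl"]))  # -> {'msl': b'BBBBBB'}
```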

In any case, it would be important that once the metadata has been extracted, the actual data is not loaded into memory until the user requests it (so slices would still only be lazily loaded).
