
Passing multiple kerchunk sideload files to open_mfdataset, not possible with intake #135

Open

okz opened this issue Sep 11, 2023 · 5 comments

@okz commented Sep 11, 2023

Standard intake plugins seem to support glob (*) or list urlpaths to consume multiple files with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options fo to load the sideload file:

    driver: intake_xarray.xzarr.ZarrSource
    args:
      urlpath: "reference://"
      storage_options:
        fo: "sideload.json"

Ideally, should the catalog fo be able to accept glob paths?
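For illustration, a hypothetical catalog entry if fo accepted a glob. This is the feature being requested, not something intake_xarray supports today, and the sideload_*.json pattern is made up:

```yaml
sources:
  my_dataset:
    driver: intake_xarray.xzarr.ZarrSource
    args:
      urlpath: "reference://"
      storage_options:
        fo: "sideload_*.json"  # hypothetical: glob expansion is not implemented
```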

More details:

Having many netcdf files with variable dimensions, we hit the "irregular chunk size between files" issue when trying to use kerchunk.
So instead of combining the netcdf files into a single sideload json file, we created a sideload .json for each netcdf file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.

Using xarray open_mfdataset directly, it was possible to use multiple jsons, e.g.:

import fsspec
import ujson
import xarray as xr

# Assumed context: urls is a list of sideload .json reference files,
# fs is an fsspec filesystem, and so holds storage options for the
# underlying netcdf files.
m_list = []
for js in urls:
    with fs.open(js) as f:
        # Build one reference mapper per sideload json
        m_list.append(fsspec.get_mapper("reference://",
                      fo=ujson.load(f), remote_protocol="file",
                      remote_options=so))

# Let xarray merge the per-file reference datasets along time
ds = xr.open_mfdataset(m_list, engine="zarr",
                       combine="nested",
                       backend_kwargs={"consolidated": False},
                       concat_dim="time")

It would have been nice to get rid of this code, and use an intake catalog.

@martindurant (Member) commented
I wonder, does it work to phrase the URL as:

[f"reference://::{u}" for u in urls]

?
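The fsspec chained-URL form suggested here joins the reference protocol with the path of the reference file, so building the list is plain string formatting (urls below is a hypothetical list of reference files):

```python
# Hypothetical list of kerchunk sideload/reference files
urls = ["ref1.json", "ref2.json"]

# fsspec chained-URL form: "reference://::<path-to-reference-file>"
chained = [f"reference://::{u}" for u in urls]
print(chained)  # ['reference://::ref1.json', 'reference://::ref2.json']
```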

By the way, xarray typically still does have to do a certain amount of work in such a case, so you might want to use kerchunk.combine.MultiZarrToZarr to create a single reference set across all the inputs, so that you don't need open_mfdataset at all.

@okz (Author) commented Sep 15, 2023

[f"reference://::{u}" for u in urls]

Even providing the reference and fo hardcoded as a list, the zarr intake plugin fails; I don't think xzarr.ZarrSource makes an attempt to accept multiple jsons in fo, which would have a use case:

AttributeError: 'list' object has no attribute 'get'

kerchunk.combine.MultiZarrToZarr

That was the initial goal, but right now MultiZarrToZarr only supports regular chunking between files. The data has many dimensions and most are not chunked regularly. There isn't a way around that as far as I know?
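As a rough sketch of the constraint: a zarr array (pre-ZEP003) is described by a single chunk shape, so files whose chunk lengths differ along the concat dimension cannot be expressed as one combined reference set. The file names and chunk lengths below are made up:

```python
# Hypothetical chunk lengths along the concat dimension, one per file
file_chunks = {"jan.nc": 240, "feb.nc": 240, "mar.nc": 217}

# A single zarr array needs one uniform chunk size, so a combined
# reference set only works if every file's chunking matches.
combinable = len(set(file_chunks.values())) == 1
print(combinable)  # False: fall back to one reference set per file
```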

PS: I almost gave up on using kerchunk; the open_mfdataset approach, although not the best, was a lifesaver. Maybe it should be documented somewhere? It's still several times faster than opening the netcdf files directly.

@martindurant (Member) commented
It ought not to be too complex to fold this into intake-xarray. We do try to stay close to what xarray itself offers, so one could argue that if open_mfdataset accepts a list of URLs or paths, it should allow a list of storage_options per path too; then everyone gets this kind of workflow, not just intake users.

The data has many dimensions and most are not chunked regularly. There isn't a way around it as far as I know?

This requires ZEP003 in zarr. Please ping the discussion and this draft implementation: zarr-developers/zarr-python#1483

@observingClouds commented
I just came across this issue as I was searching for an option to merge two datasets originating from two kerchunk reference datasets with different chunk sizes.

I tested the workflow with xr.open_mfdataset and can confirm that URL chaining with several fo works!

import xarray as xr

xr.open_mfdataset(
    ["reference://::ref1.json", "reference://::ref2.json"],
    engine="zarr",
    storage_options={"remote_protocol": "s3",
                     "remote_options": {"anon": "true"}},
)

and it also works with intake:

sources:
  some_dataset:
    driver: zarr
    args:
      urlpath:
        - reference://::ref1.json
        - reference://::ref2.json
      storage_options:
        remote_protocol: s3
        remote_options:
          anon: true

@observingClouds commented
Here is a working example:

import intake
cat = intake.open_catalog("https://github.com/ISSI-CONSTRAIN/isccp/raw/main/catalog.yaml")
cat['ISCCP_BASIC_HGH'].to_dask()
