
Passing multiple kerchunk sideload files to open_mfdataset, not possible with intake #135

Open

okz opened this issue Sep 11, 2023 · 5 comments

@okz commented Sep 11, 2023

Standard intake plugins seem to support glob (*) or list urlpaths to consume multiple files with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options fo to load the sideload file:

    driver: intake_xarray.xzarr.ZarrSource
    args:
      urlpath: "reference://"
      storage_options:
        fo: "sideload.json"

Ideally, should the catalog fo be able to accept glob paths?
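For illustration, a hypothetical catalog entry if fo accepted a glob. This is the feature being requested, not something intake_xarray supports today, and the sideload_*.json pattern is made up:

```yaml
sources:
  my_dataset:
    driver: intake_xarray.xzarr.ZarrSource
    args:
      urlpath: "reference://"
      storage_options:
        fo: "sideload_*.json"  # hypothetical: glob expansion is not implemented
```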

More details:

Having many netcdf files with variable dimensions, we hit the "irregular chunk size between files" issue when trying to use kerchunk.
So instead of combining the netcdf files into a single sideload json file, we created a sideload .json for each netcdf file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.

Using xarray open_mfdataset directly, it was possible to use multiple jsons, e.g.:

import fsspec
import ujson
import xarray as xr

# Assumed context: urls is a list of sideload .json reference files,
# fs is an fsspec filesystem, and so holds storage options for the
# underlying netcdf files.
m_list = []
for js in urls:
    with fs.open(js) as f:
        # Build one reference mapper per sideload json
        m_list.append(fsspec.get_mapper("reference://",
                      fo=ujson.load(f), remote_protocol="file",
                      remote_options=so))

# Let xarray merge the per-file reference datasets along time
ds = xr.open_mfdataset(m_list, engine="zarr",
                       combine="nested",
                       backend_kwargs={"consolidated": False},
                       concat_dim="time")

It would have been nice to get rid of this code, and use an intake catalog.

@martindurant (Member) commented
I wonder, does it work to phrase the URL as:

[f"reference://::{u}" for u in urls]

?
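The fsspec chained-URL form suggested here joins the reference protocol with the path of the reference file, so building the list is plain string formatting (urls below is a hypothetical list of reference files):

```python
# Hypothetical list of kerchunk sideload/reference files
urls = ["ref1.json", "ref2.json"]

# fsspec chained-URL form: "reference://::<path-to-reference-file>"
chained = [f"reference://::{u}" for u in urls]
print(chained)  # ['reference://::ref1.json', 'reference://::ref2.json']
```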

By the way, xarray typically still does have to do a certain amount of work in such a case, so you might want to use kerchunk.combine.MultiZarrToZarr to create a single reference set across all the inputs, so that you don't need open_mfdataset at all.

@okz (Author) commented Sep 15, 2023

[f"reference://::{u}" for u in urls]

Even providing the reference and fo hardcoded as a list, the zarr intake plugin fails; I don't think xzarr.ZarrSource makes an attempt to accept multiple jsons in fo, which would have a use case:

AttributeError: 'list' object has no attribute 'get'

kerchunk.combine.MultiZarrToZarr

That was the initial goal, but right now MultiZarrToZarr only supports regular chunking between files. The data has many dimensions and most are not chunked regularly. There isn't a way around that as far as I know?
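As a rough sketch of the constraint: a zarr array (pre-ZEP003) is described by a single chunk shape, so files whose chunk lengths differ along the concat dimension cannot be expressed as one combined reference set. The file names and chunk lengths below are made up:

```python
# Hypothetical chunk lengths along the concat dimension, one per file
file_chunks = {"jan.nc": 240, "feb.nc": 240, "mar.nc": 217}

# A single zarr array needs one uniform chunk size, so a combined
# reference set only works if every file's chunking matches.
combinable = len(set(file_chunks.values())) == 1
print(combinable)  # False: fall back to one reference set per file
```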

PS: I almost gave up on using kerchunk; the open_mfdataset approach, although not the best, was a lifesaver. Maybe it should be documented somewhere? It's still several times faster than opening the netcdf files directly.

@martindurant (Member) commented
It ought not to be too complex to fold this into intake-xarray. We do try to stay close to what xarray itself offers, so one could argue that if open_mfdataset accepts a list of URLs or paths, it should allow a list of storage_options per path too; then everyone gets this kind of workflow, not just intake users.

The data has many dimensions and most are not chunked regularly. There isn't a way around it as far as I know?

This requires ZEP003 in zarr. Please ping the discussion and this draft implementation: zarr-developers/zarr-python#1483

@observingClouds commented
I just came across this issue as I was searching for an option to merge two datasets originating from two kerchunk reference datasets with different chunk sizes.

I tested the workflow with xr.open_mfdataset and can confirm that URL chaining with several fo works!

import xarray as xr

xr.open_mfdataset(
    ["reference://::ref1.json", "reference://::ref2.json"],
    engine="zarr",
    storage_options={"remote_protocol": "s3",
                     "remote_options": {"anon": "true"}},
)

and it also works with intake:

sources:
  some_dataset:
    driver: zarr
    args:
      urlpath:
        - reference://::ref1.json
        - reference://::ref2.json
      storage_options:
        remote_protocol: s3
        remote_options:
          anon: true

@observingClouds commented
Here is a working example:

import intake
cat = intake.open_catalog("https://github.com/ISSI-CONSTRAIN/isccp/raw/main/catalog.yaml")
cat['ISCCP_BASIC_HGH'].to_dask()
