Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexError for .dods for certain datasets #59

Open
sjordan29 opened this issue May 17, 2024 · 10 comments
Open

IndexError for .dods for certain datasets #59

sjordan29 opened this issue May 17, 2024 · 10 comments

Comments

@sjordan29
Copy link

Hi @abkfenris, I've been maintaining Catalog-To-Xpublish, and we're having some issues with the opendap response for one of the USGS datasets we're working with. Most datasets available do work, so this seems to be anomaly to work out with this dataset.

The Problem Dataset

We're having issues with PRISM_v2; you can view the xarray structure here: https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/

The Problem

When try to open with xarray like so:

OPENDAP_URL = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap'
xr_dataset = xr.open_dataset(
    OPENDAP_URL,
    engine='netcdf4',
)

I get the following error: IndexError: The indexing operation you are attempting to perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load().

Thoughts

Seems like this might have something to do with the tbnd dimension (which is not a coordinate) and/or its associated variable, time_bnds. Note that we don't see this issue when I just request ppt variable, which has dimensions of lat/lon/time.

## time_bnds
dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?time_bnds'
nc_dataset = nc.Dataset(dods_url)
# opening w/ xarray fails with IndexError:
# The indexing operation you are attempting to
# perform is not valid on netCDF4.Variable object.
xr_dataset = xr.open_dataset(
  xr.backends.NetCDF4DataStore(
    nc_dataset
  ),
  decode_times=False
)

## ppt
dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?ppt'
nc_dataset = nc.Dataset(dods_url)
# This works.
xr_dataset = xr.open_dataset(
  xr.backends.NetCDF4DataStore(
    nc_dataset
  ),
  decode_times=False
)

As I understand it, xpublish-opendap reformats xarray datasets using dap_xarray to a compatible format to leverageopendap-protocol . I'm wondering if there's something going on with some of these translations between data structures that might be causing this problem. Do you have any ideas of what might be going on here? Happy to help rig up a solve, but your understanding of these libraries would be super helpful to help me nail down the problem.

@abkfenris
Copy link
Member

Hi @sjordan29 ,

I actually just ran into that same IndexError today while trying to work on xpublish-community/xpublish#246 . From the structure of your dataset, I'm guessing it's also an error with the coordinate attributes not currently being translated correctly into an OpenDAP representation, and Xarray is getting a little cranky when trying to open it.

Any chance you could open the original Zarr source locally, and slice it down to a minimum dataset that causes that error and share a NetCDF of that?

That way I can throw it into the test suite and compare what Xarray opens directly vs what it can load over OpenDAP.

@sjordan29
Copy link
Author

Thanks so much! I added a slice of the data and the notebook showing the error I'm getting on our fork here: https://github.com/LimnoTech/xpublish-opendap/tree/index-error/notebooks/dev_sandbox

@aufdenkampe
Copy link

@abkfenris, thanks for looking into this issue.

BTW, USGS just made their xpublish-builder repo public, which shows how we pull everything together to create the OpenDAP endpoints exposed at https://labs-beta.waterdata.usgs.gov/api/xpublish/catalogs. Our hope that this builder repo might be useful for you or others working on https://ioos.github.io/ioos-code-sprint/2024/topics/06-xpublish.html

@abkfenris
Copy link
Member

Great, thank you both! I'll try to take a look later or tomorrow.

@abkfenris
Copy link
Member

abkfenris commented May 22, 2024

Hmm, it looks like we're (at least) generating different .dds compared to what THREDDS will generate for this dataset, so I'm going to go back to the basics to see how to generate these.

THREDDS DDS

Dataset {
    Float32 time_bnds[time = 2][tbnd = 2];
    Float32 time[time = 2];
} datasets/PRISM_v2_slice.nc;

xpublish DDS

Dataset {
    Float32 time[time = 2];
    Float64 tbnd[tbnd = 2];
    Grid {
      Array:
        String time_bnds[time = 2][tbnd = 2];
      Maps:
        Float32 time[time = 2];
        Float64 tbnd[tbnd = 2];
    } time_bnds;
} ds;

abkfenris added a commit that referenced this issue May 23, 2024
Make sure that dataarrays are encoded as well to hopefully solve IndexError issues when datetimes are not coordinates.

Works on #59
This was referenced May 23, 2024
@abkfenris
Copy link
Member

abkfenris commented May 30, 2024

Ok, got a bit farther into the weeds here.

When Hyrax or THREDDS encounters a dataarray that has a non coordinate dimension, instead of returning a DAP Grid with the dimensions mapped, they instead return a DAP Array with multiple dimensions.

Currently the underlying OpenDAP library that we are using is hard coded to only respond with single dimension arrays that aren't Grids.

https://github.com/MeteoSwiss/opendap-protocol/blob/5d8b1261847c969cfa2eedc2a42b31c84b48035e/opendap_protocol/protocol.py#L443-L447

In #60 I've gone and added some extra encoding checks and a custom OpenDAP Array that encodes dimensions similarly to Hyrax and THREDDS. This seems to take care of IndexErrors in some cases.

I've included your PRISM dataset as test data and am running automated tests against it, to make sure we don't have a regression. Strangely with Py.test locally and in the Github Actions matrix it passes, but when tested locally with a nox matrix it fails due to IndexErrors so I'm even more confused.

It would be great if you all could try testing that PR and let me know if that does the trick.

Some related issues I found along the way:

@sjordan29
Copy link
Author

@abkfenris, thanks so much for digging into this! I will test #60 with PRISM in our Catalog-To-Xpublish implementation and let you know how it goes.

@sjordan29
Copy link
Author

Opening With xarray

Just took a look, and the following still gives me an IndexError

opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap'
xr.open_dataset(
    opendap_url,
    engine='netcdf4',
    decode_times=False,
)

However, I can query the time_bnds variable individually now, which did not work before:

dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?time_bnds'
xr_dataset = xr.open_dataset(
  dods_url,
  decode_times=False
)

image

I can query every single individual coordinate and variable on its own (and together in a comma separated list) except for the tbnd dimension, and the whole xarray will lazy load:

opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?crs,ppt,tmn,tmx,time_bnds'`

data_array = xr.open_dataset(
    opendap_url,
    engine='netcdf4',
    decode_times=False,
)

image

So this feels really close. It's just that opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap' that still gives an IndexError

I do see the changes reflected in the dds and das response.

@abkfenris
Copy link
Member

Can you try building a new environment? I found that I was sporadically getting those IndexErrors in testing when Nox resolved a different environment than I had locally or in Github Actions on this PR. I compared pip list outputs but didn't find any version differences that jumped out at me.

@sjordan29
Copy link
Author

Hi @abkfenris, just got to building a new environment -- apologies for the delay. Still getting the error. A couple notes:

  1. We're building Catalog-To-Xpublish in docker, so that environment re-builds every time I build the docker image from scratch. The environment we're using can be found here: https://code.usgs.gov/wma/nhgf/xpublish/-/tree/xpublish-opendap?ref_type=heads. Attaching the pip list from inside that docker container.
    docker-pip-list.txt

  2. Just rebuilt my local environment used when requesting against the service as well, and still encountered the same behavior as in my most recent comment. Attaching the pip list from that local environment as well.
    local-pip-list.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants