Propagating multi-variable indexes in dataarrays #7538

benbovy · 2023-02-16T13:20:13Z

benbovy
Feb 16, 2023
Maintainer

This has been discussed during the last Xarray community developers meeting. Briefly summarized, the problem is that multi-coordinate indexes may not be propagated properly in DataArray objects since the dimensions of a DataArray must correspond to the ones of the main array variable.

For example, let's consider this Dataset:

>>> ds
<xarray.Dataset>
Dimensions:  (x_c: 9, x_g: 9, y_c: 9, y_g: 9)
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Data variables:
    temp     (x_c, y_c) float64 ...
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

The x_g, y_g and x_c, y_c dimension coordinates are respectively representing the left and center node positions of a staggered grid with two X, Y physical axes. They are all backed by a GridIndex that allows grid-aware operations using Xarray's API directly. The temp data variable represents a scalar field on the grid (center nodes).

When handling the temp variable separately as a DataArray object we only keep the x_c and y_c dimensions of that variable, i.e., we lose the explicit relationship between the grid index and its x_g, y_g coordinates.

How to deal with this? Some ideas have been suggested at the meeting. Let me try to outline those (+ other) options below.

cc @dcherian @keewis @shoyer @TomNicholas

1. Drop the index

This is the easiest option but that's not convenient at all.

>>> ds["temp"]
<xarray.DataArray "temp" (x_c: 9, y_c: 9)>
...
Coordinates:
    x_c      (x_c) float64 ...
    y_c      (y_c) float64 ...
Indexes:
    *empty*
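
To make the dropping behaviour concrete, here is a minimal sketch of the rule discussed in this thread ("only keep coordinates/indexes as long as their dimensions are also on the DataArray itself"). It is a toy model on plain dicts, not Xarray's actual implementation; the function name `filter_coords_and_indexes` and the data layout are assumptions made for illustration only.

```python
# Toy model (not xarray internals): coordinates map a name to its dimensions,
# and each index maps a label to the set of coordinate names it is built from.
coords = {
    "x_c": ("x_c",),
    "x_g": ("x_g",),
    "y_c": ("y_c",),
    "y_g": ("y_g",),
}
indexes = {"GridIndex": {"x_c", "x_g", "y_c", "y_g"}}


def filter_coords_and_indexes(coords, indexes, array_dims):
    """Hypothetical helper: apply the 'dims must be on the array' rule."""
    array_dims = set(array_dims)
    # Keep a coordinate only if all of its dimensions are array dimensions.
    kept_coords = {
        name: dims for name, dims in coords.items() if set(dims) <= array_dims
    }
    # Drop an index as soon as one of its coordinates has been discarded.
    kept_indexes = {
        label: names
        for label, names in indexes.items()
        if names <= set(kept_coords)
    }
    return kept_coords, kept_indexes


kept_coords, kept_indexes = filter_coords_and_indexes(
    coords, indexes, array_dims=("x_c", "y_c")
)
print(sorted(kept_coords))  # ['x_c', 'y_c']
print(kept_indexes)  # {}
```

Because x_g and y_g are discarded, the multi-coordinate index cannot survive either, which is the outcome shown in the repr above.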

2. Keep the index

Propagate the index as-is.

>>> da = ds["temp"]
>>> da
<xarray.DataArray "temp" (x_c: 9, y_c: 9)>
...
Coordinates:
    x_c      (x_c) float64 ...
    y_c      (y_c) float64 ...
Indexes:
  ┌ x_c      GridIndex
  └ y_c

Here the GridIndex is the same object as in the Dataset. It still contains all grid information (e.g., it could wrap a PandasIndex for each of the x_g, y_g and x_c, y_c coordinates) but it just has two explicit coordinate references in the extracted DataArray. When using the index via the DataArray, the whole grid information is still used and may be updated, e.g.,

# 
# `GridIndex.sel()` called via `da.sel()` below will also internally
# subset its left node `x_g` labels!
#
>>> da.sel(x_c=[5, 6, 7])
<xarray.DataArray "temp" (x_c: 3, y_c: 9)>
...
Coordinates:
    x_c      (x_c) float64 ...
    y_c      (y_c) float64 ...
Indexes:
  ┌ x_c      GridIndex
  └ y_c
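
The idea that selecting center nodes also subsets the hidden left-node labels can be sketched with a small stand-in class. This is only an illustration under stated assumptions: `ToyGridIndex` is hypothetical, it handles a single axis, and the label values (centers at 0..8, left nodes half a cell earlier) are made up; a real GridIndex would be built on Xarray's Index API.

```python
class ToyGridIndex:
    """Hypothetical 1-D staggered grid index (not part of xarray)."""

    def __init__(self, x_c, x_g):
        if len(x_c) != len(x_g):
            raise ValueError("center and left node labels must have equal length")
        self.x_c = list(x_c)  # center node labels
        self.x_g = list(x_g)  # left node labels, one per center node

    def sel(self, x_c_labels):
        """Select by center labels and subset the left nodes consistently."""
        positions = [self.x_c.index(label) for label in x_c_labels]
        return ToyGridIndex(
            [self.x_c[i] for i in positions],
            [self.x_g[i] for i in positions],
        )


# Assumed labels: centers at 0, 1, ..., 8 and left nodes half a cell before.
index = ToyGridIndex(
    x_c=[float(i) for i in range(9)],
    x_g=[i - 0.5 for i in range(9)],
)
subset = index.sel([5, 6, 7])
print(subset.x_c)  # [5.0, 6.0, 7.0]
print(subset.x_g)  # [4.5, 5.5, 6.5]
```

The DataArray only exposes x_c, yet the index state for x_g changes too, which is why the hidden part of the index matters for later operations.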

Other operations may not be that straightforward, though. For example, converting back the DataArray to a Dataset may be ambiguous:

#
# New dimensions and coordinates magically appear??? How?
# 
>>> da.to_dataset()
<xarray.Dataset>
Dimensions:  (x_c: 9, x_g: 9, y_c: 9, y_g: 9)
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Data variables:
    temp     (x_c, y_c) float64 ...
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

3. Separate (but related) indexes

E.g., for the example above have two separate indexes for the center and left node positions, respectively:

>>> ds
<xarray.Dataset>
Dimensions:  (x_c: 9, x_g: 9, y_c: 9, y_g: 9)
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Data variables:
    temp     (x_c, y_c) float64 ...
Indexes:
  ┌ x_c      GridCenterIndex
  └ y_c
  ┌ x_g      GridLeftIndex
  └ y_g

>>> ds["temp"]
<xarray.DataArray "temp" (x_c: 9, y_c: 9)>
...
Coordinates:
    x_c      (x_c) float64 ...
    y_c      (y_c) float64 ...
Indexes:
  ┌ x_c      GridCenterIndex
  └ y_c

Where GridCenterIndex and GridLeftIndex would somehow point to each other. It is not very clear to me how this would work, though.

4. DataArray "auxiliary" dimensions

The DataArray data model would be augmented by the introduction of "auxiliary dimensions", i.e., all dimensions that are present in the DataArray coordinates but not in the main variable. This would allow propagating all index coordinates without touching the dimensions of the DataArray.

This would work very similarly to option 2, except that it is a bit more explicit (converting back to a Dataset would look less magical).

Auxiliary dimensions are not very useful on their own; they are just information that is propagated. Support would also be very limited, i.e., no direct interaction with them would be allowed (e.g., no da.isel(aux_dim=...)).

>>> da = ds["temp"]

>>> da.dims
{"x_c": 9, "y_c": 9}
>>> da.aux_dims
{"x_g": 9, "y_g": 9}

>>> da
<xarray.DataArray "temp" (x_c: 9, y_c: 9)>
...
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Auxiliary Dimensions: x_g, y_g
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

# 
# `x_g` coordinate also updated accordingly!
#
>>> da_sel = da.sel(x_c=[5, 6, 7])
>>> da_sel
<xarray.DataArray "temp" (x_c: 3, y_c: 9)>
...
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Auxiliary Dimensions: x_g, y_g
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

>>> da_sel.dims
{"x_c": 3, "y_c": 9}
>>> da_sel.aux_dims
{"x_g": 3, "y_g": 9}

>>> da_sel.to_dataset()
<xarray.Dataset>
Dimensions:  (x_c: 3, x_g: 3, y_c: 9, y_g: 9)
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Data variables:
    temp     (x_c, y_c) float64 ...
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

I haven't thought much about how easy or hard it would be to implement this, though. Not sure what kind of technical difficulties we would encounter.

5. Coordinates "auxiliary" dimensions

Very similar to option 4 but addresses the problem at the level of Xarray Coordinates (once we refactor in Xarray both indexes and coordinate variables into a unique Coordinates container encapsulated in Dataset / DataArray).

dcherian · 2023-02-16T19:54:14Z

dcherian
Feb 16, 2023
Maintainer

cc @Huite who's thought about this for unstructured grids.

0 replies

shoyer · 2023-02-17T00:40:09Z

shoyer
Feb 17, 2023
Maintainer

I don't like option (2), which would allow for indexes without coordinate variables. This seems like it breaks an important invariant (all indexes are also coordinates).

I think auxiliary dimensions on DataArray indexes/coordinates would be fine in principle, but we would need an updated rule for deciding when to keep a dimension around versus when to drop it. The current rule is "only keep coordinates/indexes as long as their dimensions are also on the DataArray itself".

I see at least three ways to do this:

1. Switch to a different rule for when to keep dimensions around on DataArray objects produced from a Dataset. This would allow for the desired ds['temp'] syntax, but would be a potentially breaking change.
2. Add a new method for creating a DataArray that allows for explicitly choosing which dimensions to keep around, e.g., ds.get('temp', aux_dims=['x_g', 'y_g']).
3. Use something new in Xarray's data model to identify "auxiliary" dimensions, and let users opt into this, e.g., perhaps we always keep around auxiliary dimensions if they are used on a multi-coordinate index. This allows for ds['temp'] syntax without breaking existing code.

Of these options, (2) and/or (3) are most appealing to me, because I doubt we can come up with new rules that would work well in every case.
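
As a rough illustration of the explicit opt-in variant (choosing which extra dimensions to keep when extracting an array), the sketch below extends the current rule with an `aux_dims` argument. It works on plain dicts; `get_array_coords` is a hypothetical helper, and `ds.get(..., aux_dims=...)` does not exist in Xarray today.

```python
# Toy model: each coordinate name maps to the tuple of its dimensions.
coords = {
    "x_c": ("x_c",),
    "x_g": ("x_g",),
    "y_c": ("y_c",),
    "y_g": ("y_g",),
}


def get_array_coords(coords, array_dims, aux_dims=()):
    """Hypothetical helper: keep coordinates whose dimensions are all either
    array dimensions or explicitly requested auxiliary dimensions."""
    allowed = set(array_dims) | set(aux_dims)
    return {name: dims for name, dims in coords.items() if set(dims) <= allowed}


# Default behaviour: same as the current rule, x_g and y_g are dropped.
default_coords = get_array_coords(coords, ("x_c", "y_c"))
# Explicit opt-in: x_g and y_g are kept as auxiliary dimensions.
opt_in_coords = get_array_coords(coords, ("x_c", "y_c"), aux_dims=["x_g", "y_g"])

print(sorted(default_coords))  # ['x_c', 'y_c']
print(sorted(opt_in_coords))  # ['x_c', 'x_g', 'y_c', 'y_g']
```

With all four coordinates retained, the multi-coordinate index would no longer have to be dropped, at the cost of users having to name the auxiliary dimensions themselves.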

1 reply

benbovy Feb 17, 2023
Maintainer Author

I doubt we can come up with new rules that would work well in every case.

Yes you are probably right.

Maybe an option 4 would be that it is up to the index itself to identify "auxiliary" dimensions? The current logic for filtering out the indexes from a set of selected coordinates is here (it simply drops the index if some of its coordinates are discarded). Instead of a general rule there could be some API entry point added to xarray.indexes.Index so that an index may decide to return itself / a new index / no index (+ coordinates) from a set of input coordinate names.

In the staggered GridIndex example, data variables either have (x_c, y_c) or (x_g, y_g) dimensions but likely not other combinations I guess. For ds['temp'] (or for a data variable defined on the left grid nodes), the GridIndex would return the full index and all of its coordinates. For other cases like ds['x_c'], a simple PandasIndex makes more sense than returning the full index (and it is actually more useful than dropping the index).

IMO the advantage of this option over options (2) and (3) is that there's probably no need for users to always choose which dimensions (coordinates) to keep around. User opt-in is also done pretty explicitly via .set_xindex().

Huite · 2023-02-20T09:40:47Z

Huite
Feb 20, 2023

Thanks @dcherian for the cc.

I've been working on some selection for unstructured grids, via the UGRID conventions.
In the case of a 2D grid topology, there are three "linked" dimensions: the nodes (vertices), the edges, and the faces (the cells). In the case of a UGRID index, these are inseparable. The data can be present on any of the three dimensions. (There are also UGRID 1D and 3D grid topologies, but they share the same principle.)

From this view, I think the auxiliary dimensions feel the most straightforward. For a 2D unstructured triangular mesh topology:

>>> da = ds["temp"]
>>> da.dims
{"mesh2d_n_face": 4}

>>> da.aux_dims
{"mesh2d_n_edge": 11, "mesh2d_n_node": 7, "mesh2d_n_node_per_face": 3, "two": 2}

>>> da
<xarray.DataArray "temp" (mesh2d_n_face: 4)>
...
Coordinates:
* mesh2d_face_node_connectivity (mesh2d_n_face, mesh2d_n_node_per_face) int64 ...
* mesh2d_node_x (mesh2d_n_node) float64 ...
* mesh2d_node_y (mesh2d_n_node) float64 ...
* mesh2d_edge_node_connectivity (mesh2d_n_edge, two) int64 ...
Dimensions: mesh2d_n_face
Auxiliary Dimensions: mesh2d_n_edge, mesh2d_n_node, mesh2d_n_node_per_face, two
Indexes:
  ┌ mesh2d_face_node_connectivity      UgridIndex
  │ mesh2d_node_x
  │ mesh2d_node_y
  └ mesh2d_edge_node_connectivity

For this data, you would only be able to broadcast / reduce on the non-auxiliary dimension (no direct interaction, as @benbovy mentions). The index, auxiliary dims and associated coords would be dropped along with the non-auxiliary dim (mesh2d_n_face).

It might happen that multiple auxiliary dimensions are required. E.g. adding layer and bounds:

>>> da = ds["temp"]
>>> da.dims
{"layer": 3, "mesh2d_n_face": 4}

>>> da.aux_dims
{"mesh2d_n_edge": 11, "mesh2d_n_node": 7, "mesh2d_n_node_per_face": 3, "two": 2, "layer_n_bound": 2}

>>> da
<xarray.DataArray "temp" (layer: 3, mesh2d_n_face: 4)>
...
Coordinates:
* mesh2d_face_node_connectivity (mesh2d_n_face, mesh2d_n_node_per_face) int64 ...
* mesh2d_node_x (mesh2d_n_node) float64 ...
* mesh2d_node_y (mesh2d_n_node) float64 ...
* mesh2d_edge_node_connectivity (mesh2d_n_edge, two) int64 ...
* layer_bounds (layer, mesh2d_n_face, layer_n_bound) float64 ...
Dimensions: layer, mesh2d_n_face
Auxiliary Dimensions: mesh2d_n_edge, mesh2d_n_node, mesh2d_n_node_per_face, two, layer_n_bound
Indexes:
  ┌ mesh2d_face_node_connectivity      UgridIndex
  │ mesh2d_node_x
  │ mesh2d_node_y
  └ mesh2d_edge_node_connectivity
  - layer_bounds                       IntervalIndex  # or something

For a dataset, a set_aux_dims() method could be included, creating the "linked" dimensions:

ds = ds.set_aux_dims({
        "layer": ("layer_n_bound",),
        "mesh2d_n_face": ("mesh2d_n_edge", "mesh2d_n_node", "mesh2d_n_node_per_face", "two"),
    })

Or indeed an optional argument to the set_xindex method, since this is the only time where it comes up?

ds = ds.set_xindex(
    coord_names=(
        "mesh2d_face_node_connectivity",
        "mesh2d_node_x",
        "mesh2d_node_y",
        "mesh2d_edge_node_connectivity",
    ),
    aux_dims={
        "mesh2d_n_face": ("mesh2d_n_edge", "mesh2d_n_node", "mesh2d_n_node_per_face", "two"),
    },
)

Then ds["temp"] would just work, which I would personally greatly prefer over ds.get("temp", aux_dims=...).

In the example above, both layer_n_bound and two are dimensions of size 2. In existing netCDF files, they might just share the two dimension because someone thought it pragmatic. This shouldn't be a problem, because a user can't interact with them directly? The logic could be mostly encapsulated in the index as @benbovy suggests.

This seems like the most explicit, least magical way to me. It would cover the needs for dealing with UGRID unstructured grids, and I think it would work for bounds coordinates as well. What other types of coordinates require auxiliary dimensions?

2 replies

benbovy Feb 20, 2023
Maintainer Author

Thanks @Huite for providing such a detailed example, that's very helpful!

benbovy Feb 20, 2023
Maintainer Author

For a dataset, a set_aux_dims() method could be included, creating the "linked" dimensions:

Or indeed an optional argument to the set_xindex method, since this is the only time where it comes up?

I think we wouldn't even need to explicitly provide auxiliary dimensions. Unless those are useful outside the context of indexes, this is something that can be computed dynamically from the dimensions of the DataArray variable and the union of the dimensions of its coordinates.
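
A minimal sketch of that computation, assuming dimension sizes are available as plain dicts: the auxiliary dimensions are whatever remains after removing the array dimensions from the union of the coordinates' dimensions. The function name `compute_aux_dims` and the data layout are hypothetical.

```python
def compute_aux_dims(array_dims, coord_dims):
    """Hypothetical helper: derive auxiliary dimensions dynamically.

    array_dims: dict mapping each DataArray dimension to its size.
    coord_dims: dict mapping each coordinate name to a {dim: size} dict.
    """
    aux = {}
    for dims in coord_dims.values():
        for dim, size in dims.items():
            if dim in array_dims:
                continue  # a regular dimension of the main variable
            if aux.setdefault(dim, size) != size:
                raise ValueError(f"conflicting sizes for dimension {dim!r}")
    return aux


# Sizes taken from the staggered grid example after da.sel(x_c=[5, 6, 7]).
array_dims = {"x_c": 3, "y_c": 9}
coord_dims = {
    "x_c": {"x_c": 3},
    "x_g": {"x_g": 3},
    "y_c": {"y_c": 9},
    "y_g": {"y_g": 9},
}
aux_dims = compute_aux_dims(array_dims, coord_dims)
print(aux_dims)  # {'x_g': 3, 'y_g': 9}
```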

What I suggest in #7538 (reply in thread) is roughly something like:

# in xarray/core/indexes.py

class Index:

    ...

    def get_dataarray_coordinates(
        self,
        coords: Coordinates,
        array_dims: set[Hashable],
    ) -> Coordinates:
        """Given the dimensions of the DataArray variable, returns a collection of
        coordinates and their index(es).
        """
        raise NotImplementedError()

For UGridIndex, the implementation would return the full index (self) and all mesh coordinates variables - face_node_connectivity, node_x, node_y, edge_node_connectivity - for (almost?) all combinations, e.g., array_dims = {"mesh2d_n_face"}, array_dims = {"mesh2d_n_node"}, array_dims = {"mesh2d_n_edge", "two"}, etc.

Xarray will then collect all coordinates returned by each index and also unindexed coordinates that match (a subset of) the DataArray dimensions.
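
The following toy sketch shows how such a per-index hook and the collection step could fit together for the staggered grid example. It is not Xarray code: the classes are stand-ins, coordinates are reduced to name-to-dimensions dicts, and the hook returns an (index, coordinates) pair instead of a Coordinates object to keep the example short.

```python
class ToyIndex:
    """Stand-in base class; coord_dims maps coordinate names to dims."""

    def __init__(self, coord_dims):
        self.coord_dims = dict(coord_dims)

    def get_dataarray_coordinates(self, array_dims):
        raise NotImplementedError()


class ToySimpleIndex(ToyIndex):
    """Single-coordinate index, propagated only if its dims are on the array."""

    def get_dataarray_coordinates(self, array_dims):
        if all(set(dims) <= array_dims for dims in self.coord_dims.values()):
            return self, dict(self.coord_dims)
        return None, {}


class ToyGridIndex(ToyIndex):
    """Staggered grid index that decides by itself what to propagate."""

    def get_dataarray_coordinates(self, array_dims):
        # Variables on the center or left nodes: keep the full index.
        if array_dims in ({"x_c", "y_c"}, {"x_g", "y_g"}):
            return self, dict(self.coord_dims)
        # A single matching coordinate (e.g., ds["x_c"]): fall back to a
        # simple index on that coordinate rather than dropping everything.
        matching = {
            name: dims
            for name, dims in self.coord_dims.items()
            if set(dims) <= array_dims
        }
        if len(matching) == 1:
            return ToySimpleIndex(matching), matching
        return None, {}


def collect_coords(indexes, unindexed_coords, array_dims):
    """Gather what each index returns, plus matching unindexed coordinates."""
    array_dims = set(array_dims)
    coords, kept_indexes = {}, []
    for index in indexes:
        new_index, index_coords = index.get_dataarray_coordinates(array_dims)
        if new_index is not None:
            kept_indexes.append(new_index)
            coords.update(index_coords)
    for name, dims in unindexed_coords.items():
        if set(dims) <= array_dims:  # current rule for unindexed coordinates
            coords[name] = dims
    return coords, kept_indexes


grid = ToyGridIndex(
    {"x_c": ("x_c",), "x_g": ("x_g",), "y_c": ("y_c",), "y_g": ("y_g",)}
)
unindexed = {"mask": ("x_c", "y_c"), "depth": ("z",)}  # made-up extra coords

temp_coords, temp_indexes = collect_coords([grid], unindexed, ("x_c", "y_c"))
xc_coords, xc_indexes = collect_coords([grid], unindexed, ("x_c",))
print(sorted(temp_coords))  # ['mask', 'x_c', 'x_g', 'y_c', 'y_g']
print(sorted(xc_coords))  # ['x_c']
```

In this sketch the rule for unindexed coordinates is unchanged, and all index-specific decisions live in the index class, which is the division of responsibility suggested above.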

dcherian · 2023-02-21T16:53:41Z

dcherian
Feb 21, 2023
Maintainer

Thanks for writing this up @benbovy !

My proposal was for (2) BUT with explicit propagation of all coordinate variables needed by the Indexes associated with the DataArray's dims: x_c, y_c

>>> da = ds["temp"]

>>> da
<xarray.DataArray "temp" (x_c: 9, y_c: 9)>
...
Coordinates:
  * x_c      (x_c) float64 ...
  * x_g      (x_g) float64 ...
  * y_c      (y_c) float64 ...
  * y_g      (y_g) float64 ...
Indexes:
  ┌ x_c      GridIndex
  │ x_g
  │ y_c
  └ y_g

Some things I like about this are:
a. backwards compatible.
b. The user can easily tell that the whole Index and associated variables are propagated.
c. It satisfies the invariant of all index variables being coordinate variables.
d. It's consistent with the internal model of DataArray as a collection of Variables with a single special Variable.
e. It's basically "auxiliary dimensions" with the rule that the dimensions propagated are those that are encapsulated in the Index objects that represent the "primary dimensions".

I'd prefer the public message be (c) instead of this one (d). It seems a lot easier to communicate.

I think this proposal is basically what @benbovy is saying here:

For UGridIndex, the implementation would return the full index (self) and all mesh coordinates variables - face_node_connectivity, node_x, node_y, edge_node_connectivity - for (almost?) all combinations, e.g., array_dims = {"mesh2d_n_face"}, array_dims = {"mesh2d_n_node"}, array_dims = {"mesh2d_n_edge", "two"}, etc. Xarray will then collect all coordinates returned by each index and also unindexed coordinates that match (a subset of) the DataArray dimensions.

3 replies

benbovy Feb 21, 2023
Maintainer Author

@dcherian I agree that if there is no other use case for "auxiliary dimensions" than propagating multi-variable indexes in dataarrays, then there's no need of communicating about it loudly. I find adding a da.aux_dims property and/or one line in the repr a little more transparent and/or less surprising than nothing, though.

dcherian Feb 21, 2023
Maintainer

I find adding a da.aux_dims property and/or one line in the repr a little more transparent and/or less surprising than nothing, though.

They would be in the repr under Coordinates (this is how my proposal differs from your (2)). Also, these may include nD variables, so "aux_vars" may be better than aux_dims.

benbovy Feb 22, 2023
Maintainer Author

Also these may include nD variables; so "aux_vars" may be better than aux_dims

Yes, that makes sense. And it is equivalent to the definition of auxiliary coordinate variables in the CF conventions, isn't it?

After thinking more about it, maybe there is nothing worth updating in the Xarray data model? For unindexed coordinates we still apply the current rule "only keep coordinates as long as their dimensions are also on the DataArray itself" and for each index we let it choose how to propagate itself and its coordinates.

This is highly flexible but this is also a lot of freedom (and thus responsibility) given to the indexes. I can imagine Xarray issue reports saying "coordinates are suddenly dropped for no apparent reason" actually caused by some bug in a 3rd party index. Maybe not a big deal, though?
