[pull] main from pydata:main #582

pull · 2024-10-14T01:27:35Z

See Commits and Changes for more details.


* Reimplement DataTree aggregations

  They now allow for dimensions that are missing on particular nodes, and use Xarray's standard generate_aggregations machinery, like aggregations for DataArray and Dataset. Fixes #8949, #8963

* add API docs on DataTree aggregations
* remove incorrectly added sel methods
* fix docstring reprs
* mypy fix
* fix self import
* remove unimplemented agg methods
* replace dim_arg_to_dims_set with parse_dims
* add parse_dims_as_set
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix mypy errors
* change tests to match slightly different error now thrown

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: TomNicholas <[email protected]>

…ee function (#9614) * updating group type annotation for netcdf, hdf5, and zarr open_datatree function * supporting only in group type annotation for netcdf, hdf5, and zarr open_datatree function * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Rename inherited -> inherit in DataTree.to_dataset * fixed one missed instance of kwarg from #9602 --------- Co-authored-by: Tom Nicholas <[email protected]>

* remove too-long underline * draft section on data alignment * fixes * draft section on coordinate inheritance * various improvements * more improvements * link from other page * align call include all 3 datasets * link back to use cases * clarification * small improvements * remove TODO after #9532 * add todo about #9475 * correct xr.align example call * add links to netCDF4 documentation * Consistent voice Co-authored-by: Maximilian Roos <[email protected]> * keep indexes in lat lon selection to dodge #9475 * unpack generator properly Co-authored-by: Stephan Hoyer <[email protected]> * ideas for next section * briefly summarize what alignment means * clarify that it's the data in each node that was previously unrelated * fix incorrect indentation of code block * display the tree with redundant coordinates again * remove content about non-inherited coords for a follow-up PR * remove todo * remove todo now that aggregations are re-implemented * remove link to (unmerged) migration guide * remove todo about improving error message * correct statement in data-structures docs * fix internal link --------- Co-authored-by: Maximilian Roos <[email protected]> Co-authored-by: Stephan Hoyer <[email protected]>

* test unary op * implement and generate unary ops * test for unary op with inherited coordinates * re-enable arithmetic tests * implementation for binary ops * test ds * dt commutativity * ensure other types defer to DataTree, thus fixing #9365 * test for inplace binary op * pseudocode implementation of inplace binary op, and xfail test * remove some unneeded type: ignore comments * return type should be DataTree * type datatree ops as accepting dataset-compatible types too * use same type hinting hack as Dataset does for __eq__ not being same as Mapping * ignore return type * add some methods to api docs * don't try to import DataTree.astype in API docs * test to check that single-node trees aren't broadcast * return NotImplemented * remove pseudocode for inplace binary ops * map_over_subtree -> map_over_datasets

* sketch of migration guide * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * whatsnew * add date * spell out API changes in more detail * details on backends integration * explain alignment and open_groups * explain coordinate inheritance * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * re-trigger CI * remove bullet about map_over_subtree * Markdown formatting for important warning block Co-authored-by: Matt Savoie <[email protected]> * Reorder changes in order of importance Co-authored-by: Matt Savoie <[email protected]> * Clearer wording on setting relationships Co-authored-by: Matt Savoie <[email protected]> * remove "technically" Co-authored-by: Matt Savoie <[email protected]> --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Matt Savoie <[email protected]>

As mentioned in #2157, the docstring of `Dataset.groupby` does not reflect deprecation of squeeze (as the docstring of `DataArray.groupby` does) and states an incorrect default value.

* Add inherit=False option to DataTree.copy()

  This PR adds an inherit=False option to DataTree.copy, so users can decide if they want to inherit coordinates from parents or not when creating a subtree. The default behavior is `inherit=True`, which is a breaking change from the current behavior where parent coordinates are dropped (which I believe should be considered a bug).

* fix typing
* add migration guide note
* ignore typing error

* Bug fixes for DataTree indexing and aggregation My implementation of indexing and aggregation was incorrect on child nodes, re-creating the child nodes from the root. There was also another bug when indexing inherited coordinates that meant formerly inherited coordinates were incorrectly dropped from results. * disable broken test

@headtr1ck

as suggested by @headtr1ck in #9628 (comment)

* type hints for datatree ops tests * type hints for datatree aggregations tests * type hints for datatree indexing tests * type hint a lot more tests * more type hints

* Add zip_subtrees for paired iteration over DataTrees

  This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

  I have also added a minimal implementation of breadth-first search with an explicit queue, replacing the current recursion-based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

      In [1]: import xarray as xr
      In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})
      In [3]: %timeit _ = list(tree.subtree)
      # on main
      87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
      # with this branch
      55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

* fix pytype error
* Tweaks per review
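For readers unfamiliar with the technique named in this commit message, here is a minimal sketch of breadth-first traversal with an explicit queue. It is not xarray's implementation: the nested-dict tree and the name `iter_breadth_first` are assumptions made only for illustration.

```python
from collections import deque
from typing import Iterator


def iter_breadth_first(root: dict, root_path: str = "/") -> Iterator[str]:
    """Yield node paths level by level, using a FIFO queue instead of recursion.

    `root` is a hypothetical stand-in for a tree: a nested dict mapping
    child name -> child subtree.
    """
    queue = deque([(root_path, root)])
    while queue:
        path, node = queue.popleft()  # oldest entry first gives breadth-first order
        yield path
        for name, child in node.items():
            child_path = path.rstrip("/") + "/" + name
            queue.append((child_path, child))


tree = {"a": {"c": {}, "d": {}}, "b": {"e": {}}}
paths = list(iter_breadth_first(tree))
print(paths)
```

Because the queue is explicit, the traversal depth is not limited by Python's recursion limit, and every node at one level is visited before any node at the next.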

If the file is empty (or contains no variables matching any filtering done by the backend), use a different error message indicating that, rather than suggesting that the file has too many variables for this function.
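The distinction described here can be sketched as follows. This is an illustrative sketch only: `pick_single_variable` is a hypothetical helper, and the error messages are placeholders, not the wording xarray uses.

```python
def pick_single_variable(data_vars: dict):
    """Return the only variable, with separate errors for zero and for many.

    `data_vars` is a hypothetical mapping of variable name -> data, standing in
    for whatever the backend returned after its own filtering.
    """
    if len(data_vars) == 0:
        # Empty file, or the backend filtered every variable out.
        raise ValueError("no data variables found: file is empty or all were filtered out")
    if len(data_vars) > 1:
        # Too many variables for a function that returns a single array.
        raise ValueError("more than one data variable found: open it as a dataset instead")
    (value,) = data_vars.values()
    return value


single = pick_single_variable({"temperature": [280.0, 281.5]})
```

Checking the empty case first keeps the "too many variables" message from being shown for a file that contains none.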

* Updates to DataTree.equals and DataTree.identical

  In contrast to `equals`, `identical` now also checks that any inherited variables are inherited on both objects. However, they do not need to be inherited from the same source. This aligns the behavior of `identical` with the DataTree `__repr__`.

  I've also removed the `from_root` argument from `equals` and `identical`. If a user wants to compare trees from their roots, a better (simpler) interface is to simply call these methods on the `.root` properties.

  I would also like to remove the `strict_names` argument, but that will require switching to use the new `zip_subtrees` (#9623) first.

* More efficient check for inherited coordinates


* Fix error and probably missing code cell in io.rst * Make this even simpler, remove link to same section

* Replace black with ruff-format * Fix formatting mistakes moving mypy comments * Replace black with ruff in the contributing guides

* Add zip_subtrees for paired iteration over DataTrees

  This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined.

  I have also added a minimal implementation of breadth-first search with an explicit queue, replacing the current recursion-based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark:

      In [1]: import xarray as xr
      In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})
      In [3]: %timeit _ = list(tree.subtree)
      # on main
      87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
      # with this branch
      55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

* fix pytype error
* Re-implement map_over_datasets

  The main changes:

  - It is implemented using zip_subtrees, which means it should properly handle DataTrees where the nodes are defined in a different order.
  - For simplicity, I removed handling of `**kwargs`, in order to preserve some flexibility for adding keyword arguments.
  - I removed automatic skipping of empty nodes, because there are almost assuredly cases where that would make sense. This could be restored with an optional keyword argument.

* fix typing of map_over_datasets
* add group_subtrees
* wip fixes
* update isomorphic
* documentation and API change for map_over_datasets
* mypy fixes
* fix test
* diff formatting
* more mypy
* doc fix
* more doc fix
* add api docs
* add utility for joining path on windows
* docstring
* add an overload for two return values from map_over_datasets
* partial fixes per review
* fixes per review
* remove a couple of xfails
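The order-independence goal described above can be illustrated with a small sketch that pairs nodes by child name rather than by position. Everything here is hypothetical: the dict-based nodes, `node`, `zip_nodes`, and `map_over_nodes` are stand-ins, and the traversal order is simply the easiest one to write, not necessarily the one xarray uses.

```python
from typing import Any, Callable, Iterator


def node(data: Any, **children: dict) -> dict:
    """Build a hypothetical tree node holding a value and named children."""
    return {"data": data, "children": children}


def zip_nodes(a: dict, b: dict, path: str = "/") -> Iterator[tuple]:
    """Yield (path, node_a, node_b), matching children by name, not position."""
    # dict key views compare like sets, so definition order is irrelevant.
    if a["children"].keys() != b["children"].keys():
        raise ValueError(f"children at {path!r} do not match")
    yield path, a, b
    for name in a["children"]:
        child_path = path.rstrip("/") + "/" + name
        yield from zip_nodes(a["children"][name], b["children"][name], child_path)


def map_over_nodes(func: Callable[[Any, Any], Any], a: dict, b: dict) -> dict:
    """Apply func to the data of each matched pair of nodes, keyed by path."""
    return {path: func(x["data"], y["data"]) for path, x, y in zip_nodes(a, b)}


left = node(1, x=node(10), y=node(20))
right = node(2, y=node(200), x=node(100))  # same children, defined in reverse order
result = map_over_nodes(lambda p, q: p + q, left, right)
print(result)
```

Pairing by name means `left + right` and `right + left` combine the same nodes, which is the property the commit message asks for in tree arithmetic.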

* _inherited_vars -> inherited_vars * implementation using Coordinates * datatree.DataTree -> xarray.DataTree * only show inherited coordinates on root * test that there is an Inherited coordinates header

* flox: Properly propagate multiindex Closes #9648 * skip test on old pandas * small optimization * fix

* Fix multiple grouping with missing groups Closes #9360 * Small repr improvement * Small optimization in mask * Add whats-new * fix doctests

…ests (#9651) * Add close() method to DataTree and clean-up open files in tests This removes a bunch of warnings that were previously issued in unit-tests. * Unit tests for closing functionality

…ap_blocks`` (#9658) * Reduce graph size through writing indexes directly into graph for map_blocks * Reduce graph size through writing indexes directly into graph for map_blocks * Update xarray/core/parallel.py --------- Co-authored-by: Deepak Cherian <[email protected]>

* Remove zarr pin * Define zarr_v3 helper * zarr-v3: filters / compressors -> codecs * zarr-v3: update tests to avoid values equal to fillValue * Various test fixes * zarr_version fixes * removed open_consolidated workarounds * removed _store_version check * pass through zarr_version * fixup! zarr-v3: filters / compressors -> codecs * fixup! fixup! zarr-v3: filters / compressors -> codecs * fixup * path / key normalization in set_variables * fixes * workaround nested consolidated metadata * test: avoid fill_value * test: Adjust call counts * zarr-python 3.x Array.resize doesn't mutate * test compatibility - skip write_empty_chunks on 3.x - update patch targets * skip ZipStore with_mode * test: more fill_value avoidance * test: more fill_value avoidance * v3 compat for instrumented test * Handle zarr_version / zarr_format deprecation * wip * most Zarr tests passing * unskip tests * add custom Zarr _FillValue encoding / decoding * relax dtype comparison in test_roundtrip_empty_vlen_string_array * fix test_explicitly_omit_fill_value_via_encoding_kwarg * fix test_append_string_length_mismatch_raises * fix test_check_encoding_is_consistent_after_append for v3 * skip roundtrip_endian for zarr v3 * unskip datetimes and fix test_compressor_encoding * unskip tests * add back dtype skip * point upstream to v3 branch * Create temporary directory before using it * Avoid zarr.storage.zip on v2 * fixed close_store_on_close bug * Remove workaround, fixed upstream * Restore original `w` mode. 
* workaround for store closing with mode=w * typing fixes * compat * Remove unnecessary pop * fixed skip * fixup types * fixup types * [test-upstream] * Update install-upstream-wheels.sh * set use_consolidated to false when user provides consolidated=False * fix: import consolidated_metadata from package root * fix: relax instrumented store checks for v3 * Adjust 2.18.3 thresholds * skip datatree zarr tests w/ zarr 3 for now * fixed kvstore usage * typing fixes * move zarr.codecs import * fixup ignores * storage options fix, skip * fixed types * Update ci/install-upstream-wheels.sh * type fixes * whats-new * Update xarray/tests/test_backends_datatree.py * fix type import * set mapper, chunk_mapper * Pass through zarr_format * Fixup * more cleanup * revert test changes * Update xarray/backends/zarr.py * cleanup * update docstring * fix rtd * tweak --------- Co-authored-by: Ryan Abernathey <[email protected]> Co-authored-by: Joe Hamman <[email protected]> Co-authored-by: Deepak Cherian <[email protected]> Co-authored-by: Deepak Cherian <[email protected]>

* support chunking and default values in `open_groups` * same for `open_datatree` * use `group_subtrees` instead of `map_over_datasets` * check that `chunks` on `open_datatree` works * specify the chunksizes when opening from disk * check that `open_groups` with chunks works, too * require dask for `test_open_groups_chunks` * protect variables from write operations * copy over `_close` from the backend tree * copy a lot of the docstring from `open_dataset` * same for `open_groups` * reuse `_protect_dataset_variables_inplace` * final missing `requires_dask` * typing for the test utils Co-authored-by: Tom Nicholas <[email protected]> * type hints for `_protect_datatree_variables_inplace` Co-authored-by: Tom Nicholas <[email protected]> * type hints for `_protect_dataset_variables_inplace` * copy over the name of the backend tree Co-authored-by: Tom Nicholas <[email protected]> * typo * swap the order of arguments to `assert_identical` * try explicitly typing `data` * typo * use `Hashable` for variable names --------- Co-authored-by: Tom Nicholas <[email protected]> Co-authored-by: Tom Nicholas <[email protected]>

* implement `compute` and `load` * also shallow-copy variables * implement `chunksizes` * add tests for `load` * add tests for `chunksizes` * improve the `load` tests using `DataTree.chunksizes` * add a test for `compute` * un-xfail a xpassing test * implement and test `DataTree.chunk` * link to `Dataset.load` Co-authored-by: Tom Nicholas <[email protected]> * use `tree.subtree` to get absolute paths * filter out missing dims before delegating to `Dataset.chunk` * fix the type hints for `DataTree.chunksizes` * try using `self.from_dict` instead * type-hint intermediate test variables * use `_node_dims` instead * raise on unknown chunk dim * check that errors in `chunk` are raised properly * adapt the docstrings of the new methods * allow computing / loading unchunked trees * reword the `chunksizes` properties * also freeze the top-level chunk sizes * also reword `DataArray.chunksizes` * fix a copy-paste error * same for `NamedArray.chunksizes` --------- Co-authored-by: Tom Nicholas <[email protected]>
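Two of the bullets above ("filter out missing dims before delegating to `Dataset.chunk`" and "raise on unknown chunk dim") describe a per-node filtering step. A minimal sketch of that idea, assuming each node is summarized only by its set of dimension names; `chunks_per_node` is a hypothetical helper, not part of xarray:

```python
def chunks_per_node(node_dims: dict, chunks: dict) -> dict:
    """Split a tree-wide chunk request into per-node requests.

    `node_dims` maps a node path to the set of dimension names on that node.
    A dimension absent from one node is dropped for that node only; a dimension
    absent from every node is treated as an error.
    """
    all_dims = set().union(*node_dims.values())
    unknown = chunks.keys() - all_dims
    if unknown:
        raise ValueError(f"chunk dimensions {sorted(unknown)} not found on any node")
    return {
        path: {dim: size for dim, size in chunks.items() if dim in dims}
        for path, dims in node_dims.items()
    }


node_dims = {"/": {"time"}, "/coarse": {"time", "x"}, "/fine": {"y"}}
per_node = chunks_per_node(node_dims, {"time": 10, "x": 5})
print(per_node)
```

Each per-node dict could then be handed to that node's own chunking call, so a node lacking a dimension is skipped for it instead of raising.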

Co-authored-by: Tom Nicholas <[email protected]>

* use zarr v3 dimension_names * Update xarray/backends/zarr.py Co-authored-by: Stephan Hoyer <[email protected]> * Update xarray/backends/zarr.py Co-authored-by: Joe Hamman <[email protected]> --------- Co-authored-by: Stephan Hoyer <[email protected]> Co-authored-by: Joe Hamman <[email protected]> Co-authored-by: Deepak Cherian <[email protected]> Co-authored-by: Joe Hamman <[email protected]>

* adding draft for fixing behaviour for group parameter * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * new trial * new trial * fixing duplicate paths and path in the root group * removing yield str(gpath) * implementing the proposed solution to hdf5 and netcdf backends * adding changes to whats-new.rst * removing encoding['source_group'] line to avoid conflicts with PR #9660 * adding test * adding test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adding assert subgroup_tree.root.parent is None * modifying tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update xarray/tests/test_backends_datatree.py Co-authored-by: Justus Magin <[email protected]> * applying suggested changes * updating test * adding Justus and Alfonso to the list of contributors to the DataTree entry * adding Justus and Alfonso to the list of contributors to the DataTree entry --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tom Nicholas <[email protected]> Co-authored-by: Justus Magin <[email protected]>

* check that the length of fixed-width numpy strings is reset * drop the length from numpy's fixed-width string dtypes * compatibility with `numpy<2` * use `issubdtype` instead * some more test cases * more details in the comment --------- Co-authored-by: Tom Nicholas <[email protected]>

* release summary * move some notes to correct version

shoyer and others added 4 commits October 13, 2024 14:03

Rename inherited -> inherit in DataTree.to_dataset (#9615)

93b4859

* Rename inherited -> inherit in DataTree.to_dataset * fixed one missed instance of kwarg from #9602 --------- Co-authored-by: Tom Nicholas <[email protected]>

pull bot added the ⤵️ pull label Oct 14, 2024

kmuehlbauer and others added 25 commits October 14, 2024 15:52

pin mypy to 1.11.2 (#9621)

017279c

map_over_subtree -> map_over_datasets (#9622)

33ead65

docs(groupby): mention deprecation of squeeze kwarg (#9625)

c3dabe1

As mentioned in #2157, the docstring of `Dataset.groupby` does not reflect deprecation of squeeze (as the docstring of `DataArray.groupby` does) and states an incorrect default value.

Add missing memo argument to DataTree.__deepcopy__ (#9631)

aafc278

as suggested by @headtr1ck in #9628 (comment)

Type check datatree tests (#9632)

88a95cf

* type hints for datatree ops tests * type hints for datatree aggregations tests * type hints for datatree indexing tests * type hint a lot more tests * more type hints

DOC: Clarify error message in open_dataarray (#9637)

3c01ced

If the file is empty (or contains no variables matching any filtering done by the backend), use a different error message indicating that, rather than suggesting that the file has too many variables for this function.

Support alternative names for the root node in DataTree.from_dict (#9638)

7046255

Fix error and missing code cell in io.rst (#9641)

1e579fb

* Fix error and probably missing code cell in io.rst * Make this even simpler, remove link to same section

Replace black and blackdoc with ruff-format (#9506)

b9780e7

* Replace black with ruff-format * Fix formatting mistakes moving mypy comments * Replace black with ruff in the contributing guides

fix zarr intersphinx (#9652)

8f6e45b

Update Datatree html repr to indicate inheritance (#9633)

cfaa72f

* _inherited_vars -> inherited_vars * implementation using Coordinates * datatree.DataTree -> xarray.DataTree * only show inherited coordinates on root * test that there is an Inherited coordinates header

flox: Properly propagate multiindex (#9649)

01831a4

* flox: Properly propagate multiindex Closes #9648 * skip test on old pandas * small optimization * fix

Fix multiple grouping with missing groups (#9650)

df87f69

* Fix multiple grouping with missing groups Closes #9360 * Small repr improvement * Small optimization in mask * Add whats-new * fix doctests

Change URL for pydap test (#9655)

ed32ba7

Add close() method to DataTree and use it to clean-up open files in t…

863184d

…ests (#9651) * Add close() method to DataTree and clean-up open files in tests This removes a bunch of warnings that were previously issued in unit-tests. * Unit tests for closing functionality

Update to_dataframe doc to match current behavior (#9662)

4798707

keewis and others added 9 commits October 24, 2024 11:15

fix(zarr): use inplace array.resize for zarr 2 and 3 (#9673)

5b2e6f1

Co-authored-by: Tom Nicholas <[email protected]>

v2024.10.0 release summary (#9678)

165177e

* release summary * move some notes to correct version

new blank whatsnew (#9679)

5be821b

Fix inadvertent deep-copying of child data in DataTree (#9684)

dbb98b4
