[pull] main from pydata:main #582

Open: wants to merge 38 commits into base: main

Conversation

pull[bot]

@pull pull bot commented Oct 14, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

shoyer and others added 4 commits October 13, 2024 14:03
* Reimplement DataTree aggregations

They now allow for dimensions that are missing on particular nodes, and
use Xarray's standard generate_aggregations machinery, like aggregations
for DataArray and Dataset.

Fixes #8949, #8963

* add API docs on DataTree aggregations

* remove incorrectly added sel methods

* fix docstring reprs

* mypy fix

* fix self import

* remove unimplemented agg methods

* replace dim_arg_to_dims_set with parse_dims

* add parse_dims_as_set

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix mypy errors

* change tests to match slightly different error now thrown

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: TomNicholas <[email protected]>
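
To illustrate what "dimensions that are missing on particular nodes" could mean in practice, here is a minimal pure-Python sketch. It is not xarray's implementation: the helper `dims_to_reduce`, the flat `tree_dims` mapping, and the choice to raise only when a dimension is absent from the whole tree are all assumptions made for illustration.

```python
def dims_to_reduce(requested, node_dims, all_dims):
    """Return the requested dimensions that actually exist on one node.

    Hypothetical helper: dimensions missing on this node are skipped,
    but a dimension found nowhere in the tree is treated as an error.
    """
    requested = set(requested)
    unknown = requested - set(all_dims)
    if unknown:
        raise ValueError(f"dimensions {sorted(unknown)} not found in the tree")
    return requested & set(node_dims)


# Hypothetical tree, described only by the dimensions present on each node.
tree_dims = {
    "/": {"time"},
    "/coarse": {"time", "x"},
    "/fine": {"time", "x", "y"},
}
all_dims = set().union(*tree_dims.values())

# Reducing over ("x", "y") touches only the dimensions each node has.
per_node = {
    path: dims_to_reduce({"x", "y"}, dims, all_dims)
    for path, dims in tree_dims.items()
}
```

Under these assumptions the root node is left unreduced, `/coarse` is reduced over `x` only, and `/fine` over both `x` and `y`.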
…ee function (#9614)

* updating group type annotation for netcdf, hdf5, and zarr open_datatree function

* supporting only  in group type annotation for netcdf, hdf5, and zarr open_datatree function

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Rename inherited -> inherit in DataTree.to_dataset

* fixed one missed instance of kwarg from #9602

---------

Co-authored-by: Tom Nicholas <[email protected]>
* remove too-long underline

* draft section on data alignment

* fixes

* draft section on coordinate inheritance

* various improvements

* more improvements

* link from other page

* align call include all 3 datasets

* link back to use cases

* clarification

* small improvements

* remove TODO after #9532

* add todo about #9475

* correct xr.align example call

* add links to netCDF4 documentation

* Consistent voice

Co-authored-by: Maximilian Roos <[email protected]>

* keep indexes in lat lon selection to dodge #9475

* unpack generator properly

Co-authored-by: Stephan Hoyer <[email protected]>

* ideas for next section

* briefly summarize what alignment means

* clarify that it's the data in each node that was previously unrelated

* fix incorrect indentation of code block

* display the tree with redundant coordinates again

* remove content about non-inherited coords for a follow-up PR

* remove todo

* remove todo now that aggregations are re-implemented

* remove link to (unmerged) migration guide

* remove todo about improving error message

* correct statement in data-structures docs

* fix internal link

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Stephan Hoyer <[email protected]>
@pull pull bot added the ⤵️ pull label Oct 14, 2024
kmuehlbauer and others added 25 commits October 14, 2024 15:52
* test unary op

* implement and generate unary ops

* test for unary op with inherited coordinates

* re-enable arithmetic tests

* implementation for binary ops

* test ds * dt commutativity

* ensure other types defer to DataTree, thus fixing #9365

* test for inplace binary op

* pseudocode implementation of inplace binary op, and xfail test

* remove some unneeded type: ignore comments

* return type should be DataTree

* type datatree ops as accepting dataset-compatible types too

* use same type hinting hack as Dataset does for __eq__ not being same as Mapping

* ignore return type

* add some methods to api docs

* don't try to import DataTree.astype in API docs

* test to check that single-node trees aren't broadcast

* return NotImplemented

* remove pseudocode for inplace binary ops

* map_over_subtree -> map_over_datasets
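
The "defer to DataTree" and "return NotImplemented" items above rely on Python's reflected-operand protocol. The sketch below shows that protocol with two hypothetical classes, `Leaf` and `Tree`; they are stand-ins for illustration and not xarray's `Dataset` or `DataTree`.

```python
class Leaf:
    """Hypothetical scalar-like object that defers to Tree in binary ops."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        if isinstance(other, Tree):
            # Returning NotImplemented makes Python try Tree.__rmul__ next.
            return NotImplemented
        if isinstance(other, Leaf):
            return Leaf(self.value * other.value)
        return NotImplemented


class Tree:
    """Hypothetical container applying an operation to every leaf."""

    def __init__(self, leaves):
        self.leaves = dict(leaves)

    def __mul__(self, other):
        if isinstance(other, Leaf):
            return Tree({p: leaf * other for p, leaf in self.leaves.items()})
        return NotImplemented

    def __rmul__(self, other):
        # Reached when the left operand returned NotImplemented.
        if isinstance(other, Leaf):
            return Tree({p: other * leaf for p, leaf in self.leaves.items()})
        return NotImplemented


tree = Tree({"/a": Leaf(2), "/b": Leaf(3)})
left = Leaf(10) * tree   # handled by Tree.__rmul__
right = tree * Leaf(10)  # handled by Tree.__mul__
```

Both expressions produce a `Tree`, which is the kind of commutativity the `ds * dt` test in this commit checks for.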
* sketch of migration guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* whatsnew

* add date

* spell out API changes in more detail

* details on backends integration

* explain alignment and open_groups

* explain coordinate inheritance

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* re-trigger CI

* remove bullet about map_over_subtree

* Markdown formatting for important warning block

Co-authored-by: Matt Savoie <[email protected]>

* Reorder changes in order of importance

Co-authored-by: Matt Savoie <[email protected]>

* Clearer wording on setting relationships

Co-authored-by: Matt Savoie <[email protected]>

* remove "technically"

Co-authored-by: Matt Savoie <[email protected]>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matt Savoie <[email protected]>
As mentioned in #2157, the docstring of `Dataset.groupby` does not
reflect deprecation of squeeze (as the docstring of `DataArray.groupby`
does) and states an incorrect default value.
* Add inherit=False option to DataTree.copy()

This PR adds an inherit=False option to DataTree.copy, so users can
decide if they want to inherit coordinates from parents or not when
creating a subtree.

The default behavior is `inherit=True`, which is a breaking change from
the current behavior where parent coordinates are dropped (which I
believe should be considered a bug).

* fix typing

* add migration guide note

* ignore typing error
* Bug fixes for DataTree indexing and aggregation

My implementation of indexing and aggregation was incorrect on child
nodes, re-creating the child nodes from the root.

There was also another bug when indexing inherited coordinates that meant
formerly inherited coordinates were incorrectly dropped from results.

* disable broken test
* type hints for datatree ops tests

* type hints for datatree aggregations tests

* type hints for datatree indexing tests

* type hint a lot more tests

* more type hints
* Add zip_subtrees for paired iteration over DataTrees

This should be used for implementing DataTree arithmetic inside
map_over_datasets, so the result does not depend on the order in which
child nodes are defined.

I have also added a minimal implementation of breadth-first-search with
an explicit queue to replace the current recursion-based solution in
xarray.core.iterators (which has been removed). The new implementation
is also slightly faster in my microbenchmark:

    In [1]: import xarray as xr

    In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

    In [3]: %timeit _ = list(tree.subtree)
    # on main
    87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    # with this branch
    55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

* fix pytype error

* Tweaks per review
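
For readers unfamiliar with the technique mentioned in this commit, a breadth-first traversal with an explicit queue can be sketched as follows. The `Node` class and `iter_breadth_first` function are hypothetical stand-ins, not the code added to xarray.

```python
from collections import deque


class Node:
    """Hypothetical tree node with named children (not xarray's DataTree)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = {child.name: child for child in children}


def iter_breadth_first(root):
    """Yield nodes level by level using a FIFO queue instead of recursion."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        # Children are visited only after every node already in the queue.
        queue.extend(node.children.values())


tree = Node("root", [Node("a", [Node("c")]), Node("b")])
order = [node.name for node in iter_breadth_first(tree)]
```

Here `order` lists the root first, then both of its children, then the grandchild. An explicit queue also avoids Python's recursion limit on very deep trees.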
If the file is empty (or contains no variables matching any filtering done by the backend), use a different error message indicating that, rather than suggesting that the file has too many variables for this function.
* Updates to DataTree.equals and DataTree.identical

In contrast to `equals`, `identical` now also checks that any
inherited variables are inherited on both objects. However, they do
not need to be inherited from the same source. This aligns the
behavior of `identical` with the DataTree `__repr__`.

I've also removed the `from_root` argument from `equals` and `identical`.
If a user wants to compare trees from their roots, a better (simpler)
interface is to simply call these methods on the `.root` properties.
I would also like to remove the `strict_names` argument, but that will
require switching to use the new `zip_subtrees` (#9623) first.

* More efficient check for inherited coordinates
* Fix error and probably missing code cell in io.rst

* Make this even simpler, remove link to same section
* Replace black with ruff-format

* Fix formatting mistakes moving mypy comments

* Replace black with ruff in the contributing guides
* Re-implement map_over_datasets

The main changes:

- It is implemented using zip_subtrees, which means it should properly
  handle DataTrees where the nodes are defined in a different order.
- For simplicity, I removed handling of `**kwargs`, in order to preserve
  some flexibility for adding keyword arguments.
- I removed automatic skipping of empty nodes, because there are almost
  assuredly cases where that would make sense. This could be restored
  with an optional keyword argument.

* fix typing of map_over_datasets

* add group_subtrees

* wip fixes

* update isomorphic

* documentation and API change for map_over_datasets

* mypy fixes

* fix test

* diff formatting

* more mypy

* doc fix

* more doc fix

* add api docs

* add utility for joining path on windows

* docstring

* add an overload for two return values from map_over_datasets

* partial fixes per review

* fixes per review

* remove a couple of xfails
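
The order-independence described in this commit can be sketched by pairing nodes on their paths rather than on the order in which they were defined. This is a simplified illustration only: trees are modelled as flat dictionaries from path to value, and `zip_by_path` and `map_over_nodes` are hypothetical names, not xarray's `zip_subtrees` or `map_over_datasets`.

```python
def zip_by_path(*trees):
    """Yield (path, values) pairs, matching nodes by path, not by order."""
    first, *rest = trees
    for other in rest:
        if set(other) != set(first):
            raise ValueError("trees do not share the same node paths")
    for path in sorted(first):
        yield path, tuple(tree[path] for tree in trees)


def map_over_nodes(func, *trees):
    """Apply func to the matching nodes of every tree."""
    return {path: func(*values) for path, values in zip_by_path(*trees)}


# The same nodes, defined in a different order in each tree.
left = {"/": 1, "/a": 2, "/b": 3}
right = {"/b": 30, "/a": 20, "/": 10}

result = map_over_nodes(lambda x, y: x + y, left, right)
```

Because matching is done on paths, swapping the definition order of `/a` and `/b` in either input does not change `result`.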
* _inherited_vars -> inherited_vars

* implementation using Coordinates

* datatree.DataTree -> xarray.DataTree

* only show inherited coordinates on root

* test that there is an Inherited coordinates header
* flox: Properly propagate multiindex

Closes #9648

* skip test on old pandas

* small optimization

* fix
* Fix multiple grouping with missing groups

Closes #9360

* Small repr improvement

* Small optimization in mask

* Add whats-new

* fix doctests
…ests (#9651)

* Add close() method to DataTree and clean-up open files in tests

This removes a bunch of warnings that were previously issued in
unit-tests.

* Unit tests for closing functionality
…ap_blocks`` (#9658)

* Reduce graph size through writing indexes directly into graph for map_blocks

* Reduce graph size through writing indexes directly into graph for map_blocks

* Update xarray/core/parallel.py

---------

Co-authored-by: Deepak Cherian <[email protected]>
* Remove zarr pin

* Define zarr_v3 helper

* zarr-v3: filters / compressors -> codecs

* zarr-v3: update tests to avoid values equal to fillValue

* Various test fixes

* zarr_version fixes

* removed open_consolidated workarounds
* removed _store_version check
* pass through zarr_version

* fixup! zarr-v3: filters / compressors -> codecs

* fixup! fixup! zarr-v3: filters / compressors -> codecs

* fixup

* path / key normalization in set_variables

* fixes

* workaround nested consolidated metadata

* test: avoid fill_value

* test: Adjust call counts

* zarr-python 3.x Array.resize doesn't mutate

* test compatibility

- skip write_empty_chunks on 3.x
- update patch targets

* skip ZipStore with_mode

* test: more fill_value avoidance

* test: more fill_value avoidance

* v3 compat for instrumented test

* Handle zarr_version / zarr_format deprecation

* wip

* most Zarr tests passing

* unskip tests

* add custom Zarr _FillValue encoding / decoding

* relax dtype comparison in test_roundtrip_empty_vlen_string_array

* fix test_explicitly_omit_fill_value_via_encoding_kwarg

* fix test_append_string_length_mismatch_raises

* fix test_check_encoding_is_consistent_after_append for v3

* skip roundtrip_endian for zarr v3

* unskip datetimes and fix test_compressor_encoding

* unskip tests

* add back dtype skip

* point upstream to v3 branch

* Create temporary directory before using it

* Avoid zarr.storage.zip on v2

* fixed close_store_on_close bug

* Remove workaround, fixed upstream

* Restore original `w` mode.

* workaround for store closing with mode=w

* typing fixes

* compat

* Remove unnecessary pop

* fixed skip

* fixup types

* fixup types

* [test-upstream]

* Update install-upstream-wheels.sh

* set use_consolidated to false when user provides consolidated=False

* fix: import consolidated_metadata from package root

* fix: relax instrumented store checks for v3

* Adjust 2.18.3 thresholds

* skip datatree zarr tests w/ zarr 3 for now

* fixed kvstore usage

* typing fixes

* move zarr.codecs import

* fixup ignores

* storage options fix, skip

* fixed types

* Update ci/install-upstream-wheels.sh

* type fixes

* whats-new

* Update xarray/tests/test_backends_datatree.py

* fix type import

* set mapper, chunk_mapper

* Pass through zarr_format

* Fixup

* more cleanup

* revert test changes

* Update xarray/backends/zarr.py

* cleanup

* update docstring

* fix rtd

* tweak

---------

Co-authored-by: Ryan Abernathey <[email protected]>
Co-authored-by: Joe Hamman <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
keewis and others added 9 commits October 24, 2024 11:15
* support chunking and default values in `open_groups`

* same for `open_datatree`

* use `group_subtrees` instead of `map_over_datasets`

* check that `chunks` on `open_datatree` works

* specify the chunksizes when opening from disk

* check that `open_groups` with chunks works, too

* require dask for `test_open_groups_chunks`

* protect variables from write operations

* copy over `_close` from the backend tree

* copy a lot of the docstring from `open_dataset`

* same for `open_groups`

* reuse `_protect_dataset_variables_inplace`

* final missing `requires_dask`

* typing for the test utils

Co-authored-by: Tom Nicholas <[email protected]>

* type hints for `_protect_datatree_variables_inplace`

Co-authored-by: Tom Nicholas <[email protected]>

* type hints for `_protect_dataset_variables_inplace`

* copy over the name of the backend tree

Co-authored-by: Tom Nicholas <[email protected]>

* typo

* swap the order of arguments to `assert_identical`

* try explicitly typing `data`

* typo

* use `Hashable` for variable names

---------

Co-authored-by: Tom Nicholas <[email protected]>
* implement `compute` and `load`

* also shallow-copy variables

* implement `chunksizes`

* add tests for `load`

* add tests for `chunksizes`

* improve the `load` tests using `DataTree.chunksizes`

* add a test for `compute`

* un-xfail a xpassing test

* implement and test `DataTree.chunk`

* link to `Dataset.load`

Co-authored-by: Tom Nicholas <[email protected]>

* use `tree.subtree` to get absolute paths

* filter out missing dims before delegating to `Dataset.chunk`

* fix the type hints for `DataTree.chunksizes`

* try using `self.from_dict` instead

* type-hint intermediate test variables

* use `_node_dims` instead

* raise on unknown chunk dim

* check that errors in `chunk` are raised properly

* adapt the docstrings of the new methods

* allow computing / loading unchunked trees

* reword the `chunksizes` properties

* also freeze the top-level chunk sizes

* also reword `DataArray.chunksizes`

* fix a copy-paste error

* same for `NamedArray.chunksizes`

---------

Co-authored-by: Tom Nicholas <[email protected]>
* use zarr v3 dimension_names

* Update xarray/backends/zarr.py

Co-authored-by: Stephan Hoyer <[email protected]>

* Update xarray/backends/zarr.py

Co-authored-by: Joe Hamman <[email protected]>

---------

Co-authored-by: Stephan Hoyer <[email protected]>
Co-authored-by: Joe Hamman <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
* adding draft for fixing behaviour for group parameter

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* new trial

* new trial

* fixing duplicate paths and path in the root group

* removing yield str(gpath)

* implementing the proposed solution to hdf5 and netcdf backends

* adding changes to whats-new.rst

* removing encoding['source_group'] line to avoid conflicts with PR #9660

* adding test

* adding test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adding `assert subgroup_tree.root.parent is None`

* modifying tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update xarray/tests/test_backends_datatree.py

Co-authored-by: Justus Magin <[email protected]>

* applying suggested changes

* updating test

* adding Justus and Alfonso to the list of contributors to the DataTree entry

* adding Justus and Alfonso to the list of contributors to the DataTree entry

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: Justus Magin <[email protected]>
* check that the length of fixed-width numpy strings is reset

* drop the length from numpy's fixed-width string dtypes

* compatibility with `numpy<2`

* use `issubdtype` instead

* some more test cases

* more details in the comment

---------

Co-authored-by: Tom Nicholas <[email protected]>
* release summary

* move some notes to correct version