Skip to content

Commit

Permalink
Fix lazy indexing (#1484)
Browse files Browse the repository at this point in the history
<!--Please ensure the PR fulfills the following requirements! -->
<!-- If this is your first PR, make sure to add your details to the
AUTHORS.rst! -->
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #1483
- [x] Tests for the changes have been added (for bug fixes / features)
- [ ] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGES.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?

* `lazy_indexing` will drop any auxiliary coords it find on the
intermediate array that was triggering computation. The output is not
affected.
* New `xclim.testing.helpers.assert_lazy`. A context manager to ensure
the code block is not triggering any computation. I might take time to
add it to other tests in another PR.
* ~Took the opportunity to fix a deprecation warning coming out of
`importlib.resources`. The way we opened and searched for module data
files was deprecated.~ Oups, I realized this deprecation concerns python
>= 3.9, but we still support python 3.8. This change doesn't justify
pinning a new python.

### Does this PR introduce a breaking change?
No.

### Other information:
  • Loading branch information
aulemahal committed Oct 11, 2023
2 parents 3e8a82b + 36f0a0c commit e53c125
Show file tree
Hide file tree
Showing 4 changed files with 31 additions and 4 deletions.
2 changes: 2 additions & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Bug fixes
* Coincidentally, this also fixes an error that caused `pytest` to error-out when invoked without an active internet connection. Running `pytest` without network access is now supported (requires cached testing data). (:issue:`1468`).
* Calling a ``sdba.map_blocks``-wrapped function with data chunked along the reduced dimensions will raise an error. This forbids chunking the trained dataset along the distribution dimensions, for example. (:issue:`1481`, :pull:`1482`).
* Optimization of indicators ``huglin_index`` and ``biologically_effective_degree_days`` when used with dask and flox. As a side effect, the indice functions (i.e. under ``xc.indices``) no longer mask incomplete periods. The indicators' output is unchanged under the default "check_missing" setting (:issue:`1494`, :pull:`1495`).
* Fixed ``xclim.indices.run_length.lazy_indexing`` which would sometimes trigger the loading of auxiliary coordinates. (:issue:`1483`, :pull:`1484`).

Breaking changes
^^^^^^^^^^^^^^^^
Expand All @@ -39,6 +40,7 @@ Internal changes
* Mastodon publishing now uses `chuhlomin/render-template <https://github.com/chuhlomin/render-template>`_ and a standard formatting markdown document to format Mastodon toots. (:pull:`1469`).
* GitHub testing workflows now use `Concurrency` instead of the styfle/cancel-workflow-action to cancel redundant workflows. (:pull:`1487`).
* The `pkg_resources` library has been replaced for the `packaging` library when version comparisons have been performed, and a few warning messages have been silenced in the testing suite. (:issue:`1489`, :pull:`1490`).
* New ``xclim.testing.helpers.assert_lazy`` context manager to assert the laziness of code blocks. (:pull:`1484`).

v0.45.0 (2023-09-05)
--------------------
Expand Down
7 changes: 6 additions & 1 deletion tests/test_run_length.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@

from xclim.core.options import set_options
from xclim.indices import run_length as rl
from xclim.testing.helpers import assert_lazy

K2C = 273.15

Expand Down Expand Up @@ -654,14 +655,18 @@ def test_too_many_dates(self, func, tas_series):
@pytest.mark.parametrize("use_dask", [True, False])
def test_lazy_indexing(use_dask):
idx = xr.DataArray([[0, 10], [33, 99]], dims=("x", "y"))
idx = idx.assign_coords(x2=idx.x**2)
da = xr.DataArray(np.arange(100), dims=("time",))
db = xr.DataArray(-np.arange(100), dims=("time",))

if use_dask:
idx = idx.chunk({"x": 1})

# Ensure tasks are different
outa, outb = compute(rl.lazy_indexing(da, idx), rl.lazy_indexing(db, idx))
with assert_lazy:
outa = rl.lazy_indexing(da, idx)
outb = rl.lazy_indexing(db, idx)
outa, outb = compute(outa, outb)

assert set(outa.dims) == {"x", "y"}
np.testing.assert_array_equal(idx, outa)
Expand Down
13 changes: 10 additions & 3 deletions xclim/indices/run_length.py
Original file line number Diff line number Diff line change
Expand Up @@ -1345,15 +1345,22 @@ def _index_from_1d_array(indices, array):
# The 0-D index case, we add a dummy dimension to help dask
dim = get_temp_dimname(da.dims, "x")
index = index.expand_dims(dim)
invalid = index.isnull() # Which indexes to mask
# Which indexes to mask.
invalid = index.isnull()
# NaN-indexing doesn't work, so fill with 0 and cast to int
index = index.fillna(0).astype(int)
# for each chunk of index, take corresponding values from da

da2 = da.rename("__placeholder__")
out = index.map_blocks(_index_from_1d_array, args=(da2,)).rename(da.name)
# mask where index was NaN
out = out.where(~invalid)
# mask where index was NaN. Drop any auxiliary coord, they are already on `out`.
# Chunked aux coord would have the same name on both sides and xarray will want to check if they are equal, which means loading them
# making lazy_indexing not lazy.
out = out.where(
~invalid.drop_vars(
[crd for crd in invalid.coords if crd not in invalid.dims]
)
)
if idx_ndim == 0:
# 0-D case, drop useless coords and dummy dim
out = out.drop_vars(da.dims[0]).squeeze()
Expand Down
13 changes: 13 additions & 0 deletions xclim/testing/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
import numpy as np
import pandas as pd
import xarray as xr
from dask.diagnostics import Callback
from yaml import safe_load

from xclim.core import calendar
Expand Down Expand Up @@ -66,6 +67,7 @@
"PREFETCH_TESTING_DATA",
"TESTDATA_BRANCH",
"add_example_file_paths",
"assert_lazy",
"generate_atmos",
"populate_testing_data",
"test_timeseries",
Expand Down Expand Up @@ -253,3 +255,14 @@ def test_timeseries(
return da.to_dataset()
else:
return da


def _raise_on_compute(dsk: dict):
"""Raise an AssertionError mentionning the number triggered tasks."""
raise AssertionError(
f"Not lazy. Computation was triggered with a graph of {len(dsk)} tasks."
)


assert_lazy = Callback(start=_raise_on_compute)
"""Context manager that raises an AssertionError if any dask computation is triggered."""

0 comments on commit e53c125

Please sign in to comment.