huglin_index causes heavy dataset fragmentation #1494

Closed · fjetter opened this issue Oct 6, 2023 · 3 comments · Fixed by #1495

fjetter commented Oct 6, 2023

Generic Issue

  • xclim version: latest
  • Python version: 3.10
  • Operating System: any

Description

Disclaimer: I'm a dask engineer and don't have any knowledge about the domain. I don't know what a huglin_index is.

I'm currently investigating a performance issue with huglin_index (and possibly other functions as well).

Consider the following example input dataset:

import xarray as xr
import dask.array as da
import pandas as pd
import xclim
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

ds_fake = xr.Dataset(
    {
        var: xr.DataArray(
            da.random.random((23741, 600, 1440), chunks=(23741, 15, 15)),
            dims=("time", "lat", "lon"),
            attrs={"units": "K"},
        )
        for var in ["tas", "tasmax"]
    },
    coords={
        "time": pd.date_range("1950", periods=23741, freq="D"),
        "lat": range(600),
    },
)
ds_fake.lat.attrs["units"] = "degrees_north"

which starts off with small but reasonable chunks:

[screenshot: dask chunk-structure repr of ds_fake]

After the huglin_index is applied

xclim.set_options(data_validation='log')
xclim.atmos.huglin_index(tas=ds_fake.tas, tasmax=ds_fake.tasmax, lat=ds_fake.lat)

This generates a highly fragmented dataset with hundreds of thousands of KiB-sized chunks:

[screenshot: dask chunk-structure repr of the huglin_index output]

This heavy fragmentation causes the computation to perform poorly, since every task comes with a certain overhead that is typically much larger than what processing 1 KiB of data requires. (This is true even if the computation is not run on a distributed cluster; working on a simple thread pool also comes with overhead.)

I haven't found out where this fragmentation is coming from, but I suspect there is some internal step that can be optimized to reduce the number of tasks and keep intermediate (and output) chunks reasonably large.

From a dask perspective, it is also interesting that this generates 500+ graph layers. That is not necessarily a problem in itself, but I suspect it is related to the heavy fragmentation.
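
For reference, one quick way to quantify the fragmentation and the layer count is to inspect the dask graph of the result directly. This is only a sketch: it reuses the ds_fake dataset from the example above, and the graph-inspection attributes assume a reasonably recent dask version.

import math

result = xclim.atmos.huglin_index(tas=ds_fake.tas, tasmax=ds_fake.tasmax, lat=ds_fake.lat)
darr = result.data  # underlying dask array of the output DataArray

n_chunks = math.prod(darr.numblocks)   # total number of output chunks
n_tasks = len(darr.__dask_graph__())   # tasks in the full graph
n_layers = len(darr.dask.layers)       # high-level graph layers
print(f"{n_chunks} chunks, {n_tasks} tasks, {n_layers} layers")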

Code of Conduct

  • I agree to follow this project's Code of Conduct
aulemahal (Collaborator) commented Oct 6, 2023

Hi @fjetter and thanks for the issue!

This type of fragmentation is typical of xclim's climate indices because we use xarray's resample(time=...).map(...). If the particular operation being mapped isn't already optimized internally by xarray, you get one chunk per output period. In the case of huglin_index, the default value for freq is 'YS', so you get one chunk per year. Furthermore, that particular index relies on an internal function (xclim.indices.generic.aggregate_between_dates) which explicitly loops over the resampling groups.
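
As a rough illustration of that behaviour, here is a sketch using plain xarray and dask.array (not xclim internals): a generic reduction mapped over resample(time="YS") produces one output chunk per resampling period.

import dask.array as da
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=3 * 365, freq="D")
arr = xr.DataArray(
    da.random.random((len(time), 10), chunks=(len(time), 10)),
    dims=("time", "x"),
    coords={"time": time},
)

# Map a generic reduction over yearly groups, similar to what the indicators do.
out = arr.resample(time="YS").map(lambda g: g.mean("time"))
print(out.chunks)  # one chunk per resampled year along the time dimension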

That being said, of course there is room for optimization in xclim! For most of the indexes, I would have redirected you towards xarray, because that's usually where the crunchy part occurs. And for resampling optimizations, that happens in flox. However, in this case, there might be something to do with aggregate_between_dates. Or at least, maybe there's a simpler implementation of the huglin_index possible without that complex function. I'll look into it.

But overall, unless a dask engineer comes in and saves the day, I'm not sure how we can change the behaviour you're seeing. I.e. how to perform a "resampling-map" with a generic function without fragmenting to 1 chunk per period.

For better performance, I would rather recommend using larger chunks in space (lat, lon) and smaller ones in time for your input data. This way, the inevitable fragmentation will have a smaller impact.
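
For example, something along these lines. This is a sketch reusing the ds_fake dataset from above; the chunk sizes are illustrative only, not tuned recommendations.

ds_rechunked = ds_fake.chunk({"time": 4 * 365, "lat": 150, "lon": 360})
out = xclim.atmos.huglin_index(
    tas=ds_rechunked.tas, tasmax=ds_rechunked.tasmax, lat=ds_rechunked.lat
)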

fjetter (Author) commented Oct 6, 2023

Thanks for your detailed response!

But overall, unless a dask engineer comes in and saves the day, I'm not sure how we can change the behaviour you're seeing. I.e. how to perform a "resampling-map" with a generic function without fragmenting to 1 chunk per period.

I have to think about this a bit and look into it. My initial gut reaction is that this should be easy but that's often what I say before diving into something I should never have tried in the first place :)

I'll look into aggregate_between_dates as well to get a better understanding of what's going on.

For better performance, I would rather recommend using larger chunks in space (lat, lon) and smaller ones in time for your input data. This way, the inevitable fragmentation will have a smaller impact.

I think that's a good (and in hindsight obvious) intermediate step, thank you!

@Zeitsperre (Collaborator)

Hi @fjetter, thanks for looking into this.

Briefly, the Huglin Index is one of many scoring systems for classifying the climate of winegrowing regions. These indices are in xclim because of my previous research looking at climate change impacts on vineyards, and because coding them was good practice for me to learn the basics of xarray. This one was added a few years ago, so I'm not at all surprised that the implementation is less than optimal. It's likely that you'll find the same problems in winkler_index and latitude_temperature_index as well.

Please don't hesitate to open a Pull Request (or draft) if you have some suggestions on how to better implement either the index or the generic components that build it up. We are fine with breaking changes if there is a clear improvement to be had.

All the best!
