huglin_index causes heavy dataset fragmentation #1494

Closed · fjetter opened this issue Oct 6, 2023 · 3 comments · Fixed by #1495

fjetter commented Oct 6, 2023

Generic Issue

  • xclim version: latest
  • Python version: 3.10
  • Operating System: any

Description

Disclaimer: I'm a dask engineer and don't have any knowledge about the domain. I don't know what a huglin_index is.

I'm currently investigating a performance issue with huglin_index (and possibly other functions as well).

Consider the following example input dataset:

import xarray as xr
import dask.array as da
import pandas as pd
import xclim
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

ds_fake = xr.Dataset(
    {
        var: xr.DataArray(
            da.random.random((23741, 600, 1440), chunks=(23741, 15, 15)),
            dims=("time", "lat", "lon"),
            attrs={"units": "K"},
        )
        for var in ["tas", "tasmax"]
    },
    coords={
        "time": pd.date_range("1950", periods=23741, freq="D"),
        "lat": range(600),
    },
)
ds_fake.lat.attrs["units"] = "degrees_north"

which starts off with small but reasonable chunks:

[screenshot: dask chunk-structure repr of ds_fake]

After the huglin_index is applied

xclim.set_options(data_validation='log')
xclim.atmos.huglin_index(tas=ds_fake.tas, tasmax=ds_fake.tasmax, lat=ds_fake.lat)

This generates a highly fragmented dataset with hundreds of thousands of KiB-sized chunks:

[screenshot: dask chunk-structure repr of the huglin_index output]

This heavy fragmentation causes the computation to perform poorly, since every task comes with a certain overhead that is typically much larger than what processing 1 KiB of data requires. (This is true even if the computation is not run on a distributed cluster; working on a simple thread pool also comes with overhead.)

I haven't found out where this fragmentation is coming from, but I suspect there is some internal step that can be optimized to reduce the number of tasks and keep intermediate (and output) chunks reasonably large.

From a dask perspective, it is also interesting that this generates 500+ graph layers. That is not necessarily a problem in itself, but I suspect it is related to the heavy fragmentation.
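
For reference, one quick way to quantify the fragmentation and the layer count is to inspect the dask graph of the result directly. This is only a sketch: it reuses the ds_fake dataset from the example above, and the graph-inspection attributes assume a reasonably recent dask version.

import math

result = xclim.atmos.huglin_index(tas=ds_fake.tas, tasmax=ds_fake.tasmax, lat=ds_fake.lat)
darr = result.data  # underlying dask array of the output DataArray

n_chunks = math.prod(darr.numblocks)   # total number of output chunks
n_tasks = len(darr.__dask_graph__())   # tasks in the full graph
n_layers = len(darr.dask.layers)       # high-level graph layers
print(f"{n_chunks} chunks, {n_tasks} tasks, {n_layers} layers")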

Code of Conduct

  • I agree to follow this project's Code of Conduct
aulemahal (Collaborator) commented Oct 6, 2023

Hi @fjetter and thanks for the issue!

This type of fragmentation is typical of xclim's climate indices because we use xarray's resample(time=...).map(...). If the particular operation being mapped isn't already optimized internally by xarray, you get one chunk per output period. In the case of huglin_index, the default value for freq is 'YS', so you get one chunk per year. Furthermore, that particular index relies on an internal function (xclim.indices.generic.aggregate_between_dates) which explicitly loops over the resampling groups.
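
As a rough illustration of that behaviour, here is a sketch using plain xarray and dask.array (not xclim internals): a generic reduction mapped over resample(time="YS") produces one output chunk per resampling period.

import dask.array as da
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=3 * 365, freq="D")
arr = xr.DataArray(
    da.random.random((len(time), 10), chunks=(len(time), 10)),
    dims=("time", "x"),
    coords={"time": time},
)

# Map a generic reduction over yearly groups, similar to what the indicators do.
out = arr.resample(time="YS").map(lambda g: g.mean("time"))
print(out.chunks)  # one chunk per resampled year along the time dimension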

That being said, of course there is room for optimization in xclim! For most of the indexes, I would have redirected you towards xarray, because that's usually where the crunchy part occurs. And for resampling optimizations, that happens in flox. However, in this case, there might be something to do with aggregate_between_dates. Or at least, maybe there's a simpler implementation of the huglin_index possible without that complex function. I'll look into it.

But overall, unless a dask engineer comes in and saves the day, I'm not sure how we can change the behaviour you're seeing. I.e. how to perform a "resampling-map" with a generic function without fragmenting to 1 chunk per period.

For better performance, I would rather recommend using larger chunks in space (lat, lon) and smaller ones in time for your input data. This way, the inevitable fragmentation will have a smaller impact.
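
For example, something along these lines. This is a sketch reusing the ds_fake dataset from above; the chunk sizes are illustrative only, not tuned recommendations.

ds_rechunked = ds_fake.chunk({"time": 4 * 365, "lat": 150, "lon": 360})
out = xclim.atmos.huglin_index(
    tas=ds_rechunked.tas, tasmax=ds_rechunked.tasmax, lat=ds_rechunked.lat
)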

fjetter (Author) commented Oct 6, 2023

Thanks for your detailed response!

But overall, unless a dask engineer comes in and saves the day, I'm not sure how we can change the behaviour you're seeing. I.e. how to perform a "resampling-map" with a generic function without fragmenting to 1 chunk per period.

I have to think about this a bit and look into it. My initial gut reaction is that this should be easy but that's often what I say before diving into something I should never have tried in the first place :)

I'll look into aggregate_between_dates as well to get a better understanding of what's going on.

For better performance, I would rather recommend using larger chunks in space (lat, lon) and smaller ones in time for your input data. This way, the inevitable fragmentation will have a smaller impact.

I think that's a good (and in hindsight obvious) intermediate step, thank you!

@Zeitsperre (Collaborator)

Hi @fjetter, thanks for looking into this.

Briefly, the Huglin Index is one of many scoring systems for classifying the climate of winegrowing regions. These indices are in xclim because of my previous research looking at climate change impacts on vineyards, and because coding them was good practice for me to learn the basics of xarray. This one was added a few years ago, so I'm not at all surprised that the implementation is less than optimal. It's likely that you'll find the same problems in winkler_index and latitude_temperature_index as well.

Please don't hesitate to open a Pull Request (or draft) if you have some suggestions on how to better implement either the index or the generic components that build it up. We are fine with breaking changes if there is a clear improvement to be had.

All the best!
