`huglin_index` causes heavy dataset fragmentation #1494
Comments
Hi @fjetter and thanks for the issue! This type of fragmentation is usual for climate indices of xclim because we use xarray's `resample`. That being said, there is of course room for optimization in `huglin_index` itself. But overall, unless a dask engineer comes in and saves the day, I'm not sure how we can change the behaviour you're seeing, i.e. how to perform a "resampling-map" with a generic function without fragmenting to one chunk per period. For better performance, I would rather recommend using larger chunks in space (`lat`, `lon`) and a single chunk along `time`.
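A minimal sketch of that recommendation (assuming a dataset `ds` with `time`, `lat` and `lon` dimensions; the spatial chunk sizes here are illustrative, not prescriptive):

```python
# One chunk along time (-1), larger chunks in space.
ds = ds.chunk({"time": -1, "lat": 90, "lon": 90})
```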
Thanks for your detailed response!
I have to think about this a bit and look into it. My initial gut reaction is that this should be easy, but that's often what I say before diving into something I should never have tried in the first place :) I'll look into it.
I think that's a good (and in hindsight obvious) intermediate step, thank you!
Hi @fjetter, thanks for looking into this. Briefly, the Huglin Index is one of many scoring systems for classifying winegrowing regional climate. These indices are in xclim because of my previous research looking at climate change impacts on vineyards, and because coding them was good practice for me to learn the basics. Please don't hesitate to open a Pull Request (or draft) if you have suggestions on how to better implement either the index or the generic components that build it up. We are fine with breaking changes if there is a clear improvement to be had. All the best!
Generic Issue
Description
Disclaimer: I'm a dask engineer and don't have any knowledge about the domain; I don't know what a `huglin_index` is.

I'm currently investigating a performance issue with `huglin_index` (and possibly other functions as well). Consider the following example input dataset, which starts off with small but reasonable chunks.
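A minimal sketch of such an input (names, sizes and chunk shapes are assumptions, not the original snippet):

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("1991-01-01", "2020-12-31", freq="D")
lat = np.linspace(-90, 90, 180)
lon = np.linspace(-180, 180, 360)

def temp_var(offset=0.0):
    # Lazy random daily temperatures; small, regular chunks in time and space.
    data = da.random.random((time.size, lat.size, lon.size), chunks=(365, 90, 90))
    return xr.DataArray(
        data * 30.0 + offset,
        dims=("time", "lat", "lon"),
        coords={"time": time, "lat": lat, "lon": lon},
        attrs={"units": "degC"},
    )

ds = xr.Dataset({"tas": temp_var(), "tasmax": temp_var(offset=5.0)})
print(ds.tas.data.chunksize)  # (365, 90, 90) -- roughly 23 MiB per chunk
```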
After `huglin_index` is applied (sketched below), the result is a highly fragmented dataset of hundreds of thousands of KiB-sized chunks.
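A sketch of the call, assuming `xclim.indices.huglin_index` accepts `tas`, `tasmax` and `lat` arguments; the comments paraphrase the fragmentation reported above rather than reproduce an actual run:

```python
import xclim

# Assumed signature: mean temperature, max temperature, latitude.
out = xclim.indices.huglin_index(tas=ds.tas, tasmax=ds.tasmax, lat=ds.lat)

# The result is lazy; inspect the chunking without computing anything.
print(out.data.npartitions)  # very large: the reported fragmentation
print(out.data.chunksize)    # tiny chunks, on the order of KiB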
size chunksThis heavy fragmentation causes this computation to not perform well since every task comes with a certain overhead that is typically much larger than what the processing of 1KiB of data requires (This is true even if this computation is not performed on a distributed cluster. working on a simple threadpool also comes with overhead).
I haven't found out where this fragmentation is coming from, but I suspect there is some internal step that can be optimized to reduce the number of tasks and keep intermediate (and output) chunks reasonably large.
From a dask perspective, it is also interesting that this generates 500+ layers. That is not necessarily a problem in itself, but I suspect it is related to the heavy fragmentation.
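A way to see the layer count, using dask's `HighLevelGraph` interface (`out` being the lazy result from the sketch above; the 500+ figure is from the report, not reproduced here):

```python
# Each xarray/dask operation adds layers to the high-level graph;
# a 500+ layer graph hints at many small intermediate steps.
hlg = out.data.__dask_graph__()
print(len(hlg.layers))                                   # number of layers
print(sum(len(layer) for layer in hlg.layers.values()))  # total task count
```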