Add xr.apply_ufunc wrapper to fit_gev #58

stellema · 2023-10-24T03:06:52Z

The fit_gev function is now wrapped with xr.apply_ufunc, so fitting data to a distribution (normal or non-stationary version) should work for multi-dim input.

If the input data is a 1D timerseries, the output will just be the expected values of the parameters:

shape, loc, scale = fit_gev(data)

Otherwise, the output will be a data array:

theta = fit_gev(data)
shape, loc, scale = theta.isel(lat=0, lon=0).values

The function assumes that the input data has time on the last axis (e.g., dims=(ensemble, lat, lon, time)) and infers the dimension name from that. However, I'm not sure if that is a common way of ordering dimensions.

Do you think it will be better to have a time_dim='time' input argument and infer the axis position from that or something?

Also, the overall function setup probably isn't great and it might be hard to find the actual docstring:

fit_gev = ns_genextreme_gen().fit

where ns_genextreme_gen().fit is just a wrapper for the actual fitting function,ns_genextreme_gen()._fit.

@DamienIrving and @dougiesquire any suggestions for improvements?

codecov-commenter · 2023-10-24T03:27:20Z

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (c850bea) 31.97% compared to head (f794518) 38.13%.

Files	Patch %	Lines
unseen/general_utils.py	82.60%	8 Missing ⚠️
unseen/fileio.py	0.00%	2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   31.97%   38.13%   +6.15%     
==========================================
  Files          18       19       +1     
  Lines        1720     1820     +100     
==========================================
+ Hits          550      694     +144     
+ Misses       1170     1126      -44

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

DamienIrving · 2023-10-30T04:14:48Z

In terms of the order of dimensions, if the time dimension doesn't literally need to be last you could just get the axis number and use it:

time_index = data.dims.index('time')

If you do need time to be the last dimension then you could transpose?

dim_order = list(da.dims)
dim_order.remove('time')
dim_order.append('time')
data = data.transpose(*dim_order)

DamienIrving · 2023-10-30T04:51:55Z

In terms of the setup, our extreme value analysis code in the general_utils module is getting pretty substantial now (with your non-stationary stuff plus the event_in_context, return_period, gev_return_curve and plot_gev_return_curve) so we could possibly put all that in a new module (called eva.py or something).

With respect to finding the docstring, we'd want users to be able to run the following to get the detailed docstring:

from unseen import eva
help(eva.fit_gev)

Does everything need to be within the ns_genextreme_gen class, or could the fit method within the class actually be a function (with docstring) on its own (called fit_gev)?

stellema · 2023-11-01T23:09:29Z

Yeah, I think it might be a good idea to separate the eva functions. There won't be much left in general_utils, but at least the stat functions will be easier to find.

The function just needed the name of the time dimension, so I added that as a keyword in case it isn't time.

Yeah, the function doesn't need to be within the ns_genextreme_gen class. I just copied the setup of the scipy.stats.genextreme class. I changed it to be a function as you suggested, which fixes the docstring problem.

I also separated and vectorized the check_gev_fit function because it might be useful for the unit tests. I think that the function only works if the distribution parameters are in a data array, so I might have to fix that up. I might have to fix fit_gev so that if, for example, the parameters fail the goodness of fit test at some lat/lon points, then it will only try to re-fit the data at those points. I also need to add the scipy minimum version for stats.goodness_of_fit.

stellema · 2023-11-09T05:52:35Z

I'm having some issues with xr.apply_ufunc - if I don't specify output_sizes I get an error from xarray:

theta = apply_ufunc(_fit, data, input_core_dims=[[time_dim]], output_core_dims=[["theta"]], 
                    vectorize=True, dask="parallelized", kwargs=kwargs,
                    dask_gufunc_kwargs=dict(output_dtypes="float64"))

ValueError: dimension 'theta' in 'output_core_dims' needs corresponding (dim, size) in 'output_sizes'

But, if I do add it (in dask_gufunc_kwargs or not) then I get an error from dask:

theta = apply_ufunc(_fit, data, input_core_dims=[[time_dim]], output_core_dims=[["theta"]],
                    vectorize=True, dask="parallelized", kwargs=kwargs,
                    dask_gufunc_kwargs=dict(output_dtypes="float64", output_sizes={"theta": n}, 
                    meta=(np.ndarray(n, dtype='float64'))))

TypeError: xarray.core.daskmanager.DaskManager.apply_gufunc() got multiple values for keyword argument 'output_dtypes'

Any ideas why? The input function just returns a 1d array (e.g., np.array([shape, loc, scale]).

For scipy.stats.goodness_of_fit

Add some tests for fit_gev

Update apply_ufunc kw to dask="allowed". Add pvalue alpha to check_gev_fit.

Add non stationary gev fit tests

stellema · 2023-11-10T06:40:21Z

I seem to have fixed the previous apply_ufunc errors - they don't come up in any of the tests now.
The other error that I mentioned in the meeting is back - using 3d dask arrays as input to fit_gev gives this error:

import numpy as np
import xarray as xr
from scipy.stats import genextreme
from unseen.general_utils import fit_gev

# Create 3d data
time = np.arange("2000-01-01", "2002-01-01", dtype=np.datetime64)
lats = np.arange(0, 2)
lons = np.arange(0, 2)
shape = (len(lats), len(lons))

c = np.random.rand(*shape)
loc = np.random.rand(*shape) + np.random.randint(-10, 10, shape)
scale = np.random.rand(*shape)
theta_i = np.stack([c, loc, scale], axis=-1)

rvs = genextreme.rvs(
    c, loc=loc, scale=scale, size=(time.size, *shape), random_state=0
)
data = xr.DataArray(rvs, coords=[time, lats, lons], dims=["time", "lat", "lon"])

# Chunk data
data = data.chunk({"time": -1, "lat": 1, "lon": 1})

# Test fit
theta = fit_gev(data, stationary=True, check_fit=False)

ValueError: applied function returned data with an unexpected number of dimensions. Received 1 dimension(s) but expected 3 dimensions with names ('lat', 'lon', 'theta'), from:

array((0.5796542064710959, 4.557682364559278, 5.541214468170791), dtype=object)

The input data has dimensions (time, lat, lon) and the output should be (lat, lon, theta). The _fit function should take in a 1D timeseries for each location and give back a 1D array of parameters. Instead, it seems to be giving _fit a 3D array and expecting it to return another 3D array - similar to this issue.

Wrap fit_gev in xr.apply_ufunc

c19f46b

Update fit_gev and check_gev_fit

e653032

stellema added 4 commits November 10, 2023 10:58

Add scipy minimum version

81569f2

For scipy.stats.goodness_of_fit

Create test_eva.py

c281d00

Add some tests for fit_gev

Update general_utils.py

7f3471b

Update apply_ufunc kw to dask="allowed". Add pvalue alpha to check_gev_fit.

Update test_eva.py

dfc770a

Add non stationary gev fit tests

stellema and others added 7 commits November 16, 2023 11:47

Update test_eva assertions and update fit_gev return type

c249008

Update pre-commit.yml

8e756fe

Update .pre-commit-config.yaml

0c05943

Fix E721 lint error

12b7620

Update test_eva.py (fix nanosecond warning)

75ca7cf

Update fit_gev dask options and edit test_eva docstrings

59f7371

Add xclim min version to ci/environment.yml

f794518

stellema marked this pull request as ready for review November 17, 2023 02:17

stellema merged commit 26f5650 into AusClimateService:master Nov 17, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add xr.apply_ufunc wrapper to fit_gev #58

Add xr.apply_ufunc wrapper to fit_gev #58

stellema commented Oct 24, 2023

codecov-commenter commented Oct 24, 2023 •

edited

Loading

DamienIrving commented Oct 30, 2023

DamienIrving commented Oct 30, 2023

stellema commented Nov 1, 2023

stellema commented Nov 9, 2023

stellema commented Nov 10, 2023

Add xr.apply_ufunc wrapper to fit_gev #58

Add xr.apply_ufunc wrapper to fit_gev #58

Conversation

stellema commented Oct 24, 2023

codecov-commenter commented Oct 24, 2023 • edited Loading

Codecov Report

DamienIrving commented Oct 30, 2023

DamienIrving commented Oct 30, 2023

stellema commented Nov 1, 2023

stellema commented Nov 9, 2023

stellema commented Nov 10, 2023

codecov-commenter commented Oct 24, 2023 •

edited

Loading