Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add xr.apply_ufunc wrapper to fit_gev #58

Merged
merged 13 commits into from
Nov 17, 2023

Conversation

stellema
Copy link
Member

The fit_gev function is now wrapped with xr.apply_ufunc, so fitting data to a distribution (normal or non-stationary version) should work for multi-dim input.

If the input data is a 1D timerseries, the output will just be the expected values of the parameters:

shape, loc, scale = fit_gev(data)

Otherwise, the output will be a data array:

theta = fit_gev(data)
shape, loc, scale = theta.isel(lat=0, lon=0).values

The function assumes that the input data has time on the last axis (e.g., dims=(ensemble, lat, lon, time)) and infers the dimension name from that. However, I'm not sure if that is a common way of ordering dimensions.

Do you think it will be better to have a time_dim='time' input argument and infer the axis position from that or something?

Also, the overall function setup probably isn't great and it might be hard to find the actual docstring:

fit_gev = ns_genextreme_gen().fit

where ns_genextreme_gen().fit is just a wrapper for the actual fitting function,ns_genextreme_gen()._fit.

@DamienIrving and @dougiesquire any suggestions for improvements?

@codecov-commenter
Copy link

codecov-commenter commented Oct 24, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (c850bea) 31.97% compared to head (f794518) 38.13%.

Files Patch % Lines
unseen/general_utils.py 82.60% 8 Missing ⚠️
unseen/fileio.py 0.00% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   31.97%   38.13%   +6.15%     
==========================================
  Files          18       19       +1     
  Lines        1720     1820     +100     
==========================================
+ Hits          550      694     +144     
+ Misses       1170     1126      -44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@DamienIrving
Copy link
Member

In terms of the order of dimensions, if the time dimension doesn't literally need to be last you could just get the axis number and use it:

time_index = data.dims.index('time')

If you do need time to be the last dimension then you could transpose?

dim_order = list(da.dims)
dim_order.remove('time')
dim_order.append('time')
data = data.transpose(*dim_order)

@DamienIrving
Copy link
Member

In terms of the setup, our extreme value analysis code in the general_utils module is getting pretty substantial now (with your non-stationary stuff plus the event_in_context, return_period, gev_return_curve and plot_gev_return_curve) so we could possibly put all that in a new module (called eva.py or something).

With respect to finding the docstring, we'd want users to be able to run the following to get the detailed docstring:

from unseen import eva
help(eva.fit_gev)

Does everything need to be within the ns_genextreme_gen class, or could the fit method within the class actually be a function (with docstring) on its own (called fit_gev)?

@stellema
Copy link
Member Author

stellema commented Nov 1, 2023

Yeah, I think it might be a good idea to separate the eva functions. There won't be much left in general_utils, but at least the stat functions will be easier to find.

The function just needed the name of the time dimension, so I added that as a keyword in case it isn't time.

Yeah, the function doesn't need to be within the ns_genextreme_gen class. I just copied the setup of the scipy.stats.genextreme class. I changed it to be a function as you suggested, which fixes the docstring problem.

I also separated and vectorized the check_gev_fit function because it might be useful for the unit tests. I think that the function only works if the distribution parameters are in a data array, so I might have to fix that up. I might have to fix fit_gev so that if, for example, the parameters fail the goodness of fit test at some lat/lon points, then it will only try to re-fit the data at those points. I also need to add the scipy minimum version for stats.goodness_of_fit.

@stellema
Copy link
Member Author

stellema commented Nov 9, 2023

I'm having some issues with xr.apply_ufunc - if I don't specify output_sizes I get an error from xarray:

theta = apply_ufunc(_fit, data, input_core_dims=[[time_dim]], output_core_dims=[["theta"]], 
                    vectorize=True, dask="parallelized", kwargs=kwargs,
                    dask_gufunc_kwargs=dict(output_dtypes="float64"))
ValueError: dimension 'theta' in 'output_core_dims' needs corresponding (dim, size) in 'output_sizes'

But, if I do add it (in dask_gufunc_kwargs or not) then I get an error from dask:

theta = apply_ufunc(_fit, data, input_core_dims=[[time_dim]], output_core_dims=[["theta"]],
                    vectorize=True, dask="parallelized", kwargs=kwargs,
                    dask_gufunc_kwargs=dict(output_dtypes="float64", output_sizes={"theta": n}, 
                    meta=(np.ndarray(n, dtype='float64'))))
TypeError: xarray.core.daskmanager.DaskManager.apply_gufunc() got multiple values for keyword argument 'output_dtypes'

Any ideas why? The input function just returns a 1d array (e.g., np.array([shape, loc, scale]).

For scipy.stats.goodness_of_fit
Add some tests for fit_gev
Update apply_ufunc kw to dask="allowed". Add pvalue alpha to check_gev_fit.
Add non stationary gev fit tests
@stellema
Copy link
Member Author

I seem to have fixed the previous apply_ufunc errors - they don't come up in any of the tests now.
The other error that I mentioned in the meeting is back - using 3d dask arrays as input to fit_gev gives this error:

import numpy as np
import xarray as xr
from scipy.stats import genextreme
from unseen.general_utils import fit_gev

# Create 3d data
time = np.arange("2000-01-01", "2002-01-01", dtype=np.datetime64)
lats = np.arange(0, 2)
lons = np.arange(0, 2)
shape = (len(lats), len(lons))

c = np.random.rand(*shape)
loc = np.random.rand(*shape) + np.random.randint(-10, 10, shape)
scale = np.random.rand(*shape)
theta_i = np.stack([c, loc, scale], axis=-1)

rvs = genextreme.rvs(
    c, loc=loc, scale=scale, size=(time.size, *shape), random_state=0
)
data = xr.DataArray(rvs, coords=[time, lats, lons], dims=["time", "lat", "lon"])

# Chunk data
data = data.chunk({"time": -1, "lat": 1, "lon": 1})

# Test fit
theta = fit_gev(data, stationary=True, check_fit=False)
ValueError: applied function returned data with an unexpected number of dimensions. Received 1 dimension(s) but expected 3 dimensions with names ('lat', 'lon', 'theta'), from:

array((0.5796542064710959, 4.557682364559278, 5.541214468170791), dtype=object)

The input data has dimensions (time, lat, lon) and the output should be (lat, lon, theta). The _fit function should take in a 1D timeseries for each location and give back a 1D array of parameters. Instead, it seems to be giving _fit a 3D array and expecting it to return another 3D array - similar to this issue.

@stellema stellema marked this pull request as ready for review November 17, 2023 02:17
@stellema stellema merged commit 26f5650 into AusClimateService:master Nov 17, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants