
Usage on bias correcting extremes #1512

Closed · 1 task done
mubali101 opened this issue Oct 26, 2023 · 10 comments

Labels: sdba (Issues concerning the sdba submodule), support (Questions and help for users/developers)

Comments

@mubali101

mubali101 commented Oct 26, 2023

Setup Information

  • Xclim version: 0.45.0
  • Python version: 3.10.9
  • Operating System: Linux

Context

Edit: I'm trying to understand the usage of xclim.sdba.adjustment.ExtremeValues using the tasmax variable from the tutorial dataset.

  1. Does cluster_thresh mean that the q_thresh percentile is calculated on the values exceeding cluster_thresh? And how can I set a different cluster_thresh for different locations?
  2. How should I modify the usage if I want to calibrate the coldest temperatures instead?
  3. I also saw that you have referenced Julia ClimateTools. Is the xclim implementation also automatically combining empirical quantile mapping and GPD-based corrections, similar to Julia ClimateTools?
import xarray as xr
import xclim
from xclim import sdba
from xclim.core.units import convert_units_to
from xclim.testing import open_dataset

# open tutorial temperature data
dref = open_dataset(
    "sdba/ahccd_1950-2013.nc", chunks={"location": 1}, 
    drop_variables=["lat", "lon"]).sel(time=slice("1980", "2010"))

dref = dref.drop_vars('pr')
dref = dref.assign(
    tasmax=convert_units_to(dref.tasmax, "K"),
)
dsim = open_dataset(
    "sdba/CanESM2_1950-2100.nc", chunks={"location": 1}, drop_variables=["lat", "lon", "pr"]
)

dhist = dsim.sel(time=slice("1981", "2010"))

# train GPD-based correction
ev = sdba.adjustment.ExtremeValues.train(ref=dref.tasmax, hist=dhist.tasmax, cluster_thresh='295 K', q_thresh=0.95)

# adjust
hist_bc = ev.adjust(scen=dref.tasmax, sim=dhist.tasmax)

Thank you!

@mubali101 mubali101 added the support Questions and help for users/developers label Oct 26, 2023
@aulemahal
Copy link
Collaborator

Hi @mubali101 !

cluster_thresh

Yes, only values above cluster_thresh are used in the quantile computation with q_thresh.
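For reference, this is roughly what that selection amounts to; a conceptual sketch, not the actual xclim internals, with cluster_thresh and q_thresh written out as plain values:

# Keep only values at or above cluster_thresh, then take the q_thresh quantile
# over time, i.e. one threshold per location (assuming tasmax is in K).
per_location_thresh = (
    dref.tasmax
    .where(dref.tasmax >= 295.0)   # cluster_thresh = "295 K"
    .quantile(0.95, dim="time")    # q_thresh = 0.95
)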

Sadly, no, there is no way to specify a spatially-varying cluster_thresh. I think this would be possible to implement in xclim; it looks like a task of moderate complexity. We would be happy to guide and help you if you'd like to try it.

The alternative for the moment would be to loop over your points...
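If you do go the looping route, a rough and untested sketch could look like this (the per-location threshold and first_order_bc, the first-order adjusted data discussed below, are placeholders you would provide yourself):

# Train one ExtremeValues instance per location, each with its own cluster_thresh.
scens = []
for loc in dref.location.values:
    thresh = "295 K"  # choose a per-location threshold here
    ev_loc = sdba.adjustment.ExtremeValues.train(
        ref=dref.tasmax.sel(location=loc),
        hist=dhist.tasmax.sel(location=loc),
        cluster_thresh=thresh,
        q_thresh=0.95,
    )
    scens.append(
        ev_loc.adjust(
            scen=first_order_bc.sel(location=loc),  # first-order adjusted data (placeholder)
            sim=dhist.tasmax.sel(location=loc),
        )
    )
scen_ev = xr.concat(scens, dim="location")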

Lower extremes

Sounds dumb, but the ">=" is hardcoded everywhere. So again, there's no elegant way to calibrate low extremes.

Here's a funky alternative for your example: negate the data before the correction and negate the result.

# train GPD-based correction on the negated data, so the cold tail becomes the upper tail
ev = sdba.adjustment.ExtremeValues.train(ref=-dref.tasmax, hist=-dhist.tasmax, cluster_thresh='-265 K', q_thresh=0.95)

# adjust, then negate the result back
hist_bc = -ev.adjust(scen=-dhist.tasmax, sim=-dhist.tasmax)

xclim doesn't care if it receives negative Kelvins. But make sure all your inputs (and cluster_thresh) have the same units (K) before doing that!

Combination

Yes, but you have to perform the EQM yourself beforehand. Here's a modified version of your code:

import xarray as xr
import xclim
from xclim import sdba
from xclim.core.units import convert_units_to
from xclim.testing import open_dataset

# open tutorial temperature data
dref = open_dataset(
    "sdba/ahccd_1950-2013.nc", chunks={"location": 1}, 
    drop_variables=["lat", "lon", "pr"]
).sel(time=slice("1981", "2010"))

dref = dref.assign(
    tasmax=convert_units_to(dref.tasmax, "K"),
)
dsim = open_dataset(
    "sdba/CanESM2_1950-2100.nc", chunks={"location": 1}, drop_variables=["lat", "lon", "pr"]
)

dhist = dsim.sel(time=slice("1981", "2010"))

# Initial EQM
EQM = sdba.adjustment.EmpiricalQuantileMapping.train(
    dref.tasmax, dhist.tasmax, nquantiles=50, kind='+', group='time'
)
hist_bc1 = EQM.adjust(dsim.tasmax)

# train GPD-based correction
ev = sdba.adjustment.ExtremeValues.train(
    ref=dref.tasmax, hist=dhist.tasmax, cluster_thresh='295 K', q_thresh=0.95
)

# adjust
hist_bc2 = ev.adjust(scen=hist_bc1, sim=dhist.tasmax) 

In your example, you were passing dref.tasmax to the scen argument of ev.adjust. This is incorrect: scen is the first-order bias-adjusted data and should have the exact same shape as sim. In this example, I perform a simple EQM as the first-order adjustment.

The power and frac arguments of adjust control the way the two timeseries (first-order and second-order) are merged, in a similar way to the Julia code.
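For example, something like this (a sketch only; the values are illustrative, check the adjust docstring for the exact defaults and meaning):

# frac and power control how the EQM scenario and the GPD-based correction
# are blended above the threshold (the values below are illustrative only).
hist_bc2 = ev.adjust(scen=hist_bc1, sim=dhist.tasmax, frac=0.25, power=1.0)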

Hope that helps!

@RondeauG
Contributor

RondeauG commented Oct 27, 2023

Hi! I'm one of the co-authors of this method. @aulemahal already did a good job of answering your questions. As he mentioned, the version of the code here in xclim only performs the correction of extreme values, so you'll need to call another bias adjustment method such as DQM or EQM prior to calling ExtremeValues.

A few relevant details:

  • This method was primarily designed for precipitation or other similar variables that have a strong distribution tail. We never tested it for temperatures, so you'll likely have to do additional verification on your side to make sure that the statistical fit is adequate. This is also why >= was hard-coded, although the trick mentioned by @aulemahal should do the job.
  • Linked to the point above, the role of cluster_thresh was to eliminate drizzle before computing the 95th percentile (q_thresh), using cluster_thresh='1 mm d-1', so in that instance no spatial variation was warranted (see the sketch below). I'm not sure what cluster_thresh should be for temperature... Maybe just 0 K, in order to include all data? Here too, you'll have to explore to find the best value for your variables.
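For precipitation, the call this method was designed around would look something like this (a sketch; pr_ref and pr_hist are placeholder DataArrays):

# Drizzle (< 1 mm/d) is excluded before the 95th-percentile threshold is computed.
ev_pr = sdba.adjustment.ExtremeValues.train(
    ref=pr_ref, hist=pr_hist, cluster_thresh="1 mm d-1", q_thresh=0.95
)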

EDIT: Another potential issue is that a multiplicative adjustment has been hard-coded, since it was designed for precipitation. This is probably not adequate for temperatures (we usually use additive correction factors for those). Here too, you'll probably want to look at what the function is doing to make sure the results are realistic.

@mubali101
Author

mubali101 commented Oct 27, 2023

Hi @RondeauG @aulemahal, thanks for your explanations, I really appreciate it.

  • Flipping the data was also what I had in mind, good to have confirmation of that.
  • The cluster_thresh name confused me, since I associated it with the peak-over-threshold approach used with the GPD to identify clusters. Maybe additional context could be added to the function's help.
  • Thanks for highlighting the point on multiplicative adjustment. I can try to dig into the source code once more, but I struggled last time because there are a lot of nested function calls, which makes it complex to follow for someone unfamiliar with the project. Do you have a separate channel for specific questions about the source code?

@aulemahal
Collaborator

Indeed, after many optimization attempts, the code is now a bit complex and hard to follow...

In theory we have a Gitter channel: https://app.gitter.im/#/room/#Ouranosinc_xclim:gitter.im. It's not very active, but I think the core team will receive notifications.

@RondeauG
Contributor

RondeauG commented Oct 27, 2023

Diving into the code a bit more, cluster_thresh as we have written it actually plays a double role that only really works for precipitation.

It is used once to subset the data that is greater than the value (i.e., remove drizzle with 1 mm d-1), but you are right that it is also reused later when determining the clusters for the peak-over-threshold method, which is where the variable gets its name. This is fine for precipitation, since it drops back to 0 after an event, but it will not easily work for temperature. For it to work correctly for your use case, we'd need to separate it into two parameters.
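To illustrate that double role with a toy example (purely conceptual, not xclim's actual code):

import numpy as np

precip = np.array([0.0, 0.2, 5.0, 12.0, 3.0, 0.0, 0.0, 8.0, 20.0, 0.5, 0.0])
thresh = 1.0  # "1 mm d-1"

# Role 1: remove drizzle before computing the q_thresh quantile.
wet = precip[precip >= thresh]
q95 = np.quantile(wet, 0.95)

# Role 2: the same threshold separates the peak-over-threshold clusters,
# i.e. consecutive runs of values >= thresh. Precipitation falls back below
# the threshold (to 0) between events, so [5, 12, 3] and [8, 20] form distinct
# clusters; daily temperature rarely dips below a single fixed threshold
# between events, which is why the two roles would need to be split.
above = precip >= thresh
cluster_starts = np.flatnonzero(above & ~np.r_[False, above[:-1]])  # -> [2, 7]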

@mubali101
Author

mubali101 commented Nov 9, 2023

@RondeauG @aulemahal an update on this:

  • I tried with negative temperatures, and I started getting some error messages about NaNs.
  • I switched to pyextremes for my use case. It is quite handy for EV analysis, especially for checking the sensitivity of the shape and scale parameters, etc., and it can fit a GPD or GEV to values either exceeding or below a threshold (see the sketch after this list).
  • However, it only works with pandas Series. Maybe xclim could make use of it to extend GPD-based bias calibration.
  • The adjustment factors that you have implemented following Roy et al. 2023 seem quite clever. Is there any literature that discusses adjusting temperature in a similar fashion? I found papers mostly on adjusting precipitation.
  • I still need to test whether the exact adjustment you have used could work for my case, but that will need some more digging into the source code on my side.
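For reference, the pyextremes workflow described above looks roughly like this (a sketch from memory, untested; check the pyextremes documentation for the exact API; tasmin_series is a placeholder pandas Series with a DatetimeIndex):

from pyextremes import EVA

model = EVA(tasmin_series)
# Peaks-over-threshold on the lower tail (cold extremes).
model.get_extremes(method="POT", extremes_type="low", threshold=250.0)
model.fit_model()
summary = model.get_summary(return_period=[10, 50, 100])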

@RondeauG
Contributor

Hi! I can't really comment on pyextremes or the feasibility of bringing it into xclim, but P. Roy tried to use GPD-based bias adjustment on temperature in an earlier version of our method. From what I can recall, he ended up putting it aside since the GPDs would sometimes create unphysical results (such as temperatures above 40 °C in places where we know such values can't happen). This could have been because of bugs or other issues that have since been resolved, but temperatures tend to follow normal distributions anyway and are much easier to bias correct using quantile mapping. I'd guess that this last part is why most papers that use extreme value analysis for bias correction also mainly work with precipitation.

@aulemahal
Collaborator

I didn't know about pyextremes! It seems well coded, and even though it is pandas-based, wrapping it in sdba could be feasible. We would maybe lose some performance (mainly from the wrapping overhead, and in comparison with our use of numba), but we might gain more options. This would however require some work, and the core team might not have time for it. We would be happy to help with a contributed PR!

I think it would mainly consist of wrapping the numpy functions of pyextremes with xr.apply_ufunc calls, all from within a map_groups/map_blocks of xclim.sdba.
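A very rough sketch of that pattern (untested; _fit_gpd_1d is a hypothetical per-series function that would call pyextremes or scipy, and da stands for any DataArray with a "time" dimension):

import numpy as np
import xarray as xr

def _fit_gpd_1d(arr, thresh):
    # Placeholder: a real implementation would fit a GPD (e.g. via pyextremes
    # or scipy.stats.genpareto) to the exceedances above `thresh` and return
    # its parameters; dummy values are returned here just to show the shapes.
    exceedances = arr[arr >= thresh] - thresh
    if exceedances.size == 0:
        return np.full(3, np.nan)
    return np.array([0.1, thresh, exceedances.std()])

params = xr.apply_ufunc(
    _fit_gpd_1d,
    da,
    kwargs={"thresh": 295.0},
    input_core_dims=[["time"]],
    output_core_dims=[["dparams"]],
    vectorize=True,
    dask="parallelized",
    output_dtypes=[float],
    dask_gufunc_kwargs={"output_sizes": {"dparams": 3}},
)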

@mubali101
Author

mubali101 commented Nov 22, 2023

@aulemahal @RondeauG many thanks for your replies.

I tested pyextremes a bit more. Unfortunately, a big drawback is that it only works with pandas datetimes, and one can't convert from cftime to pandas datetime when the number of years exceeds what pandas currently supports.

On EQ mapping for temperature calibration: in my use case of calibrating colder extremes, calibration of the tail data is quite important. EQ mapping seems to work fine for the tail using the following code:

# using 10K quantiles
QM = sdba.EmpiricalQuantileMapping.train(
    ref=era5_data, hist=model_data, nquantiles=10000, group="time", kind="+"
)

# interp="cubic" doesn't differ for the tails
qm_model_adjusted = QM.adjust(model_data, extrapolation="constant", interp="linear")

But it is not clear to me how exactly it calibrates the values that are more extreme than the reference data. Probably similarly to this issue. Does extrapolation="constant" control the tail extrapolation? The documentation points to xclim.sdba.utils.extrapolate_qm(), but I can't find it in the utils.py file.

I have added an exceedance probability plot of the calibrated data for different nquantiles values (100, 10K, 500K), with more extreme values at the lower left; the orange curve is the reference data (ERA5). It shows that the tail calibration is quite decent as long as nquantiles is not too high, otherwise the extremes start clipping.

[figure: exceedance probability of the calibrated data for nquantiles = 100, 10K and 500K, with the ERA5 reference in orange]

@aulemahal
Collaborator

aulemahal commented Nov 22, 2023

Do you have access to pandas >= 2? With that version, you may be able to have datetimes exceeding the usual 1677-2262 limitations. I don't recall the actual conversion method, but it has to do with the "unit" property of the Timestamp object. When set to 's', the time is stored as an int64 of seconds elapsed since 1970-01-01, which allows years after 2262. Xarray might not support having a datetime axis of that type, but a DataFrame should be usable. It still doesn't support non-standard calendars, though.
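Something along these lines might do it (untested, and the exact constructor behaviour may depend on the pandas version):

import numpy as np
import pandas as pd

# Build a second-resolution index; with pandas >= 2 the "s" unit should be kept
# instead of being cast to nanoseconds, so years past 2262 stay representable.
times = np.arange("2200-01-01", "2400-01-01", dtype="datetime64[D]").astype("datetime64[s]")
idx = pd.DatetimeIndex(times)
print(idx.dtype)  # expected: datetime64[s] under pandas >= 2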

Indeed, the documentation needs an update. The extrapolation was moved to the factor interpolation function itself. In your case (no special grouping), that's here:

xclim/xclim/sdba/utils.py

Lines 323 to 327 in 473a3e0

if extrap == "constant":
    fill_value = (
        oldy[~np.isnan(oldy)][0],
        oldy[~np.isnan(oldy)][-1],
    )

With "constant", it will simply apply the nearest (last or first) adjustment factor.

With this in mind, I am not surprised that you get stranger results as you increase the number of quantiles. When there are "too many" quantiles, the factors are no longer a smooth-ish function; they become very noisy, and the last factor can then be quite unstable.

@aulemahal aulemahal added the sdba Issues concerning the sdba submodule. label Jan 10, 2024
@Ouranosinc Ouranosinc locked and limited conversation to collaborators Jan 10, 2024
@aulemahal aulemahal converted this issue into discussion #1586 Jan 10, 2024
