
How to calculate climate metrics faster, in the case of mask nc data? #1545

Closed

CGL5230 opened this issue Dec 1, 2023 · 7 comments
Labels: support (Questions and help for users/developers)

CGL5230 commented Dec 1, 2023

Setup Information

  • Xclim version: 0.46.0
  • Python version: 3.11
  • Operating System: Linux

My original data is huge (22 GB), and as a result my program crashes frequently. It's a global precipitation dataset from the ERA5 data pool.

My idea is to mask this data with my shapefile and then calculate the climate indicators, which might be faster.

ERA5_pre_clip = pre_ERA5.rio.clip(gdf.geometry.apply(mapping), gdf.crs, drop=True, invert=False)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre_clip.tp, per=95, window=5)

But so far it doesn't seem to be much faster. I guess xclim can't skip the NaN values?

Now I use this command. Is it right? (Referenced from this.)

with xclim.set_options(check_missing="any"):
    ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre_clip.tp, per=95, window=5)


Zeitsperre (Collaborator) commented

Hi @CGL5230,

I'm wondering if you are making proper use of Dask. All the indicators and indices are optimized to make use of dask.array objects. These can be created by opening your NetCDF datasets in xarray with the chunks option.

In case you haven't come across this before: https://examples.dask.org/xarray.html
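
Something like this (a minimal sketch; the file name, variable name and chunk sizes are placeholders to adapt to your data):

import xarray as xr
import xclim

# Opening with `chunks` gives dask-backed arrays, so xclim builds a lazy task
# graph instead of loading the full 22 GB into memory at once. Chunking over
# space and keeping time whole is usually a safe layout for day-of-year
# percentiles, which roll a window along the time dimension.
ds = xr.open_dataset("era5_precip.nc", chunks={"time": -1, "lat": 100, "lon": 100})
p95 = xclim.core.calendar.percentile_doy(ds.tp, per=95, window=5)
p95 = p95.compute()  # nothing runs (in parallel) until this call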

aulemahal (Collaborator) commented Dec 1, 2023

I second the recommendation about using dask; IMO this is the first step to try. In addition to the good documentation linked above, you can look at the examples in xclim's docs.

The check_missing='any' option will not change the computation itself. It will only change the "post-processing" step. All indicators that perform some kind of resampling (i.e. functions under xclim.atmos) have this masking step that puts NaN for the periods where there were too many invalid values. In fact, check_missing='any' is already the default; it simply means that any NaN in a given period will propagate as a NaN in the output.
Also, percentile_doy is not an indicator, so setting check_missing has no effect on this function.
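
To illustrate the difference (a sketch; pr here stands for a hypothetical daily precipitation DataArray with proper units):

import xclim

# Affects the output: periods with missing days get masked per the option.
with xclim.set_options(check_missing="skip"):
    rx1day = xclim.atmos.max_1day_precipitation_amount(pr=pr, freq="YS")

# No effect: percentile_doy is a plain function, not an indicator.
with xclim.set_options(check_missing="skip"):
    p95 = xclim.core.calendar.percentile_doy(pr, per=95, window=5)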

So, simply masking your data by clipping areas of interest will not significantly accelerate the computation.

It seems, however, that after clipping, your data is extremely sparse. In this case you could pull the following trick. It will only work if the "mask"¹ is constant in time. The idea is that you could collapse your 2D grid to a 1D one by dropping the points where there is never any data. This will reduce the size of the dataset greatly and should make everything faster.

ERA5_pre_clip = pre_ERA5.rio.clip(gdf.geometry.apply(mapping), gdf.crs, drop=True, invert=False) 
# Put all lon and lat points into a single dimension, drop any spatial points where everything is NaN
ERA5_pre_clip_1d = ERA5_pre_clip.stack(site=['lat', 'lon']).dropna('site', how='all')

# make xclim computation
out_1d = blablabla

# Recover full grid
out_2d = out_1d.unstack('site').reindex(lat=pre_ERA5.lat, lon=pre_ERA5.lon)
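
For instance, the middle step could be the percentile_doy call from the first message (equally untested):

# `site` is just another dimension to xclim, so the call is unchanged:
out_1d = xclim.core.calendar.percentile_doy(ERA5_pre_clip_1d.tp, per=95, window=5)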

I'll leave the details of this implementation to you; this is just an (untested!) idea.
Chances are that using dask will be easier and faster anyway.

Footnotes

  1. which spatial points are NaN and which have data

CGL5230 (Author) commented Dec 1, 2023

That's great! I will try dask for the calculations. I'm also considering @aulemahal's approach: the obvious convenience is avoiding the NaN values, which will save a lot of resources and time.
I also looked through the dask examples you offered; they are great and make things easier.

CGL5230 (Author) commented Dec 1, 2023

Is there any guidance for setting the chunks? As you can see, the dataset is 5844 days (16 years). Maybe I should set the chunks relative to the number of years, like chunks = {"time": 365}? But there is another question: some years are not 365 days. Does this split affect the final result?
If I set the chunks parameter when I open the dataset, will xclim pick it up and parallelize the operations as in the link? (https://xclim.readthedocs.io/en/stable/notebooks/example.html#Optimizing-the-chunk-size)

chunks = {"time": 365}
ERA5_pre = xr.open_dataset('the path of the nc file', chunks=chunks)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre.tp, per=95, window=5)
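
From what I understand of the linked docs, chunk boundaries only change performance, never the numerical result, so the leap-year mismatch should be harmless. Maybe the safer layout for percentile_doy, which rolls a window along the time axis, is to chunk over space and keep time whole:

# Chunk over space, keep the whole time axis in one chunk:
chunks = {"time": -1, "lat": 100, "lon": 100}
ERA5_pre = xr.open_dataset('the path of the nc file', chunks=chunks)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre.tp, per=95, window=5)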


CGL5230 (Author) commented Dec 2, 2023

I used these commands to parallelize the xclim operations. But I don't know whether this code actually computes in parallel, because the computation still makes the kernel die.

The code is:
pre_ERA5_chunk = pre_ERA5.chunk({'time': -1, 'lon': 100, 'lat': 100})
wb_ERA5 = xclim.indices.water_budget(pr=pre_ERA5_chunk.tp, tas=tas_ERA5.t2m, method="MB05")
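
Maybe running it under an explicit dask distributed client would help, so that memory per worker is bounded and I can watch the dashboard (the worker count and memory limit below are just guesses to tune):

from dask.distributed import Client

# Bounds memory per worker and prints a dashboard link for watching progress.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")

pre_ERA5_chunk = pre_ERA5.chunk({'time': -1, 'lon': 100, 'lat': 100})
wb_ERA5 = xclim.indices.water_budget(pr=pre_ERA5_chunk.tp, tas=tas_ERA5.t2m, method="MB05")
wb_ERA5 = wb_ERA5.compute()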

CGL5230 (Author) commented Dec 2, 2023

I used @aulemahal's suggestion, and it is faster: dropping the NaN values speeds up the calculation of the climate index, because the clipped dataset is obviously sparse. I'm still not sure whether dask works for me, but the problem is solved :) cheers!


CGL5230 (Author) commented Dec 8, 2023

Dear @aulemahal, I followed your suggestion and got the SPEI of the clipped dataset. But when I clip again using a sub-area of the clipped range, I get this error: NoDataInBounds: No data found in bounds. Data variable: __xarray_dataarray_variable__.
I think it should have data, because the sub-area lies inside the larger clipping range. That's very confusing.
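
My guess is that the stack/unstack/reindex round-trip dropped rioxarray's spatial metadata, so I will try re-attaching it before clipping again (here spei is a placeholder for my result array and sub_gdf for the GeoDataFrame of the sub-area):

import rioxarray  # registers the .rio accessor
from shapely.geometry import mapping

# Re-declare the spatial dims and CRS that the stack/unstack round-trip lost,
# then clip to the sub-area.
spei = spei.rio.set_spatial_dims(x_dim="lon", y_dim="lat")
spei = spei.rio.write_crs(gdf.crs)
spei_sub = spei.rio.clip(sub_gdf.geometry.apply(mapping), sub_gdf.crs, drop=True)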
