
How to calculate climate metrics faster, in the case of mask nc data? #1545

Closed

CGL5230 opened this issue Dec 1, 2023 · 7 comments
Labels: support (Questions and help for users/developers)

CGL5230 commented Dec 1, 2023

Setup Information

  • Xclim version: 0.46.0
  • Python version: 3.11
  • Operating System: Linux

My original data is huge (22 GB), and as a result my program crashes frequently. It's a global precipitation dataset from the ERA5 data pool.

My idea is to mask this data with my shapefile and then calculate the climate indicators, which might be faster.

ERA5_pre_clip = pre_ERA5.rio.clip(gdf.geometry.apply(mapping), gdf.crs, drop=True, invert=False)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre_clip.tp, per=95, window=5)

But so far it doesn't seem to be much faster. I guess xclim can't skip the NaN values?

Now I use this command. Is it right? (Referenced from this.)

with xclim.set_options(check_missing="any"):
    ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre_clip.tp, per=95, window=5)


Zeitsperre (Collaborator) commented

Hi @CGL5230,

I'm wondering if you are making proper use of Dask. All the indicators and indices are optimized to make use of dask.array objects. These can be created by opening your NetCDF datasets in xarray with the chunks option.

In case you haven't come across this before: https://examples.dask.org/xarray.html
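
Something like this (a minimal sketch; the file name, variable name and chunk sizes are placeholders to adapt to your data):

import xarray as xr
import xclim

# Opening with `chunks` gives dask-backed arrays, so xclim builds a lazy task
# graph instead of loading the full 22 GB into memory at once. Chunking over
# space and keeping time whole is usually a safe layout for day-of-year
# percentiles, which roll a window along the time dimension.
ds = xr.open_dataset("era5_precip.nc", chunks={"time": -1, "lat": 100, "lon": 100})
p95 = xclim.core.calendar.percentile_doy(ds.tp, per=95, window=5)
p95 = p95.compute()  # nothing runs (in parallel) until this call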

aulemahal (Collaborator) commented Dec 1, 2023

I second the recommendation about using dask; IMO this is the first step to try. In addition to the good documentation linked above, you can look at the examples in xclim's docs.

The check_missing='any' option will not change the computation itself. It will only change the "post-processing" step. All indicators that perform some kind of resampling (i.e. functions under xclim.atmos) have this masking step that puts NaN for the periods where there were too many invalid values. In fact, check_missing='any' is already the default; it simply means that any NaN in a given period will propagate as a NaN in the output.
Also, percentile_doy is not an indicator, so setting check_missing has no effect on this function.
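
To illustrate the difference (a sketch; pr here stands for a hypothetical daily precipitation DataArray with proper units):

import xclim

# Affects the output: periods with missing days get masked per the option.
with xclim.set_options(check_missing="skip"):
    rx1day = xclim.atmos.max_1day_precipitation_amount(pr=pr, freq="YS")

# No effect: percentile_doy is a plain function, not an indicator.
with xclim.set_options(check_missing="skip"):
    p95 = xclim.core.calendar.percentile_doy(pr, per=95, window=5)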

So, simply masking your data by clipping areas of interest will not significantly accelerate the computation.

It seems, however, that after clipping, your data is extremely sparse. In this case you could pull the following trick. It will only work if the "mask"¹ is constant in time. The idea is that you could collapse your 2D grid to a 1D one by dropping the points where there is never any data. This will reduce the size of the dataset greatly and should make everything faster.

ERA5_pre_clip = pre_ERA5.rio.clip(gdf.geometry.apply(mapping), gdf.crs, drop=True, invert=False) 
# Put all lon and lat points into a single dimension, drop any spatial points where everything is NaN
ERA5_pre_clip_1d = ERA5_pre_clip.stack(site=['lat', 'lon']).dropna('site', how='all')

# make xclim computation
out_1d = blablabla

# Recover full grid
out_2d = out_1d.unstack('site').reindex(lat=pre_ERA5.lat, lon=pre_ERA5.lon)
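
For instance, the middle step could be the percentile_doy call from the first message (equally untested):

# `site` is just another dimension to xclim, so the call is unchanged:
out_1d = xclim.core.calendar.percentile_doy(ERA5_pre_clip_1d.tp, per=95, window=5)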

I'll leave the details of this implementation to you; this is just an (untested!) idea.
Chances are that using dask will be easier and faster anyway.

Footnotes

  1. which spatial points are NaN and which have data

CGL5230 (Author) commented Dec 1, 2023

That's great! I will try dask for the calculations. I'm also considering @aulemahal's approach: the obvious convenience is avoiding the NaN values, which will save a lot of resources and time.
I also looked through the dask examples you offered; they are great and make things easier.

CGL5230 (Author) commented Dec 1, 2023

Is there any guidance for setting the chunks? As you can see, the dataset is 5844 days (16 years). Maybe I should set the chunks relative to the number of years, like chunks = {"time": 365}? But there is another question: some years are not 365 days. Does this split affect the final result?
If I set the chunks parameter when I open the dataset, will xclim pick it up and parallelize the operations as in the link? (https://xclim.readthedocs.io/en/stable/notebooks/example.html#Optimizing-the-chunk-size)

chunks = {"time": 365}
ERA5_pre = xr.open_dataset('the path of the nc file', chunks=chunks)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre.tp, per=95, window=5)
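
From what I understand of the linked docs, chunk boundaries only change performance, never the numerical result, so the leap-year mismatch should be harmless. Maybe the safer layout for percentile_doy, which rolls a window along the time axis, is to chunk over space and keep time whole:

# Chunk over space, keep the whole time axis in one chunk:
chunks = {"time": -1, "lat": 100, "lon": 100}
ERA5_pre = xr.open_dataset('the path of the nc file', chunks=chunks)
ERA5_p95 = xclim.core.calendar.percentile_doy(ERA5_pre.tp, per=95, window=5)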


CGL5230 (Author) commented Dec 2, 2023

I used these commands to parallelize the xclim operations. But I don't know whether this code actually computes in parallel, because the computation still makes the kernel die.

The code is:
pre_ERA5_chunk = pre_ERA5.chunk({'time': -1, 'lon': 100, 'lat': 100})
wb_ERA5 = xclim.indices.water_budget(pr=pre_ERA5_chunk.tp, tas=tas_ERA5.t2m, method="MB05")
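
Maybe running it under an explicit dask distributed client would help, so that memory per worker is bounded and I can watch the dashboard (the worker count and memory limit below are just guesses to tune):

from dask.distributed import Client

# Bounds memory per worker and prints a dashboard link for watching progress.
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")

pre_ERA5_chunk = pre_ERA5.chunk({'time': -1, 'lon': 100, 'lat': 100})
wb_ERA5 = xclim.indices.water_budget(pr=pre_ERA5_chunk.tp, tas=tas_ERA5.t2m, method="MB05")
wb_ERA5 = wb_ERA5.compute()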

CGL5230 (Author) commented Dec 2, 2023

I used @aulemahal's suggestion, and it is faster: dropping the NaN values speeds up the calculation of the climate index, because the clipped dataset is obviously sparse. I'm still not sure whether dask works for me, but the problem is solved :) cheers!


CGL5230 (Author) commented Dec 8, 2023

Dear @aulemahal, I followed your suggestion and got the SPEI of the clipped dataset. But when I clip again using a sub-area of the clipped range, I get this error: NoDataInBounds: No data found in bounds. Data variable: __xarray_dataarray_variable__.
I think it should have data, because the sub-area lies inside the larger clipping range. That's very confusing.
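
My guess is that the stack/unstack/reindex round-trip dropped rioxarray's spatial metadata, so I will try re-attaching it before clipping again (here spei is a placeholder for my result array and sub_gdf for the GeoDataFrame of the sub-area):

import rioxarray  # registers the .rio accessor
from shapely.geometry import mapping

# Re-declare the spatial dims and CRS that the stack/unstack round-trip lost,
# then clip to the sub-area.
spei = spei.rio.set_spatial_dims(x_dim="lon", y_dim="lat")
spei = spei.rio.write_crs(gdf.crs)
spei_sub = spei.rio.clip(sub_gdf.geometry.apply(mapping), sub_gdf.crs, drop=True)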
