esciencecenter-digital-skills · rogerkuou · May 8, 2024 · May 2, 2024 · May 2, 2024
diff --git a/episodes/10-zonal-statistics.md b/episodes/10-zonal-statistics.md
@@ -1,7 +1,7 @@
 ---
 title: "Calculating Zonal Statistics on Rasters"
 teaching: 40
-exercises: 20
+exercises: 0
 ---
 
 :::questions
@@ -17,159 +17,131 @@
 
 
 
-# Introduction
+## Introduction
 
-Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we used the crop field polygon dataset. The fields with the same crop type can be identified as a "zone", resulting in multiple zones in one vector dataset. One may be interested in performing statistical analysis over these crop zones.
+Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we defined infrastructure regions and built-up regions on Rhodes Island as polygons. Each region can be respectively identified as a "zone", resulting in two zones. One can evualuate the effect of the wild fire on the two zones by calculating the zonal statistics.
 
-In this episode, we will explore how to calculate zonal statistics based on the types of crops in `fields_cropped.shp`. To do this, we will first identify zones from the vector data, then rasterize these vector zones. Finally the zonal statistics for `ndvi` will be calculated over the rasterized zones.
 
+## Data loading
 
-# Making vector and raster data compatible
-First, let's load the `NDVI.tif` file saved in the previous episode to obtained our calculated raster `ndvi` data. We also use the `squeeze()` function in order to reduce our raster data `ndvi` dimension to 2D by removing the singular `band` dimension - this is necessary for use with the `rasterize` and `zonal_stats` functions:
+We have created `assets.gpkg` in Episode "Vector data in Python",  which contains the infrastructure regions and built-up regions . We also calculated the burned index in Episode "Raster Calculations in Python" and saved it in `burned.tif`. Lets load them:
 
 ```python
+# Load burned index
 import rioxarray
-ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
-```
-
-Let's also read the crop fields vector data from our saved `fields_cropped.shp` file.
+burned = rioxarray.open_rasterio('burned.tif')
 
-```python
+# Load assests polygons
 import geopandas as gpd
-fields = gpd.read_file('fields_cropped.shp')
-```
-
-In order to use the vector data as a classifier for our raster, we need to convert the vector data to the appropriate CRS. We can perform the CRS conversion from the vector CRS (EPSG:28992) to our raster `ndvi` CRS (EPSG:32631) with:
-```python
-# Uniform CRS
-fields_utm = fields.to_crs(ndvi.rio.crs)
+assets = gpd.read_file('assets.gpkg')
 ```
 
-# Rasterizing the vector data
+## Align datasets
 
-Before calculating zonal statistics, we first need to rasterize our `fields_utm` vector geodataframe with the `rasterio.features.rasterize` function. With this function, we aim to produce a grid with numerical values representing the types of crops as defined by the column `gewascode` from `field_cropped` - `gewascode` stands for the crop codes as defined by the Netherlands Enterprise Agency (RVO) for different types of crop or `gewas` (Grassland, permanent; Grassland, temporary; corn fields; etc.). This grid of values thus defines the zones for the `xrspatial.zonal_stats` function, where each pixel in the zone grid overlaps with a corresponding pixel in our NDVI raster.
-
-We can generate the `geometry, gewascode` pairs for each vector feature to be used as the first argument to `rasterio.features.rasterize` as:
+Before we continue, let's check if the two datasets are in the same CRS:
 
 ```python
-geom = fields_utm[['geometry', 'gewascode']].values.tolist()
-geom
+print(assets.crs)
+print(burned.rio.crs)
 ```
 
 ```output
-[[<shapely.geometry.polygon.Polygon at 0x7ff88666f670>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff86bf39280>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff86ba1db80>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff86ba1d730>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff86ba1d400>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff86ba1d130>, 265],
-...
- [<shapely.geometry.polygon.Polygon at 0x7ff88685c970>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff88685c9a0>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff88685c9d0>, 265],
- [<shapely.geometry.polygon.Polygon at 0x7ff88685ca00>, 331],
- ...]
+EPSG:4326
+EPSG:32635
 ```
 
-This generates a list of the shapely geometries from the `geometry` column, and the unique field ID from the `gewascode` column in the `fields_utm` geodataframe.
-
-We can now rasterize our vector data using `rasterio.features.rasterize`:
+The two datasets are in different CRS. Let's reproject the assets to the same CRS as the burned index raster:
 
 ```python
-from rasterio import features
-fields_rasterized = features.rasterize(geom, out_shape=ndvi.shape, transform=ndvi.rio.transform())
+assets = assets.to_crs(burned.rio.crs)
 ```
 
-The argument `out_shape` specifies the shape of the output grid in pixel units, while `transform` represents the projection from pixel space to the projected coordinate space. By default, the pixels that are not contained within a polygon in our shapefile will be filled with 0. It's important to pick a fill value that is not the same as any value already defined in `gewascode` or else we won't distinguish between this zone and the background.
+## Rasterize the vector data
 
-Let's inspect the results of rasterization:
+One way to define the zones, is to create a grid space with the same extent and resolution as the burned index raster, and with the numerical values in the grid representing the type of infrastructure, i.e., the zones. This can be done by rasterize the vector data `assets` to the grid space of `burned`.
+
+Let's first take two elements from `assets`, the geometry column, and the code of the region.
 
 ```python
-import numpy as np
-print(fields_rasterized.shape)
-print(np.unique(fields_rasterized))
+geom = assets[['geometry', 'code']].values.tolist()
+geom
 ```
 
 ```output
-(500, 500)
-[  0 259 265 266 331 332 335 863]
+[[<POLYGON ((569708.927 3972332.358, 569709.096 3972333.79, 569710.406 3972341...>,
+  1],
+ [<MULTIPOLYGON (((567767.095 3970740.732, 567767.548 3970741.604, 567772.083 ...>,
+  2]]
 ```
 
-The output `fields_rasterized` is an `np.ndarray` with the same shape as `ndvi`. It contains `gewascode` values at the location of fields, and 0 outside the fields. Let's visualize it:
+The raster image `burned` is a 3D image with a "band" dimension.
 
 ```python
-from matplotlib import pyplot as plt
-plt.imshow(fields_rasterized)
-plt.colorbar()
+burned.shape
 ```
 
-![](fig/E10/rasterization-results.png){alt="rasterization results"}
-
-We will convert the output to `xarray.DataArray` which will be used further. To do this, we will "borrow" the coordinates from `ndvi`, and fill in the rasterization data:
-```python
-import xarray as xr
-fields_rasterized_xarr = ndvi.copy()
-fields_rasterized_xarr.data = fields_rasterized
-
-# visualize
-fields_rasterized_xarr.plot(robust=True)
+```output
+(1, 1131, 1207)
 ```
 
-![](fig/E10/rasterization-results-xr.png){alt="Rasterization results Xarray"}
+To create the grid space, we only need the two spatial dimensions. We can used `.squeeze()` to drop the band dimension:
 
-# Calculate zonal statistics
+```python
+burned_squeeze = burned.squeeze()
+burned_squeeze.shape
+```
 
-In order to calculate the statistics for each crop zone, we call the function, `xrspatial.zonal_stats`. The `xrspatial.zonal_stats` function takes as input `zones`, a 2D `xarray.DataArray`, that defines different zones, and `values`, a 2D `xarray.DataArray` providing input values for calculating statistics.
+```output
+(1131, 1207)
+```
 
-We call the `zonal_stats` function with `fields_rasterized_xarr` as our classifier and the 2D raster with our values of interest `ndvi` to obtain the NDVI statistics for each crop type:
+Now we can use `features.rasterize` from `rasterio` to rasterize the vector data `assets` to the grid space of `burned`:
 
 ```python
-from xrspatial import zonal_stats
-zonal_stats(fields_rasterized_xarr, ndvi)
+from rasterio import features
+assets_rasterized = features.rasterize(geom, out_shape=burned_squeeze.shape, transform=burned.rio.transform())
+assets_rasterized
 ```
 
 ```output
-	zone	mean	max	min	sum	std	var	count
-0	0	0.266531	0.999579	-0.998648	38887.648438	0.409970	0.168075	145903.0
-1	259	0.520282	0.885242	0.289196	449.003052	0.111205	0.012366	863.0
-2	265	0.775609	0.925955	0.060755	66478.976562	0.091089	0.008297	85712.0
-3	266	0.794128	0.918048	0.544686	1037.925781	0.074009	0.005477	1307.0
-4	331	0.703056	0.905304	0.142226	10725.819336	0.102255	0.010456	15256.0
-5	332	0.681699	0.849158	0.178113	321.080261	0.123633	0.015285	471.0
-6	335	0.648063	0.865804	0.239661	313.662598	0.146582	0.021486	484.0
-7	863	0.388575	0.510572	0.185987	1.165724	0.144245	0.020807	3.0
+array([[0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       ...,
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0],
+       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
 ```
 
-The `zonal_stats` function calculates the minimum, maximum, and sum for each zone along with statistical measures such as the mean, variance and standard deviation for each rasterized vector zone. In our raster dataset `zone = 0`, corresponding to non-crop areas, has the highest count followed by `zone = 265` which corresponds to 'Grasland, blijvend' or 'Grassland, permanent'. The highest mean NDVI is observed for `zone = 266` for 'Grasslands, temporary' with the lowest mean, aside from non-crop area, going to `zone = 863` representing 'Forest without replanting obligation'. Thus, the `zonal_stats` function can be used to analyze and understand different sections of our raster data. The definition of the zones can be derived from vector data or from classified raster data as presented in the challenge below:
+## Perform zonal statistics
 
-:::challenge
-## Exercise: Calculate zonal statistics for zones defined by `ndvi_classified`
+The rasterized zones `assets_rasterized` is a `numpy` array. The Python package `xrspatial`, which is the one we will use for zoning statistics, accepts `xarray.DataArray`. We need to first convert  `assets_rasterized`. We can use `burned_squeeze` as a template:
 
-Let's calculate NDVI zonal statistics for the different zones as classified by `ndvi_classified` in the previous episode.
+```python
+assets_rasterized_xarr = burned_squeeze.copy()
+assets_rasterized_xarr.data = assets_rasterized
+assets_rasterized_xarr.plot()
+```
 
-Load both raster datasets: `NDVI.tif` and `NDVI_classified.tif`. Then, calculate zonal statistics for each `class_bins`. Inspect the output of the `zonal_stats` function.
+![](../fig/E10/zones_rasterized_xarray.png){alt="Rasterized zones"}
 
+Then we can calculate the zonal statistics using the `zonal_stats` function:
 
-::::solution
-1) Load and convert raster data into suitable inputs for `zonal_stats`:
-```python
-ndvi = rioxarray.open_rasterio("NDVI.tif").squeeze()
-ndvi_classified = rioxarray.open_rasterio("NDVI_classified.tif").squeeze()
-```
-2) Create and display the zonal statistics table.
 ```python
-zonal_stats(ndvi_classified, ndvi)
+from xrspatial import zonal_stats
+stats = zonal_stats(assets_rasterized_xarr, burned_squeeze)
+stats
 ```
 
 ```output
-	zone    mean       max        min           sum         std      var	  count  
-0     1  -0.355660  -0.000257  -0.998648  -12838.253906  0.145916  0.021291  36097.0  
-1     2   0.110731   0.199839   0.000000    1754.752441  0.055864  0.003121  15847.0  
-2     3   0.507998   0.700000   0.200000   50410.167969  0.140193  0.019654  99233.0  
-3     4   0.798281   0.999579   0.700025   78888.523438  0.051730  0.002676  98823.0
+    zone	  mean	    max	min	sum	    std	      var	      count
+0	    0	    0.022570	1.0	0.0	28929.0	0.148528	0.022061	1281749.0
+1	    1	    0.009607	1.0	0.0	568.0	  0.097542	0.009514	59125.0
+2	    2	    0.000000	0.0	0.0	0.0	    0.000000	0.000000	24243.0
 ```
-::::
-:::
+
+The results provide statistics for three zones: `1` represents infrastructure regions, `2` represents built-up regions, and `0` represents the rest of the area.
+
 
 :::keypoints
 - Zones can be extracted by attribute columns of a vector dataset

diff --git a/episodes/fig/E10/rasterization-results-xr.png b/episodes/fig/E10/rasterization-results-xr.png
diff --git a/episodes/fig/E10/rasterization-results.png b/episodes/fig/E10/rasterization-results.png
diff --git a/episodes/fig/E10/zones_rasterized_xarray.png b/episodes/fig/E10/zones_rasterized_xarray.png