Docs indicators and conventions (#379)
* Updates to hazard indicator docs

Signed-off-by: Joe Moorhouse <[email protected]>

* Edit of hazard indicator documentation.

Signed-off-by: Joe Moorhouse <[email protected]>

* Update hazard indicator list

Signed-off-by: Joe Moorhouse <[email protected]>

---------

Signed-off-by: Joe Moorhouse <[email protected]>
joemoorhouse authored Jan 6, 2025
1 parent 638bc9f commit 2c354bf
Showing 3 changed files with 127 additions and 36 deletions.
135 changes: 106 additions & 29 deletions docs/user_guide/hazard_indicators/indicators.ipynb
@@ -4,52 +4,129 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hazard indicators and conventions\n",
"# Hazard indicator conventions\n",
"\n",
"Note that the xarray library already defines a convention for storing DataArrays as Zarr arrays.\n",
"https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html\n",
"In this convention, the \n",
"## Introduction\n",
"As discussed briefly in the introduction to Physrisk, a hazard indicator is a measure that quantifies a hazard. Indicators are typically functions of location (e.g. latitude/longitude), climate scenario (e.g. SSP585) and time (e.g. 2080 projection); that is, hazard indicators are typically available for the current climate and for future climate under different possible pathways — or at least this is true of the data sets most useful for assessing the impact of climate change. The hazard indicators of particular interest are those used by vulnerability models. In Physrisk, ```VulnerabilityModel``` instances request hazard indicators from ```HazardModel``` instances. A ```HazardModel``` must implement a ```get_hazard_data``` method. In their implementation, hazard models may source indicator data via API calls — typically where a commercial API is used, but also as a way to improve performance — or by looking up data stored in raster or shapefile format or indeed looking up data via a database key (e.g. indexing based on Quadbin, H3 or similar). Hazard indicator data, both public domain and commercial, is probably most commonly stored/distributed in raster format. Efficient access of raster data is therefore an important topic.\n",
"\n",
"https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Indexed-coordinate"
"## Raster data\n",
"Raster data comprises a matrix of cells, or pixels, organized into a grid of rows and columns where each cell contains a value representing information such temperature, precipitation, wind speed etc. More generally each pixel can itself represent an array of values and the structure becomes a multi-dimensional array. Raster data can arise: \n",
"- from the data products of Earth Observing (EO) instruments with pixel-based sensor arrays, or \n",
"- from models that discretize space into pixels (e.g. finite element models), or \n",
"- from data derived from other rasterized data. \n",
"\n",
"Hazard indicators commonly make use of EO datasets and the outputs of climate models, hence the prevalence of raster data. It is important also to note that the pixels are typically square (in some coordinate reference system), as opposed to hexagonal etc: if the intent is to make data available without re-interpolation, handing such square pixels is important.\n",
"Raster data can be large. For example the circumference of the Earth is 40,075 km, so a global raster with 1 km resolution at the Equator may already have something like 40,000 × 20,000 pixels — and note that some flood model data sets are 5 m or even 1 m resolution. For this reason, raster data is often chunked and compressed. 'Chunking' means that the raster grid is split into smaller grids and these smaller grids are compressed. This is efficient if the intent is to access one specific region without having to read and decompress the entire data set. The problem of dealing with raster data is often then one of dealing with *chunked, compressed, N-dimensional arrays*.\n",
"\n",
"In Physrisk, hazards indicator raster data is stored using the Zarr format (see https://zarr.readthedocs.io/en/stable/). Cloud Optimized GeoTIFF (https://cogeo.org/) is another format with much the same facility for dealing with chunked, compressed N-dimensional data and indeed [there are other choices](https://guide.cloudnativegeo.org/zarr/intro.html). Zarr was ultimately chosen for its ability to store natively hierarchical multidimensional array data in a cloud-optimized way and for the flexibility of its [back-ends stores](https://zarr.readthedocs.io/en/stable/api/storage.html). Notably, Zarr chunks are stored in separate files or objects which means that Zarr arrays can be written in parallel. \n",
"\n",
"In the next sections, we discuss the conventions for storing raster hazard indicator data.\n",
"\n",
"## Overview of structure\n",
"Note: in the following, the terminology is that of Xarray data structure documentation; see https://docs.xarray.dev/en/latest/user-guide/data-structures.html.\n",
"\n",
"Hazard indicators are stored in three-dimensional arrays. The array dimensions are $(z, y, x)$ where $y$ is the spatial $y$ coordinate ('latitude' or 'y'), $x$ is the spatial $x$ coordinate ('longitude' or 'x') and $z$ is an *index* coordinate. This index coordinate takes on different meanings according to the type of data being stored, for example: \n",
"\n",
"- Return periods, e.g. in the case of acute hazards. For example the coordinate labels might be 5, 50, 100, 200 years, the indicator value being be maximum sustained wind speed for that return period.\n",
"- Threshold values, e.g. in the case of chronic hazards. For example the coordinate labels might be 20°, 30°, 40°, 50°, the indicator value being the mean number of days with maximum daily temperature greater than the threshold.\n",
"\n",
"The labels of the index coordinate are given by the 'index_values' attribute of the hazard indicator array; and the name of the coordinate by 'index_name'. The name 'index' derives from the XArray definition of index: a data structure used to extract data using coordinate labels instead of integer array indices because the index coordinate labels the set of two dimensional layers that make up the hazard indicator array.\n",
"\n",
"For example, in the first case above ```index_values``` is given by the array ```[1, 50, 100, 200]```. and ```index_name``` is ```\"return period (years)\"```.\n",
"\n",
"The spatial coordinates are defined by a: \n",
"- Coordinate reference system \n",
"- Affine transform (see, e.g. https://affine.readthedocs.io/en/latest/#affine) \n",
"\n",
"The Zarr default is to use the 'C' ordering convention, i.e. the value of array ```data``` at ```data[k, j, i]``` is at address $i + N_i \\times j + N_i \\times N_j \\times k$. That is, value [i, j, k] and [i, k, k + 1] are adjacent in memory. This tends to give the best compression ratios in various important cases, for example flood depths at different return periods for a region with zero flood depth below a certain threshold — large regions of the chunk will be zero.\n",
"\n",
"The coordinates of the array are specified by the set of attributes:\n",
"\n",
"- ```crs```: the coordinate reference system (e.g. ```\"EPSG:4326\"```).\n",
"- ```transform_mat3x3```: an array ```[a, b, c, d, e, f]``` specifying the affine transform, following the conventions of the [Affine library](https://affine.readthedocs.io/en/latest/#affine). This transforms points specified in the coordinate reference system to points in pixel space.\n",
"- ```index_name```: name of the index coordinate: the non-spatial dimension.\n",
"- ```index_values```: the labels corresponding to each of the integer indices of the non-spatial dimension.\n",
"\n",
"\n",
"## Relationship with Xarray\n",
"XArray ```xarray.DataArray``` defines a property ```coords```: a dict-like container of arrays (coordinates) that label each point (e.g., one-dimensional arrays of numbers, datetime objects or strings). That is, the coordinates of the three dimensions are given by one-dimensional arrays. This means that to create an ```xrray.DataArray```, one must create these arrays from the Zarr attributes. This is currently done in the ```hazard``` repo, although the functionality is sufficiently useful that it will be added to Physrisk also.\n",
"\n",
"See https://github.com/os-climate/hazard/blob/main/src/hazard/utilities/xarray_utilities.py\n",
"\n",
"\n",
"### Why not use the XArray Zarr convention?\n",
"XArray provides [native support for writing to Zarr](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) and it is always desirable to follow existing conventions! However, the ```xrray.DataArray``` is [written as a Zarr group with the coordinates as separate Zarr arrays](https://docs.xarray.dev/en/v2024.06.0/user-guide/io.html). That is the co-ordinates are expanded in full and must be read prior to indexing the Zarr array. Moreover, if an affine transform is given, the indices of a pixel can be calculated directly from this transform: it is not necessary to perform a look-up from the coordinate arrays (presumably by bisection). In principle, it is possible to support both schemes, but arguably more confusing. In any case, at time of writing we keep with the convention that a hazard indicator array is a plain Zarr array (i.e. not a group), with the additional attributes above, but we try to make it straight forward to convert from one format to another because there is certainly a clarity in having the coordinates made explicitly available. \n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"import s3fs\n",
"\n",
"load_dotenv(\"../../../credentials.env\")\n",
"\n",
"bucket = os.environ.get(\"OSC_S3_BUCKET\")\n",
"s3 = s3fs.S3FileSystem(\n",
" key=os.environ.get(\"OSC_S3_ACCESS_KEY\", None),\n",
" secret=os.environ.get(\"OSC_S3_SECRET_KEY\", None),\n",
")"
"## An aside on 'C'-like ordering\n",
"[Numpy arrays can help to clarify the point]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['physrisk-hazard-indicators/hazard', 'physrisk-hazard-indicators/hazard_test']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"Numpy also has 'C' ordering by default\n",
"[[[ 0 1 2 3]\n",
" [ 4 5 6 7]\n",
" [ 8 9 10 11]]\n",
"\n",
" [[12 13 14 15]\n",
" [16 17 18 19]\n",
" [20 21 22 23]]]\n",
"Element [0, 2, 3] = 11 is next to element [0, 2, 2] = 10 in memory\n"
]
}
],
"source": [
"s3.ls(bucket)"
"import numpy as np\n",
"\n",
"print(f\"Numpy also has 'C' ordering by default\") \n",
"a = np.arange(24)\n",
"b = a.reshape([2, 3, 4])\n",
"print(b)\n",
"print(f\"Element [0, 2, 3] = {b[0, 2, 3]} is next to element [0, 2, 2] = {b[0, 2, 2]} in memory\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performance\n",
"In physical climate risk calculations a common use case is that we have a portfolio of assets and we want to retrieve hazard indicator values for each asset location, perhaps with some buffer or other shape around the location, but perhaps treating assets as point-like. How well do chunked, compressed arrays do in such cases? First of all, the performance depends on two important factors: \n",
"- how the data is chunked, and \n",
"- how the data is stored. \n",
"In the [Physrisk reference application](https://physrisk-ui-physrisk.apps.odh-cl2.apps.os-climate.org/), data is stored in Amazon S3 using the ```S3Map``` as a store (see https://zarr.readthedocs.io/en/stable/api/storage.html). We generally find this to be good enough for typical needs. To expand on 'typical', a large portfolio example might have somewhere in the range 100,000—1,000,000 assets, and it is desirable to retrieve the data in order 10s of seconds. But what are the performance bottlenecks here?\n",
"\n",
"As an aside, it should be emphasized that there are other use-cases to consider, in particular performing operations on the entire data set, perhaps using [Dask](https://www.dask.org/), rather than looking up points. There are also considerations other than performance, in particular the convenience and cost-effectiveness of S3. But to come back to the performance trade-off in more detail, consider what is happening when a request for 1 million assets is made:\n",
"\n",
"1) Using the affine transform the required pixels are identified together with the Zarr chunks containing the pixels. This is typically fast. \n",
"2) The compressed chunks are downloaded from S3. Thanks to the async functionality of ```S3FileSystem```, the chunks are downloaded in parallel, but this can still be a time-consuming step. Chunk size affects the performance. If the chunks are too large then too much unnecessary data is transferred (for this specific use-case); too small and compression ratios can decrease, [overall transfer speed may decrease](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html) and the number of chunks becomes large (more on this below). By default Physrisk uses, 1000×1000 spatial pixels in each chunk.\n",
"An important point to note is that for physical risk calculations usually all values along the 'index' dimension are required. Therefore these are always in the same chunk. \n",
"Also, note that this step depends on the connection speed of the compute instance (e.g. Pod) to the S3 storage. For example, when connecting to S3 from a laptop using a home internet connection halfway across the world transfer times will be slow compared to to a Pod hosted on EC2 in the same AWS Region (as for the Physrisk reference application). There is therefore a benefit in using the Physrisk API to access hazard data rather than getting this directly from S3. \n",
"3) The chunks are decompressed and the specific pixel values are looked up. The decompression can also be somewhat costly in time. \n",
"\n",
"If this default set up does not meet the performance needs, what can be done?\n",
"\n",
"### Choice of store\n",
"Aside from adjusting chunk size, performance can be improved by moving from S3 storage to some form of SSD-backed storage, using a different Zarr store, and thereby decreasing the time for step 2 above. Most simply, ```DirectoryStore``` could be used, but another common approach is to use a database to hold Zarr chunks binary data. Zarr stores exist for a range of databases like LMDB, SQLite and MongoDB, plus distributed in-memory stores such as Redis. As an aside, it is interesting to note that [TileDB](https://docs.tiledb.com/geospatial) is another example of a database that indexes compressed, chunked multi-dimensional raster data.\n",
"\n",
"### Move away from chunked, compressed data entirely\n",
"The Zarr storage of compressed, chunked data works well across a number of use-cases, but if the intent is solely to look up data from a spatial index as quickly as possible, then a database keyed on a spatial index might be preferred, albeit at the expense of other features. Members of the Physrisk project have been investigating this using H3-based indexing and databases such as [DuckDB](https://duckdb.org/) with promising results.\n",
"\n",
"## When a single file is desirable\n",
"It can be inefficient to store a large number of small chunks as separate files or objects when using cloud storage such as S3, and indeed typical file systems. That is, it can be preferable to store objects at more coarse granularity than chunks. Cloud Optimized GeoTIFFs take this to an extreme, in that a single object is chunked. However, while this can be read in parallel, it cannot be written in parallel in S3. Zarr similarly has a ```ZipStore``` with the same restriction.\n",
"\n",
"The Zarr [Sharding Codec](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html), in development, might ultimately be the best solution. For now in Physrisk, should number of chunk objects become an issue, use of ```ZipStore``` is preferred at the expense of writing chunks in parallel. It is most likely that this comes up in cases where Zarr arrays are used for map tiles, a subject we will come on to.\n"
]
}
],
