Merge pull request #86 from esciencecenter-digital-skills/update-work…

…bench-intro Update episodes 1-4
esciencecenter-digital-skills · Jul 26, 2023 · bf51de3 · bf51de3
2 parents d7f1c63 + 6df732c
commit bf51de3
Show file tree

Hide file tree

Showing 14 changed files with 764 additions and 0 deletions.
diff --git a/episodes/01-intro-raster-data.md b/episodes/01-intro-raster-data.md
@@ -0,0 +1,212 @@
+---
+title: "Introduction to Raster Data"
+teaching: 15
+exercises: 5
+---
+
+:::questions
+- What format should I use to represent my data?
+- What are the main data types used for representing geospatial data?
+- What are the main attributes of raster data?
+:::
+
+:::objectives
+- Describe the difference between raster and vector data.
+- Describe the strengths and weaknesses of storing data in raster format.
+- Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.
+:::
+
+## Introduction
+
+This episode introduces the two primary types of geospatial
+data: rasters and vectors. After briefly introducing these
+data types, this episode focuses on raster data, describing
+some major features and types of raster data.
+
+## Data Structures: Raster and Vector
+
+The two primary types of geospatial data are raster
+and vector data. Raster data is stored as a grid of values which are rendered on a
+map as pixels. Each pixel value represents an area on the Earth's surface. Vector data structures represent specific features on the
+Earth's surface, and
+assign attributes to those features. Vector data structures
+will be discussed in more detail in [the next episode](02-intro-vector-data.md).
+
+This workshop will focus on how to work with both raster and vector
+data sets, therefore it is essential that we understand the
+basic structures of these types of data and the types of data
+that they can be used to represent.
+
+### About Raster Data
+
+Raster data is any pixelated (or gridded) data where each pixel is associated
+with a specific geographic location. The value of a pixel can be
+continuous (e.g. elevation) or categorical (e.g. land use). If this sounds
+familiar, it is because this data structure is very common: it's how
+we represent any digital image. A geospatial raster is only different
+from a digital photo in that it is accompanied by spatial information
+that connects the data to a particular location. This includes the
+raster's extent and cell size, the number of rows and columns, and
+its coordinate reference system (or CRS).
+
+![Raster Concept (Source: National Ecological Observatory Network (NEON))](fig/E01-01-raster_concept.png){alt="raster concept"}
+
+Some examples of continuous rasters include:
+
+1. Precipitation maps.
+2. Maps of tree height derived from LiDAR data.
+3. Elevation values for a region.
+
+A map of elevation for Harvard Forest derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
+is below. Elevation is represented as a continuous numeric variable in this map. The legend
+shows the continuous range of values in the data from around 300 to 420 meters.
+
+![Continuous Elevation Map: HARV Field Site](fig/E01-02-continuous-elevation-HARV-plot-01.png){alt="elevation Harvard forest"}
+
+Some rasters contain categorical data where each pixel represents a discrete
+class such as a landcover type (e.g., "forest" or "grassland") rather than a
+continuous value such as elevation or temperature. Some examples of classified
+maps include:
+
+1. Landcover / land-use maps.
+2. Tree height maps classified as short, medium, and tall trees.
+3. Elevation maps classified as low, medium, and high elevation.
+
+![USA landcover classification](fig/E01-03-USA_landcover_classification.png){alt="USA landcover classification"}
+
+The map above shows the contiguous United States with landcover as categorical
+data. Each color is a different landcover category. (Source: Homer, C.G., et
+al., 2015, Completion of the 2011 National Land Cover Database for the
+conterminous United States-Representing a decade of land cover change
+information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p.
+345-354)
+
+:::challenge
+## Advantages and Disadvantages
+
+With your neighbor, brainstorm potential advantages and
+disadvantages of storing data in raster format. Add your
+ideas to the Etherpad. The Instructor will discuss and
+add any points that weren't brought up in the small group
+discussions.
+
+::::solution
+## Solution
+
+Raster data has some important advantages:
+
+* representation of continuous surfaces
+* potentially very high levels of detail
+* data is 'unweighted' across its extent - the geometry doesn't
+implicitly highlight features
+* cell-by-cell calculations can be very fast and efficient
+
+The downsides of raster data are:
+
+* very large file sizes as cell size gets smaller
+* currently popular formats don't embed metadata well (more on this later!)
+* can be difficult to represent complex information
+::::
+:::
+
+### Important Attributes of Raster Data
+
+#### Extent
+
+The spatial extent is the geographic area that the raster data covers.
+The spatial extent of an object represents the geographic edge or
+location that is the furthest north, south, east and west. In other words, extent
+represents the overall geographic coverage of the spatial object.
+
+![Spatial extent image (Image Source: National Ecological Observatory Network (NEON))](fig/E01-04-spatial_extent.png){alt="spatial extent objects"}
+
+:::challenge
+## Extent Challenge
+
+In the image above, the dashed boxes around each set of objects
+seems to imply that the three objects have the same extent. Is this
+accurate? If not, which object(s) have a different extent?
+
+::::solution
+## Solution
+
+The lines and polygon objects have the same extent. The extent for
+the points object is smaller in the vertical direction than the
+other two because there are no points on the line at y = 8.
+::::
+:::
+
+#### Resolution
+
+A resolution of a raster represents the area on the ground that each
+pixel of the raster covers. The image below illustrates the effect
+of changes in resolution.
+
+![Resolution image (Source: National Ecological Observatory Network (NEON))](fig/E01-05-raster_resolution.png){alt="resolution image"}
+
+### Raster Data Format for this Workshop
+
+Raster data can come in many different formats. For this workshop, we will use
+the GeoTIFF format which has the extension `.tif`. A `.tif` file stores metadata
+or attributes about the file as embedded `tif tags`. For instance, your camera
+might store a tag that describes the make and model of the camera or the date
+the photo was taken when it saves a `.tif`. A GeoTIFF is a standard `.tif` image
+format with additional spatial (georeferencing) information embedded in the file
+as tags. These tags should include the following raster metadata:
+
+1. Extent
+2. Resolution
+3. Coordinate Reference System (CRS) - we will introduce this concept in [a later episode](03-crs.md)
+4. Values that represent missing data (`NoDataValue`) - we will introduce this
+   concept in [a later episode](06-raster-intro.md).
+
+We will discuss these attributes in more detail in [a later episode](06-raster-intro.md).
+In that episode, we will also learn how to use Python to extract raster attributes
+from a GeoTIFF file.
+
+:::callout
+## More Resources on the  `.tif` format
+
+* [GeoTIFF on Wikipedia](https://en.wikipedia.org/wiki/GeoTIFF)
+* [OSGEO TIFF documentation](https://trac.osgeo.org/geotiff/)
+:::
+
+### Multi-band Raster Data
+
+A raster can contain one or more bands. One type of multi-band raster
+dataset that is familiar to many of us is a color
+image. A basic color image consists of three bands: red, green, and blue.
+Each
+band represents light reflected from the red, green or blue portions of
+the
+electromagnetic spectrum. The pixel brightness for each band, when
+composited
+creates the colors that we see in an image.
+
+![RGB multi-band raster image (Source: National Ecological Observatory Network (NEON).)](fig/E01-06-RGBSTack_1.jpg){alt="multi-band raster"}
+
+We can plot each band of a multi-band image individually.
+
+Or we can composite all three bands together to make a color image.
+
+In a multi-band dataset, the rasters will always have the same extent,
+resolution, and CRS.
+
+:::callout
+## Other Types of Multi-band Raster Data
+
+Multi-band raster data might also contain:
+1. **Time series:** the same variable, over the same area, over time.
+2. **Multi or hyperspectral imagery:** image rasters that have 4 or
+more (multi-spectral) or more than 10-15 (hyperspectral) bands. We
+won't be working with this type of data in this workshop, but you can
+check out the NEON Data Skills [Imaging Spectroscopy HDF5 in R](https://www.neonscience.org/hsi-hdf5-r)
+tutorial if you're interested in working with hyperspectral data cubes.
+:::
+
+:::keypoints
+- Raster data is pixelated data where each pixel is associated with a specific location.
+- Raster data always has an extent and a resolution.
+- The extent is the geographical area covered by a raster.
+- The resolution is the area covered by each pixel of a raster.
+:::
diff --git a/episodes/02-intro-vector-data.md b/episodes/02-intro-vector-data.md
@@ -0,0 +1,139 @@
+---
+title: "Introduction to Vector Data"
+teaching: 10
+exercises: 5
+---
+
+:::questions
+- What are the main attributes of vector data?
+:::
+
+:::objectives
+- Describe the strengths and weaknesses of storing data in vector format.
+- Describe the three types of vectors and identify types of data that would be stored in each.
+:::
+
+## About Vector Data
+
+Vector data structures represent specific features on the Earth's surface, and
+assign attributes to those features. Vectors are composed of discrete geometric
+locations (x, y values) known as vertices that define the shape of the spatial
+object. The organization of the vertices determines the type of vector that we
+are working with: point, line or polygon.
+
+![Types of vector objects (Image Source: National Ecological Observatory Network (NEON))](fig/E02-01-pnt_line_poly.png){alt="vector data types"}
+
+* **Points:** Each point is defined by a single x, y coordinate. There can be
+many points in a vector point file. Examples of point data include: sampling
+locations, the location of individual trees, or the location of survey plots.
+
+* **Lines:** Lines are composed of many (at least 2) points that are connected.
+For instance, a road or a stream may be represented by a line. This line is
+composed of a series of segments, each "bend" in the road or stream represents a
+vertex that has a defined x, y location.
+
+* **Polygons:** A polygon consists of 3 or more vertices that are connected and
+closed. The outlines of survey plot boundaries, lakes, oceans, and states or
+countries are often represented by polygons.
+
+:::callout
+## Data Tip
+
+Sometimes, boundary layers such as states and countries, are stored as lines
+rather than polygons. However, these boundaries, when represented as a line,
+will not create a closed object with a defined area that can be filled.
+:::
+
+:::challenge
+## Identify Vector Types
+
+The plot below includes examples of two of the three types of vector
+objects. Use the definitions above to identify which features
+are represented by which vector type.
+
+![Vector Type Examples](fig/E02-02-vector_types_examples.png){alt="vector type examples"}
+
+::::solution
+## Solution
+
+State boundaries are polygons. The Fisher Tower location is
+a point. There are no line features shown.
+::::
+:::
+
+Vector data has some important advantages:
+
+* The geometry itself contains information about what the dataset creator thought was important
+* The geometry structures hold information in themselves - why choose point over polygon, for instance?
+* Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
+* Data storage can be very efficient compared to rasters
+
+The downsides of vector data include:
+
+* Potential loss of detail compared to raster
+* Potential bias in datasets - what didn't get recorded?
+* Calculations involving multiple vector layers need to do math on the
+  geometry as well as the attributes, so can be slow compared to raster math.
+
+Vector datasets are in use in many industries besides geospatial fields. For
+instance, computer graphics are largely vector-based, although the data
+structures in use tend to join points using arcs and complex curves rather than
+straight lines. Computer-aided design (CAD) is also vector- based. The
+difference is that geospatial datasets are accompanied by information tying
+their features to real-world locations.
+
+## Vector Data Format for this Workshop
+
+Like raster data, vector data can also come in many different formats. For this
+workshop, we will use the Shapefile format. A Shapefile format consists of multiple
+files in the same directory, of which `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are `.prj` and `shp.xml` files.
+
+- The `.shp` file stores the feature geometry itself
+- `.shx` is a positional index of the feature geometry to allow quickly searching forwards and backwards the geographic coordinates of each vertex in the vector
+- `.dbf` contains the tabular attributes for each shape.
+- `.prj` file indicates the Coordinate reference system (CRS)
+- `.shp.xml` contains the Shapefile metadata.
+
+Together, the Shapefile includes the following information:
+
+* **Extent** - the spatial extent of the shapefile (i.e. geographic area that
+the shapefile covers). The spatial extent for a shapefile represents the
+combined extent for all spatial objects in the shapefile.
+* **Object type** - whether the shapefile includes points, lines, or polygons.
+* **Coordinate reference system (CRS)**
+* **Other attributes** - for example, a line shapefile that contains the
+locations of streams, might contain the name of each stream.
+
+Because the structure of points, lines, and polygons are different, each
+individual shapefile can only contain one vector type (all points, all lines
+or all polygons). You will not find a mixture of point, line and polygon
+objects in a single shapefile.
+
+:::callout
+## More Resources on Shapefiles
+
+More about shapefiles can be found on
+[Wikipedia.](https://en.wikipedia.org/wiki/Shapefile) Shapefiles are often publicly
+available from government services, such as [this page from the US Census Bureau][us-cb] or
+[this one from Australia's Data.gov.au website](https://data.gov.au/data/dataset?res_format=SHP).
+:::
+
+:::callout
+## Why not both?
+
+Very few formats can contain both raster and vector data - in fact, most are
+even more restrictive than that. Vector datasets are usually locked to one
+geometry type, e.g. points only. Raster datasets can usually only encode one
+data type, for example you can't have a multiband GeoTIFF where one layer is
+integer data and another is floating-point. There are sound reasons for this -
+format standards are easier to define and maintain, and so is metadata. The
+effects of particular data manipulations are more predictable if you are
+confident that all of your input data has the same characteristics.
+:::
+
+[us-cb]: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
+
+:::keypoints
+- Vector data structures represent specific features on the Earth's surface along with attributes of those features.
+- Vector objects are either points, lines, or polygons.
+:::