Merge pull request #121 from esciencecenter-digital-skills/draft_3004…

…2024 Updating the material to the new narrative
esciencecenter-digital-skills · Jun 25, 2024 · 1b6d24d · 1b6d24d
2 parents f259250 + 5f7e9bd
commit 1b6d24d

Showing 81 changed files with 1,412 additions and 1,223 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,6 @@
+# notebooks update narrative NleSc
+notebooks/
+
 # sandpaper files
 episodes/*html
 site/*

diff --git a/episodes/01-intro-raster-data.md b/episodes/01-intro-raster-data.md
@@ -15,49 +15,32 @@ exercises: 5
 - Describe the strengths and weaknesses of storing data in raster format.
 - Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.
 :::
+
 
 ## Introduction
 
-This episode introduces the two primary types of geospatial
-data: rasters and vectors. After briefly introducing these
-data types, this episode focuses on raster data, describing
-some major features and types of raster data.
+This episode introduces the two primary data models that are used to digitally represent the earth's surface: raster and vector. After briefly introducing these data models, this episode focuses on the raster representation, describing some major features and types of raster data. This workshop will focus on how to work with both raster and vector data sets; it is therefore essential that we understand the basic structures of these types of data and the types of phenomena that they can represent.
 
 ## Data Structures: Raster and Vector
 
-The two primary types of geospatial data are raster
-and vector data. Raster data is stored as a grid of values which are rendered on a
-map as pixels. Each pixel value represents an area on the Earth's surface. Vector data structures represent specific features on the
-Earth's surface, and
-assign attributes to those features. Vector data structures
-will be discussed in more detail in [the next episode](02-intro-vector-data.md).
-
-This workshop will focus on how to work with both raster and vector
-data sets, therefore it is essential that we understand the
-basic structures of these types of data and the types of data
-that they can be used to represent.
-
-### About Raster Data
-
-Raster data is any pixelated (or gridded) data where each pixel is associated
-with a specific geographic location. The value of a pixel can be
-continuous (e.g. elevation) or categorical (e.g. land use). If this sounds
-familiar, it is because this data structure is very common: it's how
-we represent any digital image. A geospatial raster is only different
-from a digital photo in that it is accompanied by spatial information
-that connects the data to a particular location. This includes the
-raster's extent and cell size, the number of rows and columns, and
-its coordinate reference system (or CRS).
+The two primary data models used to represent the earth's surface digitally are raster and vector. **Raster data** is stored as a grid of values that are rendered on a map as pixels (also known as cells), where each pixel holds a value for an area of the earth's surface. Examples of raster data are satellite images and aerial photographs. Data stored according to the **vector data** model are represented by points, lines, or polygons. Examples of vector representation are points of interest, buildings (often represented as building footprints), and roads.
+
+Representing phenomena as vector data allows you to add attribute information to them. For instance, a polygon of a house can carry multiple attributes describing its address, such as the street name, house number, zip code, and city. Vector data is discussed in more detail in the [next episode](02-intro-vector-data.md).
+
+When working with spatial information, you will find that many phenomena can be represented as either vector or raster data. A house, for instance, can be represented by a set of raster cells that all have the same value, or by a vector polygon that carries attribute information (figure 1). Which data model is used depends on the purpose for which the data is collected and intended to be used. As a rule of thumb, discrete phenomena like buildings, roads, trees, and signs are represented as vector data, whereas continuous phenomena like temperature, wind speed, and elevation are represented as raster data. Still, a spatial data analyst often has to transform data from vector to raster or the other way around. Keep in mind that such conversions can degrade data quality.
+
+### Raster Data
+
+Raster data is any pixelated (or gridded) data where each pixel has a value and is associated with a specific geographic location. The value of a pixel can be continuous (e.g., elevation, temperature) or categorical (e.g., land-use type). If this sounds familiar, it is because this data structure is very common: it's how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster's extent and cell size, the number of rows and columns, and its Coordinate Reference System (CRS), which will be explained in [episode 3](03-crs.md) of this workshop.
 
 ![Raster Concept (Source: National Ecological Observatory Network (NEON))](fig/E01/raster_concept.png){alt="raster concept"}
 
 Some examples of continuous rasters include:
 
 1. Precipitation maps.
-2. Maps of tree height derived from LiDAR data.
-3. Elevation values for a region.
+2. Elevation maps.
 
-A map of elevation for Harvard Forest derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
+A map of elevation for *Harvard Forest* derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
 is below. Elevation is represented as a continuous numeric variable in this map. The legend
 shows the continuous range of values in the data from around 300 to 420 meters.
 
@@ -69,8 +52,7 @@ continuous value such as elevation or temperature. Some examples of classified
 maps include:
 
 1. Landcover / land-use maps.
-2. Tree height maps classified as short, medium, and tall trees.
-3. Elevation maps classified as low, medium, and high elevation.
+2. Elevation maps classified as low, medium, and high elevation.
 
 ![USA landcover classification](fig/E01/USA_landcover_classification.png){alt="USA landcover classification"}
 
@@ -147,12 +129,12 @@ of changes in resolution.
 ### Raster Data Format for this Workshop
 
 Raster data can come in many different formats. For this workshop, we will use
-the GeoTIFF format which has the extension `.tif`. A `.tif` file stores metadata
-or attributes about the file as embedded `tif tags`. For instance, your camera
+the GeoTIFF format, which has the extension `.tif`, since it is one of the most commonly used formats.
+A `.tif` file stores metadata or attributes about the file as embedded `tif tags`. For instance, your camera
 might store a tag that describes the make and model of the camera or the date
 the photo was taken when it saves a `.tif`. A GeoTIFF is a standard `.tif` image
 format with additional spatial (georeferencing) information embedded in the file
-as tags. These tags should include the following raster metadata:
+as tags. These tags include the following raster metadata:
 
 1. Extent
 2. Resolution
@@ -174,14 +156,11 @@ from a GeoTIFF file.
 ### Multi-band Raster Data
 
 A raster can contain one or more bands. One type of multi-band raster
-dataset that is familiar to many of us is a color
-image. A basic color image consists of three bands: red, green, and blue.
-Each
+dataset that is familiar to many of us is a color image. A basic color 
+image often consists of three bands: red, green, and blue (RGB). Each
 band represents light reflected from the red, green or blue portions of
-the
-electromagnetic spectrum. The pixel brightness for each band, when
-composited
-creates the colors that we see in an image.
+the electromagnetic spectrum. The pixel brightness for each band, when
+composited, creates the colors that we see in an image.
 
 ![RGB multi-band raster image (Source: National Ecological Observatory Network (NEON).)](fig/E01/RGBSTack_1.jpg){alt="multi-band raster"}
 
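The raster description in `episodes/01-intro-raster-data.md` (a grid of cell values plus an extent, a cell size, and a number of rows and columns) can be sketched in plain Python. This is only an illustrative sketch: the `TinyRaster` class, its field names, and the elevation values are hypothetical and not part of the lesson, and a real workflow would read this information from a GeoTIFF with a geospatial library.

```python
from dataclasses import dataclass


@dataclass
class TinyRaster:
    """Hypothetical minimal raster: a grid of values plus georeferencing."""
    values: list       # rows of cell values, top row first
    x_min: float       # x coordinate of the left edge of the grid
    y_max: float       # y coordinate of the top edge of the grid
    cell_size: float   # width and height of one square cell

    @property
    def shape(self):
        """Number of rows and columns."""
        return len(self.values), len(self.values[0])

    @property
    def extent(self):
        """Bounding box as (x_min, y_min, x_max, y_max)."""
        rows, cols = self.shape
        x_max = self.x_min + cols * self.cell_size
        y_min = self.y_max - rows * self.cell_size
        return (self.x_min, y_min, x_max, self.y_max)

    def cell_center(self, row, col):
        """Coordinates of the center of the cell at (row, col)."""
        x = self.x_min + (col + 0.5) * self.cell_size
        y = self.y_max - (row + 0.5) * self.cell_size
        return x, y

    def value_at(self, x, y):
        """Value of the cell that contains the coordinate (x, y)."""
        col = int((x - self.x_min) // self.cell_size)
        row = int((self.y_max - y) // self.cell_size)
        rows, cols = self.shape
        if not (0 <= row < rows and 0 <= col < cols):
            raise ValueError("coordinate lies outside the raster extent")
        return self.values[row][col]


# Made-up continuous raster (elevation in meters), 2 rows by 3 columns.
elevation = TinyRaster(
    values=[[310.0, 325.0, 340.0],
            [305.0, 330.0, 360.0]],
    x_min=1000.0,
    y_max=2020.0,
    cell_size=10.0,
)

print(elevation.shape)
print(elevation.extent)
print(elevation.value_at(1025.0, 2003.0))
```

The grid alone is just a digital image; it is the three georeferencing fields that tie each cell to a location, which is the distinction the episode draws between a photo and a geospatial raster.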

diff --git a/episodes/02-intro-vector-data.md b/episodes/02-intro-vector-data.md
@@ -34,7 +34,9 @@ vertex that has a defined x, y location.
 
 * **Polygons:** A polygon consists of 3 or more vertices that are connected and
 closed. The outlines of survey plot boundaries, lakes, oceans, and states or
-countries are often represented by polygons.
+countries are often represented by polygons. Note that polygons can also contain one
+or more holes, for instance a plot boundary with a lake in it. These polygons are
+considered *complex* or *donut* polygons.
 
 :::callout
 ## Data Tip
@@ -56,8 +58,11 @@ are represented by which vector type.
 ::::solution
 ## Solution
 
-State boundaries are polygons. The Fisher Tower location is
-a point. There are no line features shown.
+State boundaries are shown as polygons. The Fisher Tower location is
+represented by a purple point. There are no line features shown. 
+Note that at a different scale the Fisher Tower could also have been represented as a polygon.
+Keep in mind that the purpose for which a dataset is created and intended to be used determines
+which vector type it uses.
 ::::
 :::
 
@@ -66,14 +71,14 @@ Vector data has some important advantages:
 * The geometry itself contains information about what the dataset creator thought was important
 * The geometry structures hold information in themselves - why choose point over polygon, for instance?
 * Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
-* Data storage can be very efficient compared to rasters
+* Data storage can, depending on the scale, be very efficient compared to rasters
+* When working with network analysis, for instance to calculate the shortest route between A and B, topologically correct lines are essential. This is not possible with raster data.
 
 The downsides of vector data include:
 
-* Potential loss of detail compared to raster
-* Potential bias in datasets - what didn't get recorded?
+* Potential bias in datasets - what didn't get recorded? Often vector data are interpreted datasets like topographical maps and have been collected by someone else, for another purpose.
 * Calculations involving multiple vector layers need to do math on the
-  geometry as well as the attributes, so can be slow compared to raster math.
+  geometry as well as the attributes, which can potentially be slow compared to raster calculations.
 
 Vector datasets are in use in many industries besides geospatial fields. For
 instance, computer graphics are largely vector-based, although the data
@@ -85,8 +90,9 @@ their features to real-world locations.
 ## Vector Data Format for this Workshop
 
 Like raster data, vector data can also come in many different formats. For this
-workshop, we will use the Shapefile format. A Shapefile format consists of multiple
-files in the same directory, of which `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are `.prj` and `shp.xml` files.
+workshop, we will use the GeoPackage format. GeoPackage is developed by the [Open Geospatial Consortium](https://www.ogc.org/) and *is an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information* (source: [https://www.geopackage.org/](https://www.geopackage.org/)). A GeoPackage file, with the extension `.gpkg`, is a single file that contains the geometries of features, their attributes, and information about the CRS used.
+
+Another vector format that you will probably come across quite often is the Shapefile. Although we will not be using that format in this workshop, it is useful to understand how it works. A Shapefile consists of multiple files in the same directory, of which `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are `.prj` and `shp.xml` files.
 
 - The `.shp` file stores the feature geometry itself
 - `.shx` is a positional index of the feature geometry to allow quickly searching forwards and backwards the geographic coordinates of each vertex in the vector
@@ -114,8 +120,8 @@ objects in a single shapefile.
 
 More about shapefiles can be found on
 [Wikipedia.](https://en.wikipedia.org/wiki/Shapefile) Shapefiles are often publicly
-available from government services, such as [this page from the US Census Bureau][us-cb] or
-[this one from Australia's Data.gov.au website](https://data.gov.au/data/dataset?res_format=SHP).
+available from government services, such as [this page containing all administrative boundaries for countries in the world](https://gadm.org/download_country.html) or
+[topographical vector data from OpenStreetMap](https://download.geofabrik.de/).
 :::
 
 :::callout
@@ -131,9 +137,9 @@ effects of particular data manipulations are more predictable if you are
 confident that all of your input data has the same characteristics.
 :::
 
-[us-cb]: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
 
 :::keypoints
 - Vector data structures represent specific features on the Earth's surface along with attributes of those features.
+- Vector data is often interpreted data that was collected for a different purpose than the one you want to use it for.
 - Vector objects are either points, lines, or polygons.
 :::
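Two points from `episodes/02-intro-vector-data.md` lend themselves to a small sketch: a feature couples a geometry with several attributes, and a polygon can contain holes (a *donut* polygon). The example below is a minimal illustration in plain Python, assuming planar coordinates. The dictionary layout, the function names, and the plot data are hypothetical; a real analysis would use a vector library rather than hand-written geometry code.

```python
def ring_area(ring):
    """Unsigned area of a closed ring of (x, y) vertices (shoelace formula)."""
    total = 0.0
    # Pair every vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_area(exterior, holes=()):
    """Area of the exterior ring minus the area of every hole."""
    return ring_area(exterior) - sum(ring_area(hole) for hole in holes)


# Hypothetical feature: a square survey plot with a small lake in it.
plot = {
    "geometry": {
        "exterior": [(0, 0), (10, 0), (10, 10), (0, 10)],
        "holes": [[(4, 4), (6, 4), (6, 6), (4, 6)]],
    },
    # Unlike a raster cell, one feature can carry many attributes.
    "attributes": {"name": "survey plot A", "land_use": "forest"},
}

area = polygon_area(plot["geometry"]["exterior"], plot["geometry"]["holes"])
print(plot["attributes"]["name"], area)
```

The lake is subtracted from the plot, so the feature covers 96 square units rather than 100.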
diff --git a/episodes/03-crs.md b/episodes/03-crs.md
@@ -27,22 +27,22 @@ CRS (coordinate reference system) and SRS (spatial reference system) are synonym
 will use only CRS throughout this workshop.
 :::
 
-The CRS associated with a dataset tells your mapping software (for example Python)
+The CRS associated with a dataset tells your mapping software
 where the raster is located in geographic space. It also tells the mapping
 software what method should be used to flatten or project the raster in
 geographic space.
 
-![Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}
-
-The above image shows maps of the United States in different projections. Notice
+The image below (figure 3.1) shows maps of the United States in different projections. Notice
 the differences in shape associated with each projection. These differences are
 a direct result of the calculations used to flatten the data onto a
 2-dimensional map.
 
+![Figure 3.1: Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}
+
 There are lots of great resources that describe coordinate reference systems and
 projections in greater detail. For the purposes of this workshop, what is
 important to understand is that data from the same location but saved in
-different projections will not line up in any GIS or other program. Thus, it's
+different projections will not line up. Thus, it is
 important when working with spatial data to identify the coordinate reference
 system applied to the data and retain it throughout data processing and
 analysis.
@@ -55,14 +55,14 @@ CRS information has three components:
 degrees) and defines the starting point (i.e. where is [0,0]?) so the angles
 reference a meaningful spot on the earth. Common global datums are WGS84 and
 NAD83. Datums can also be local - fit to a particular area of the globe, but
-ill-fitting outside the area of intended use. In this workshop, we will use the
-[WGS84
-datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84).
+ill-fitting outside the area of intended use. For instance, local cadastre, land registry, and mapping agencies require high-quality datasets, which can be obtained by using a local system. In this workshop, we will use the
+[WGS84 datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84). The datum is often also referred to as the Geographical Coordinate System.
 
 * **Projection:** A mathematical transformation of the angular measurements on a
 round earth to a flat surface (i.e. paper or a computer screen). The units
 associated with a given projection are usually linear (feet, meters, etc.). In
 this workshop, we will see data in two different projections.
+Note that the projection is also often referred to as a Projected Coordinate System.
 
 * **Additional Parameters:** Additional parameters are often necessary to create
 the full coordinate reference system. One common additional parameter is a
@@ -91,6 +91,8 @@ stem of the fruit. What other parameters could be included in this analogy?
 
 ## Which projection should I use?
 
+A well-known projection is the [Mercator projection](https://en.wikipedia.org/wiki/Mercator_projection), introduced by the Flemish cartographer Gerardus Mercator in the 16th century. This is a so-called cylindrical projection, meaning that a virtual cylinder is placed around the globe to flatten it. It is relatively accurate near the equator, but it increasingly enlarges features towards the poles (see [Cylindrical projection](https://gisgeography.com/cylindrical-projection/)). The main advantage of the Mercator projection is that it is very suitable for navigation purposes, since it always shows north as *up* and south as *down*. In the 17th century this projection was essential for sailors navigating the oceans.
+
 To decide if a projection is right for your data, answer these questions:
 
   *  What is the area of minimal distortion?
@@ -212,9 +214,11 @@ generated and maintained manually.
 * [Choosing the Right Map Projection.](https://source.opennews.org/en-US/learning/choosing-right-map-projection/)
 * [Video](https://www.youtube.com/embed/KUF_Ckv8HbE) highlighting how map projections can make continents
 seems proportionally larger or smaller than they actually are.
+* [The True Size](https://www.thetruesize.com/): an intuitive web map that lets you drag countries to another place in the Web Mercator projection.
 :::
 
 :::keypoints
 - All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.
 - A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.
+- All maps are distorted because of the projection.
 :::
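The Mercator paragraph in `episodes/03-crs.md` notes that the projection is relatively accurate near the equator and enlarges features towards the poles. A rough way to see this is to compute the spherical Mercator formulas directly. The sketch below assumes a spherical earth with the WGS84 semi-major axis as its radius; the function names are hypothetical, and in practice a projection library would handle this (including the ellipsoid).

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used here as sphere radius


def mercator_forward(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Project longitude/latitude in degrees to spherical Mercator x, y in meters."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def mercator_scale_factor(lat_deg):
    """Linear stretch of the map at a latitude, relative to the equator."""
    return 1.0 / math.cos(math.radians(lat_deg))


for lat in (0, 30, 60, 80):
    x, y = mercator_forward(0.0, lat)
    print(f"lat {lat:>2}: y = {y:12.0f} m, scale factor = {mercator_scale_factor(lat):.2f}")
```

The scale factor is 1 at the equator and about 2 at 60 degrees latitude, so lengths there are drawn roughly twice as long and areas roughly four times as large, which is the distortion the linked resources visualise.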