
Commit

Merge pull request #121 from esciencecenter-digital-skills/draft_3004…
…2024

Updating the material to the new narrative
Morrizzzzz authored Jun 25, 2024
2 parents f259250 + 5f7e9bd commit 1b6d24d
Showing 81 changed files with 1,412 additions and 1,223 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
# notebooks update narrative NleSc
notebooks/

# sandpaper files
episodes/*html
site/*
Expand Down
63 changes: 21 additions & 42 deletions episodes/01-intro-raster-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,49 +15,32 @@ exercises: 5
- Describe the strengths and weaknesses of storing data in raster format.
- Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.
:::


## Introduction

This episode introduces the two primary types of geospatial
data: rasters and vectors. After briefly introducing these
data types, this episode focuses on raster data, describing
some major features and types of raster data.
This episode introduces the two primary types of data models that are used to digitally represent the earth's surface: raster and vector. After briefly introducing these data models, this episode focuses on the raster representation, describing some major features and types of raster data. This workshop covers how to work with both raster and vector data sets, so it is essential that we understand the basic structures of these types of data and the types of phenomena that they can represent.

## Data Structures: Raster and Vector

The two primary types of geospatial data are raster
and vector data. Raster data is stored as a grid of values which are rendered on a
map as pixels. Each pixel value represents an area on the Earth's surface. Vector data structures represent specific features on the
Earth's surface, and
assign attributes to those features. Vector data structures
will be discussed in more detail in [the next episode](02-intro-vector-data.md).

This workshop will focus on how to work with both raster and vector
data sets, therefore it is essential that we understand the
basic structures of these types of data and the types of data
that they can be used to represent.

### About Raster Data

Raster data is any pixelated (or gridded) data where each pixel is associated
with a specific geographic location. The value of a pixel can be
continuous (e.g. elevation) or categorical (e.g. land use). If this sounds
familiar, it is because this data structure is very common: it's how
we represent any digital image. A geospatial raster is only different
from a digital photo in that it is accompanied by spatial information
that connects the data to a particular location. This includes the
raster's extent and cell size, the number of rows and columns, and
its coordinate reference system (or CRS).
The two primary data models that are used to represent the earth's surface digitally are raster and vector. **Raster data** is stored as a grid of values that are rendered on a map as pixels (also known as cells), where each pixel holds a value for the area of the earth's surface it covers. Examples of raster data are satellite images and aerial photographs. Data stored according to the **vector data** model are represented by points, lines, or polygons. Examples of vector representation are points of interest, buildings (often represented as building footprints), and roads.

Representing phenomena as vector data allows you to add attribute information to them. For instance, a polygon of a house can carry multiple attributes describing its address, such as the street name, house number, zip code, and city. Vector data is discussed in more detail in the [next episode](02-intro-vector-data.md).
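
As a rough illustration of attributes attached to a vector feature, the sketch below models a house as a GeoJSON-style feature using plain Python dictionaries. The coordinates and address values are invented for this example and are not part of the workshop data.

```python
# A house represented as a vector feature: one polygon geometry plus attributes.
# All coordinates and address values below are made up for illustration.
house = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One exterior ring; the first and last vertex are identical to close it.
        "coordinates": [[
            (4.0, 52.0), (4.001, 52.0), (4.001, 52.001), (4.0, 52.001), (4.0, 52.0),
        ]],
    },
    "properties": {
        "street": "Example Street",
        "number": 12,
        "zip_code": "1234 AB",
        "city": "Exampletown",
    },
}

# Attributes are looked up by name, independently of the geometry.
address = f'{house["properties"]["street"]} {house["properties"]["number"]}'
print(address)
```

The geometry and the attributes live side by side in one record, which is what allows a single feature to carry several pieces of information at once.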

When working with spatial information, you will find that many phenomena can be represented as either vector or raster data. A house, for instance, can be represented by a set of raster cells that all have the same value, or by a vector polygon with attribute information (figure 1). Which data model is used depends on the purpose for which the data was collected and is intended to be used. As a rule of thumb, discrete phenomena like buildings, roads, trees, and signs are represented as vector data, whereas continuous phenomena like temperature, wind speed, and elevation are represented as raster data. Even so, a spatial data analyst often has to transform data from vector to raster or the other way around. Keep in mind that such conversions can degrade data quality.

### Raster Data

Raster data is any pixelated (or gridded) data where each pixel has a value and is associated with a specific geographic location. The value of a pixel can be continuous (e.g., elevation, temperature) or categorical (e.g., land-use type). If this sounds familiar, it is because this data structure is very common: it's how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster's extent and cell size, the number of rows and columns, and its Coordinate Reference System (CRS), which will be explained in [episode 3](03-crs.md) of this workshop.
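
To make the link between a pixel and its geographic location concrete, here is a minimal sketch that assumes a north-up raster with square cells. The function name `pixel_center` and the extent values are hypothetical; geospatial libraries normally handle this through an affine transform.

```python
def pixel_center(row, col, x_min, y_max, cell_size):
    """Return the (x, y) map coordinate of the center of pixel (row, col).

    Row 0 is the top row of the raster, so y decreases as the row index grows.
    """
    x = x_min + (col + 0.5) * cell_size
    y = y_max - (row + 0.5) * cell_size
    return x, y

# Hypothetical raster: upper-left corner at (100.0, 200.0), 10-unit square cells.
print(pixel_center(0, 0, x_min=100.0, y_max=200.0, cell_size=10.0))
print(pixel_center(2, 3, x_min=100.0, y_max=200.0, cell_size=10.0))
```

The upper-left corner and the cell size are enough to locate every pixel, which is why the extent and resolution are part of a raster's spatial metadata.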

![Raster Concept (Source: National Ecological Observatory Network (NEON))](fig/E01/raster_concept.png){alt="raster concept"}

Some examples of continuous rasters include:

1. Precipitation maps.
2. Maps of tree height derived from LiDAR data.
3. Elevation values for a region.
2. Elevation maps.

A map of elevation for Harvard Forest derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
A map of elevation for *Harvard Forest* derived from the [NEON AOP LiDAR sensor](https://www.neonscience.org/data-collection/airborne-remote-sensing)
is below. Elevation is represented as a continuous numeric variable in this map. The legend
shows the continuous range of values in the data from around 300 to 420 meters.

Expand All @@ -69,8 +52,7 @@ continuous value such as elevation or temperature. Some examples of classified
maps include:

1. Landcover / land-use maps.
2. Tree height maps classified as short, medium, and tall trees.
3. Elevation maps classified as low, medium, and high elevation.
2. Elevation maps classified as low, medium, and high elevation.
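
Turning a continuous raster into a classified one can be sketched as follows. The class breaks (340 m and 380 m) and the tiny elevation grid are assumptions chosen for illustration, not values from the lesson data.

```python
from bisect import bisect_right

# Hypothetical class breaks in meters; real breaks depend on the dataset.
BREAKS = [340, 380]
LABELS = ["low", "medium", "high"]

def classify(elevation):
    """Map a continuous elevation value to a categorical label."""
    return LABELS[bisect_right(BREAKS, elevation)]

# A tiny made-up elevation raster (2 rows x 3 columns), values in meters.
elevation = [
    [305.0, 342.5, 390.0],
    [338.9, 380.0, 415.2],
]
classified = [[classify(v) for v in row] for row in elevation]
print(classified)
```

With `bisect_right`, a value exactly equal to a break falls into the higher class; that is a design choice and should be stated in the map legend.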

![USA landcover classification](fig/E01/USA_landcover_classification.png){alt="USA landcover classification"}

Expand Down Expand Up @@ -147,12 +129,12 @@ of changes in resolution.
### Raster Data Format for this Workshop

Raster data can come in many different formats. For this workshop, we will use
the GeoTIFF format which has the extension `.tif`. A `.tif` file stores metadata
or attributes about the file as embedded `tif tags`. For instance, your camera
the GeoTIFF format which has the extension `.tif`, since it is one of the most commonly used formats.
A `.tif` file stores metadata or attributes about the file as embedded `tif tags`. For instance, your camera
might store a tag that describes the make and model of the camera or the date
the photo was taken when it saves a `.tif`. A GeoTIFF is a standard `.tif` image
format with additional spatial (georeferencing) information embedded in the file
as tags. These tags should include the following raster metadata:
as tags. These tags include the following raster metadata:

1. Extent
2. Resolution
Expand All @@ -174,14 +156,11 @@ from a GeoTIFF file.
### Multi-band Raster Data

A raster can contain one or more bands. One type of multi-band raster
dataset that is familiar to many of us is a color
image. A basic color image consists of three bands: red, green, and blue.
Each
dataset that is familiar to many of us is a color image. A basic color
image often consists of three bands: red, green, and blue (RGB). Each
band represents light reflected from the red, green or blue portions of
the
electromagnetic spectrum. The pixel brightness for each band, when
composited
creates the colors that we see in an image.
the electromagnetic spectrum. The pixel brightness for each band, when
composited, creates the colors that we see in an image.
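
The compositing idea can be sketched with three small single-band grids. The band values are made up, and plain nested lists stand in for the array types a raster library would use.

```python
# Three hypothetical 2x2 bands with 8-bit brightness values (0-255).
red   = [[255, 0], [0, 255]]
green = [[0, 255], [0, 255]]
blue  = [[0, 0], [255, 255]]

def composite(r, g, b):
    """Combine three single-band rasters into one raster of (R, G, B) pixels."""
    return [list(zip(r_row, g_row, b_row)) for r_row, g_row, b_row in zip(r, g, b)]

rgb = composite(red, green, blue)
print(rgb[0][0])  # pure red pixel
print(rgb[1][1])  # white pixel: full brightness in all three bands
```

Each pixel in the composite takes one brightness value from the same position in every band.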

![RGB multi-band raster image (Source: National Ecological Observatory Network (NEON).)](fig/E01/RGBSTack_1.jpg){alt="multi-band raster"}

Expand Down
30 changes: 18 additions & 12 deletions episodes/02-intro-vector-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,9 @@ vertex that has a defined x, y location.

* **Polygons:** A polygon consists of 3 or more vertices that are connected and
closed. The outlines of survey plot boundaries, lakes, oceans, and states or
countries are often represented by polygons.
countries are often represented by polygons. Note that polygons can also contain one
or more holes, for instance a plot boundary with a lake in it. These polygons are
considered *complex* or *donut* polygons.
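
A polygon with a hole can be sketched as an exterior ring plus a list of interior rings. The example below computes its area with the shoelace formula; the function names and the plot and lake coordinates are hypothetical.

```python
def ring_area(ring):
    """Unsigned area of a closed ring of (x, y) vertices (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_area(exterior, holes=()):
    """Area of the exterior ring minus the area of every hole."""
    return ring_area(exterior) - sum(ring_area(h) for h in holes)

# A 10 x 10 plot with a 2 x 2 lake in the middle (made-up coordinates).
plot = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
lake = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]
print(polygon_area(plot, holes=[lake]))
```

Each ring repeats its first vertex at the end, which is what makes the polygon closed.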

:::callout
## Data Tip
Expand All @@ -56,8 +58,11 @@ are represented by which vector type.
::::solution
## Solution

State boundaries are polygons. The Fisher Tower location is
a point. There are no line features shown.
State boundaries are shown as polygons. The Fisher Tower location is
represented by a purple point. There are no line features shown.
Note that at a different scale the Fisher Tower could also have been represented as a polygon.
Keep in mind that the purpose for which a dataset is created and intended to be used determines
which vector type it uses.
::::
:::

Expand All @@ -66,14 +71,14 @@ Vector data has some important advantages:
* The geometry itself contains information about what the dataset creator thought was important
* The geometry structures hold information in themselves - why choose point over polygon, for instance?
* Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
* Data storage can be very efficient compared to rasters
* Data storage can, depending on the scale, be very efficient compared to rasters
* Network analysis, for instance calculating the shortest route between A and B, requires topologically correct lines. This is not possible with raster data.

The downsides of vector data include:

* Potential loss of detail compared to raster
* Potential bias in datasets - what didn't get recorded?
* Potential bias in datasets - what didn't get recorded? Often vector data are interpreted datasets like topographical maps and have been collected by someone else, for another purpose.
* Calculations involving multiple vector layers need to do math on the
geometry as well as the attributes, so can be slow compared to raster math.
geometry as well as the attributes, which can be slow compared to raster calculations.

Vector datasets are in use in many industries besides geospatial fields. For
instance, computer graphics are largely vector-based, although the data
Expand All @@ -85,8 +90,9 @@ their features to real-world locations.
## Vector Data Format for this Workshop

Like raster data, vector data can also come in many different formats. For this
workshop, we will use the Shapefile format. A Shapefile format consists of multiple
files in the same directory, of which `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are `.prj` and `shp.xml` files.
workshop, we will use the GeoPackage format. GeoPackage is developed by the [Open Geospatial Consortium](https://www.ogc.org/) and is *an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information* (source: [https://www.geopackage.org/](https://www.geopackage.org/)). A GeoPackage file, with the extension `.gpkg`, is a single file that contains the geometries of features, their attributes, and information about the CRS used.

Another vector format that you will probably come across quite often is the Shapefile. Although we will not be using that format in this workshop, it is useful to understand how it works. A Shapefile consists of multiple files in the same directory, of which the `.shp`, `.shx`, and `.dbf` files are mandatory. Other non-mandatory but very important files are the `.prj` and `shp.xml` files.

- The `.shp` file stores the feature geometry itself
- `.shx` is a positional index of the feature geometry to allow quickly searching forwards and backwards the geographic coordinates of each vertex in the vector
Expand Down Expand Up @@ -114,8 +120,8 @@ objects in a single shapefile.

More about shapefiles can be found on
[Wikipedia.](https://en.wikipedia.org/wiki/Shapefile) Shapefiles are often publicly
available from government services, such as [this page from the US Census Bureau][us-cb] or
[this one from Australia's Data.gov.au website](https://data.gov.au/data/dataset?res_format=SHP).
available from public sources, such as [this page containing administrative boundaries for all countries in the world](https://gadm.org/download_country.html) or
[topographical vector data from OpenStreetMap](https://download.geofabrik.de/).
:::

:::callout
Expand All @@ -131,9 +137,9 @@ effects of particular data manipulations are more predictable if you are
confident that all of your input data has the same characteristics.
:::

[us-cb]: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

:::keypoints
- Vector data structures represent specific features on the Earth's surface along with attributes of those features.
- Vector data is often interpreted data that was collected for a purpose other than the one you want to use it for.
- Vector objects are either points, lines, or polygons.
:::
20 changes: 12 additions & 8 deletions episodes/03-crs.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,22 +27,22 @@ CRS (coordinate reference system) and SRS (spatial reference system) are synonym
will use only CRS throughout this workshop.
:::

The CRS associated with a dataset tells your mapping software (for example Python)
The CRS associated with a dataset tells your mapping software
where the raster is located in geographic space. It also tells the mapping
software what method should be used to flatten or project the raster in
geographic space.

![Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}

The above image shows maps of the United States in different projections. Notice
The image below (figure 3.1) shows maps of the United States in different projections. Notice
the differences in shape associated with each projection. These differences are
a direct result of the calculations used to flatten the data onto a
2-dimensional map.

![Figure 3.1: Maps of the United States in different projections (Source: opennews.org)](https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg){alt="US difference projections"}

There are lots of great resources that describe coordinate reference systems and
projections in greater detail. For the purposes of this workshop, what is
important to understand is that data from the same location but saved in
different projections will not line up in any GIS or other program. Thus, it's
different projections will not line up. Thus, it is
important when working with spatial data to identify the coordinate reference
system applied to the data and retain it throughout data processing and
analysis.
Expand All @@ -55,14 +55,14 @@ CRS information has three components:
degrees) and defines the starting point (i.e. where is [0,0]?) so the angles
reference a meaningful spot on the earth. Common global datums are WGS84 and
NAD83. Datums can also be local - fit to a particular area of the globe, but
ill-fitting outside the area of intended use. In this workshop, we will use the
[WGS84
datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84).
ill-fitting outside the area of intended use. For instance, local cadastre, land registry, and mapping agencies require high-quality datasets, which can be obtained by using a local system. In this workshop, we will use the
[WGS84 datum](https://www.linz.govt.nz/data/geodetic-system/datums-projections-and-heights/geodetic-datums/world-geodetic-system-1984-wgs84). The datum is often also referred to as the Geographical Coordinate System.

* **Projection:** A mathematical transformation of the angular measurements on a
round earth to a flat surface (i.e. paper or a computer screen). The units
associated with a given projection are usually linear (feet, meters, etc.). In
this workshop, we will see data in two different projections.
Note that the projection is also often referred to as the Projected Coordinate System.

* **Additional Parameters:** Additional parameters are often necessary to create
the full coordinate reference system. One common additional parameter is a
Expand Down Expand Up @@ -91,6 +91,8 @@ stem of the fruit. What other parameters could be included in this analogy?

## Which projection should I use?

A well-known projection is the [Mercator projection](https://en.wikipedia.org/wiki/Mercator_projection), introduced by the Flemish cartographer Gerardus Mercator in the 16th century. It is a so-called cylindrical projection, meaning that a virtual cylinder is placed around the globe to flatten it. The projection is relatively accurate near the equator, but it increasingly enlarges features towards the poles (see [Cylindrical projection](https://gisgeography.com/cylindrical-projection/)). The main advantage of the Mercator projection is that it is very suitable for navigation, since north is always *up* and south is always *down*. In the 17th century this projection was essential for sailors navigating the oceans.
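
The growth of distortion towards the poles can be quantified with a short sketch. It assumes a spherical earth (not the WGS84 ellipsoid), for which the local linear scale factor of the Mercator projection is 1 / cos(latitude). The function names are chosen for illustration only.

```python
import math

def mercator_y(lat_deg, radius=1.0):
    """Northing of a latitude on a spherical Mercator projection."""
    lat = math.radians(lat_deg)
    return radius * math.log(math.tan(math.pi / 4 + lat / 2))

def mercator_scale(lat_deg):
    """Local linear scale factor, 1 / cos(latitude), on a spherical Mercator map."""
    return 1.0 / math.cos(math.radians(lat_deg))

# Print how much lengths are stretched at a few latitudes.
for lat in (0, 30, 60, 80):
    print(lat, round(mercator_scale(lat), 2))
```

At 60 degrees latitude the scale factor is about 2, so lengths are drawn roughly twice as long as at the equator and areas roughly four times as large.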

To decide if a projection is right for your data, answer these questions:

* What is the area of minimal distortion?
Expand Down Expand Up @@ -212,9 +214,11 @@ generated and maintained manually.
* [Choosing the Right Map Projection.](https://source.opennews.org/en-US/learning/choosing-right-map-projection/)
* [Video](https://www.youtube.com/embed/KUF_Ckv8HbE) highlighting how map projections can make continents
seems proportionally larger or smaller than they actually are.
* [The True Size](https://www.thetruesize.com/): an intuitive web map that allows you to drag countries to another place in the Web Mercator projection.
:::

:::keypoints
- All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.
- A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.
- All maps are distorted because of the projection.
:::