diff --git a/01-overview.qmd b/01-overview.qmd
index 39b0e34..6598efd 100644
--- a/01-overview.qmd
+++ b/01-overview.qmd
@@ -40,9 +40,9 @@ By the end of the module, students should be able to:
To reproduce the code in the book, you need the following software packages:
-- R-4.2.2
-- RStudio 2022.12.0-353
-- Quarto 1.2.280
+- R-4.3.1
+- RStudio 2023.09.0+463
+- Quarto 1.3.450
- the list of libraries in the next section
To check your version of:
@@ -64,6 +64,7 @@ The list of libraries used in this book is provided below:
- `arm`
- `car`
- `corrplot`
+- `devtools`
- `FRK`
- `gghighlight`
- `ggplot2`
@@ -94,15 +95,19 @@ The list of libraries used in this book is provided below:
- `tmap`
- `tufte`
- `viridis`
+- `basemapR`
Copy, paste and run the code below in your console. Ensure all packages are installed on your computer.
```{r}
#| eval: false
-deps <- list(
+
+# package names
+packages <- c(
"arm",
"car",
"corrplot",
+ "devtools",
"FRK",
"gghighlight",
"ggplot2",
@@ -120,7 +125,6 @@ deps <- list(
"merTools",
"plyr",
"RColorBrewer",
- "rgdal",
"sf",
"sjPlot",
"sp",
@@ -133,14 +137,22 @@ deps <- list(
"tufte",
"viridis"
)
-```
-```{r}
-#| eval: false
-# we can load them all to make sure they are installed:
-for(lib in deps){library(lib, character.only = TRUE)}
+# install packages not yet installed
+installed_packages <- packages %in% rownames(installed.packages())
+if (any(installed_packages == FALSE)) {
+ install.packages(packages[!installed_packages])
+}
+
+# load packages
+invisible(lapply(packages, library, character.only = TRUE))
```
+::: column-margin
+::: callout-note
+To install the library `basemapR`, you need to install it from source by running:
+
+`library(devtools)`\
+`install_github('Chrisjb/basemapR')`
+:::
+:::
+
## Assessment
The final module mark is composed of the *two computational essays*. Together they are designed to cover the materials introduced in the entirety of content covered during the semester. A computational essay is an essay whose narrative is supported by code and computational results that are included in the essay itself. Each teaching week, you will be required to address a set of questions relating to the module content covered in that week, and to use the material that you will produce for this purpose to build your computational essay.
diff --git a/02-spatial_data.qmd b/02-spatial_data.qmd
index bb367ac..ec3398e 100644
--- a/02-spatial_data.qmd
+++ b/02-spatial_data.qmd
@@ -53,7 +53,7 @@ An alternative approach is to use the smallest geographical system available and
### Ecological Fallacy
-Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is @robinson1950ecological who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. @robinson1950ecological used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was `0.53`. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative `-0.11` which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.
+Ecological fallacy is an error in the interpretation of statistical data based on aggregate information. Specifically it refers to inferences made about the nature of specific individuals based solely on statistics aggregated for a given group. It is about thinking that relationships observed for groups necessarily hold for individuals. A key example is @robinson1950ecological who illustrates this problem exploring the difference between ecological correlations and individual correlations. He looked at the relationship between country of birth and literacy. @robinson1950ecological used the percent of foreign-born population and percent of literate population for the 48 states in the United States in 1930. The ecological correlation based on these data was 0.53. This suggests a positive association between foreign birth and literacy, and could be interpreted as foreign born individuals being more likely to be literate than native-born individuals. Yet, the correlation based on individual data was negative -0.11 which indicates the opposite. The main point emerging from this example is to carefully interpret analysis based on spatial data and avoid making inferences about individuals from these data.
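+
+A quick way to see how this can arise is a small illustrative simulation (hypothetical data, not Robinson's figures): group averages of two variables can be positively correlated while the relationship within groups runs the other way.
+
+```{r}
+#| eval: false
+# illustrative sketch only: 48 groups whose averages of x and y rise together,
+# while x and y are negatively related within each group
+set.seed(1)
+grp <- data.frame(g = 1:48, mx = rnorm(48))
+grp$my <- grp$mx + rnorm(48, sd = 0.3)
+ind <- do.call(rbind, lapply(grp$g, function(i) {
+  x_dev <- rnorm(200, sd = 3)
+  data.frame(g = i,
+             x = grp$mx[i] + x_dev,
+             y = grp$my[i] - 0.3 * x_dev + rnorm(200, sd = 0.5))
+}))
+cor(grp$mx, grp$my)  # ecological (group-level) correlation: strongly positive
+cor(ind$x, ind$y)    # individual-level correlation: negative
+```
+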
### Spatial Dependence
diff --git a/03-data-wrangling.qmd b/03-data-wrangling.qmd
index a6766b5..3cdcd9d 100644
--- a/03-data-wrangling.qmd
+++ b/03-data-wrangling.qmd
@@ -1,10 +1,8 @@
# Data Wrangling {#sec-chp3}
-This chapter[^03-data-wrangling-1] introduces computational notebooks, basic functions and data types. These are all important concepts that we will use during the module.
+In this chapter, we will cover the fundamentals of the concepts and functions that you will need to navigate this book. We will introduce computational notebooks and how they work, as well as basic R functions and data types, including the use of factors. We will also offer a basic introduction to manipulating and mapping spatial data frames using commonly used libraries such as `tidyverse`, `sf`, `ggplot` and `tmap`.
-[^03-data-wrangling-1]: This chapter is part of [Spatial Analysis Notes](index.html)
[Introduction -- R Notebooks + Basic Functions + Data Types]{xmlns:dct="http://purl.org/dc/terms/" property="dct:title"} by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
-
-If you are already familiar with R, R notebooks and data types, you may want to jump to Section [Read Data](#sec_readdata) and start from there. This section describes how to read and manipulate data using `sf` and `tidyverse` functions, including `mutate()`, `%>%` (known as pipe operator), `select()`, `filter()` and specific packages and functions how to manipulate spatial data.
+If you are already familiar with R, R computational notebooks and data types, you may want to jump to Section [Read Data](#sec_readdata) and start from there. This section describes how to read and manipulate data using `sf` and `tidyverse` functions, including `mutate()`, `%>%` (known as the pipe operator), `select()`, `filter()`, and specific packages and functions to manipulate spatial data.
The chapter is based on:
@@ -18,30 +16,33 @@ The chapter is based on:
## Dependencies
-This tutorial uses the libraries below. Ensure they are installed on your machine[^03-data-wrangling-2] before loading them executing the following code chunk:
+This chapter uses the libraries below. Ensure they are installed on your machine[^03-data-wrangling-1] before you progress.
+
+[^03-data-wrangling-1]: You can install package `mypackage` by running the command `install.packages("mypackage")` on the R prompt or through the `Tools --> Install Packages...` menu in RStudio.
-[^03-data-wrangling-2]: You can install package `mypackage` by running the command `install.packages("mypackage")` on the R prompt or through the `Tools --> Install Packages...` menu in RStudio.
+```{r}
+#| include: false
+rm(list=ls())
+```
-```{r, message = FALSE}
-# Data manipulation, transformation and visualisation
+```{r}
+#| warning: false
+# data manipulation, transformation and visualisation
library(tidyverse)
-# Nice tables
+# nice tables
library(kableExtra)
-# Simple features (a standardised way to encode vector data ie. points, lines, polygons)
+# spatial data manipulation
library(sf)
-# Spatial objects conversion
-library(sp)
-# Thematic maps
+# thematic mapping
library(tmap)
-# Colour palettes
+# colour palettes
library(RColorBrewer)
-# More colour palettes
library(viridis)
```
## Introducing R
-R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It has gained widespread use in academia and industry. R offers a wider array of functionality than a traditional statistics package, such as SPSS and is composed of core (base) functionality, and is expandable through libraries hosted on [CRAN](https://cran.r-project.org). CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.
+R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It has gained widespread use in academia and industry. R offers a wider array of functionality than a traditional statistics package, such as SPSS, and is composed of core (base) functionality which is expandable through libraries hosted on [The Comprehensive R Archive Network (CRAN)](https://cran.r-project.org). CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R.
Commands are sent to R using either the terminal / command line or the R Console which is installed with R on either Windows or OS X. On Linux, there is no equivalent of the console, however, third party solutions exist. On your own machine, R can be installed from [here](https://www.r-project.org/).
@@ -58,11 +59,16 @@ If you would like to know more about the various features of RStudio, watch this
Before we start any analysis, ensure to set the path to the directory where we are working. We can easily do that with `setwd()`. Please replace in the following line the path to the folder where you have placed this file -and where the `data` folder lives.
```{r}
-#setwd('../data/sar.csv')
-#setwd('.')
+#| eval: false
+# replace with the folder containing this file and the data folder
+setwd('path/to/your/folder')
```
-Note: It is good practice to not include spaces when naming folders and files. Use *underscores* or *dots*.
+::: column-margin
+::: callout-note
+It is good practice to not include spaces when naming folders and files. Use *underscores* or *dots*.
+:::
+:::
You can check your current working directory by typing:
@@ -74,6 +80,14 @@ getwd()
An *R script* is a series of commands that you can execute at one time and help you save time. So you do not repeat the same steps every time you want to execute the same process with different datasets. An R script is just a plain text file with R commands in it.
+::: column-margin
+::: callout-note
+To get familiar with good practices in writing your code in R, we recommend the chapters [Workflow: basics](https://r4ds.hadley.nz/workflow-basics.html) and [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts) from the R for Data Science book by @wickham2023r.
+:::
+:::
+
To create an R script in RStudio, you need to
- Open a new script file: *File* \> *New File* \> *R Script*
@@ -88,7 +102,7 @@ mtcars
- Save the script: *File* \> *Save As*, select your required destination folder, and enter any filename that you like, provided that it ends with the file extension *.R*
-An *R Notebook* is an R Markdown document with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see @Xie_et_al_2019_book.
+An *R Notebook* or a *Quarto Document* is a Markdown-based document with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see @Xie_et_al_2019_book. A *Quarto Document* is an improved version of the original *R Notebook* and requires [Quarto](https://quarto.org), an open-source scientific publishing system. Quarto does not have a dependency or requirement for R: it is multilingual, beginning with R, Python, Javascript and Julia, and is designed to work even for languages that do not yet exist. This book was originally written in *R Notebooks* but later transitioned into *Quarto Documents*.
To create an R Notebook, you need to:
@@ -102,7 +116,7 @@ To create an R Notebook, you need to:
2) use the keyboard shortcut *Ctrl + Alt + I* or *Cmd + Option + I* (Mac); or,
3) type the chunk delimiters ```` ```{r} ```` and ```` ``` ````
-In a chunk code you can produce text output, tables, graphics and write code! You can control these outputs via chunk options which are provided inside the curly brackets eg.
+In a code chunk you can produce text output, tables, graphics and write code! You can control these outputs via chunk options which are provided inside the curly brackets e.g.:
```{r, include=FALSE}
hist(mtcars$mpg)
@@ -116,9 +130,7 @@ hist(mtcars$mpg)
Rstudio also offers a *Preview* option on the toolbar which can be used to create pdf, html and word versions of the notebook. To do this, choose from the drop-down list menu `knit to ...`
-For this module, we will be using computational notebooks through [Quarto](https://quarto.org); that is, *Quarto Document*. *"Quarto is a multi-language, next generation version of R Markdown from RStudio, with many new new features and capabilities. Like R Markdown, Quarto uses Knitr to execute R code, and is therefore able to render most existing Rmd files without modification."*
-
-To create a Quarto Document, you need to:
+To create a *Quarto Document*, you need to:
- Open a new script file: *File* \> *New File* \> *Quarto Document*
@@ -128,13 +140,15 @@ Quarto Documents work in the same way as R Notebooks with small variations. You
You can use `help` or `?` to ask for details for a specific function:
-```{r, eval=FALSE}
-help(sqrt) #or ?sqrt
+```{r}
+#| eval: false
+help(sqrt) #or
+?sqrt
```
And using `example` provides examples for said function:
-```{r, fig.margin = TRUE, fig.cap = 'Example sqrt'}
+```{r}
example(sqrt)
```
@@ -236,7 +250,7 @@ In statistics, we differentiate between data to capture:
In R these three types of random variables are represented by the following types of R data object:
-```{r, echo=FALSE, fig.fullwidth = FALSE, fig.margin = FALSE, fig.cap= 'Survey and R data types'}
+```{r, echo=FALSE}
text_tbl <- data.frame(
variables = c("nominal", "ordinal", "discrete", "continuous"),
@@ -252,9 +266,9 @@ We have already encountered the R data type *numeric*. The next section introduc
**What is a factor?**
-A factor variable assigns a numeric code to each possible category (*level*) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of `10,000` *males* and *females* to a list of `10,000` `1s` and `0s`. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.
+A factor variable assigns a numeric code to each possible category (*level*) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 *males* and *females* to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.
-For example, the variable *gender*, converted to a factor, would be stored as a series of `1s` and `2s`, where `1 = female` and `2 = male`; but would be displayed in all outputs using their category labels of *female* and *male*.
+For example, the variable *gender*, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of *female* and *male*.
**Creating a factor**
@@ -268,13 +282,13 @@ class(gender)
str(gender)
```
-Now *gender* is a factor and is stored as a series of `1s` and `2s`, with `1s` representing `females` and `2s` representing `males`. The function `levels( )` lists the levels (categories) associated with a given factor variable:
+Now *gender* is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function `levels( )` lists the levels (categories) associated with a given factor variable:
```{r}
levels(gender)
```
-The categories are reported in the order that they have been numbered (starting from `1`). Hence from the output we can infer that `females` are coded as `1`, and `males` as `2`.
+The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.
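+
+Because the underlying codes are just integers, you can inspect them directly. A quick check (a sketch, assuming the `gender` factor created above):
+
+```{r}
+#| eval: false
+as.numeric(gender)  # the underlying integer codes (1s and 2s)
+table(gender)       # counts by level, displayed with the labels
+```
+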
## Data Frames
@@ -309,13 +323,15 @@ str(df) # or use glimpse(data)
### Referencing Data Frames
-Throughout this module, you will need to refer to particular parts of a dataframe - perhaps a particular column (an area attribute); or a particular subset of respondents. Hence it is worth spending some time now mastering this particular skill.
+You will often need to refer to particular parts of a data frame - say, a particular column (an area attribute), or a particular subset of respondents. Hence it is worth spending some time understanding how to reference data frames.
The relevant R function, `[ ]`, has the format `[row,col]` or, more generally, `[set of rows, set of cols]`.
Run the following commands to get a feel of how to extract different slices of the data:
-```{r, eval=FALSE}
+```{r}
+#| eval: false
+
df # whole data.frame
df[1, 1] # contents of first row and column
df[2, 2:3] # contents of the second row, second and third columns
@@ -337,7 +353,8 @@ Run both of these fuctions on their own to get a better understanding of what th
Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:
-```{r, eval=FALSE}
+```{r}
+#| eval: false
df[,"pop"] # variable name in quotes inside the square brackets
df$pop # variable name prefixed with $ and appended to the data.frame name
# or you can use attach
@@ -359,6 +376,12 @@ Ensure your memory is clear
rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects
```
+::: column-margin
+::: callout-note
+When opening a file, ensure the directory is correctly set and points to your data. It may differ from your current working directory.
+:::
+:::
+
There are many commands to read / load data onto R. The command to use will depend upon the format they have been saved. Normally they are saved in *csv* format from Excel or other software packages. So we use either:
- `df <- read.table("path/file_name.csv", header = FALSE, sep =",")`
@@ -370,19 +393,16 @@ To read files in other formats, refer to this useful [DataCamp tutorial](https:/
```{r}
census <- read.csv("data/census/census_data.csv")
head(census)
-# NOTE: always ensure your are setting the correct directory leading to the data.
-# It may differ from your existing working directory
+
```
### Quickly inspect the data
-1. What class?
-
-2. What R data types?
+Use the following questions to guide the inspection: (1) What class is the object? (2) What R data types does it contain? (3) What data types do the variables represent?
-3. What data types?
+```{r}
+#| eval: false
-```{r, eval=FALSE}
# 1
class(census)
# 2 & 3
@@ -406,7 +426,8 @@ or want to view the data:
Usually you want to add / create new variables to your data frame using existing variables eg. computing percentages by dividing two variables. There are many ways in which you can do this i.e. referecing a data frame as we have done above, or using `$` (e.g. `census$pop`). For this module, we'll use `tidyverse`:
```{r}
-census <- census %>% mutate(per_ghealth = ghealth / pop)
+census <- census %>%
+ mutate( per_ghealth = ghealth / pop )
```
Note we used a *pipe operator* `%>%`, which helps make the code more efficient and readable - more details, see @grolemund_wickham_2019_book. When using the pipe operator, recall to first indicate the data frame before `%>%`.
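+
+As a sketch of what the pipe does, these two calls are equivalent: the pipe passes the object on its left as the first argument of the function on its right.
+
+```{r}
+#| eval: false
+census %>% mutate(per_ghealth = ghealth / pop)
+mutate(census, per_ghealth = ghealth / pop)
+```
+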
@@ -418,7 +439,8 @@ Note also the use a variable name before the `=` sign in brackets to indicate th
Usually you want to select a subset of variables for your analysis as storing to large data sets in your R memory can reduce the processing speed of your machine. A selection of data can be achieved by using the `select` function:
```{r}
-ndf <- census %>% select(ward, pop16_74, per_ghealth)
+ndf <- census %>%
+ select( ward, pop16_74, per_ghealth )
```
Again first indicate the data frame and then the variable you want to select to build a new data frame. Note the code chunk above has created a new data frame called `ndf`. Explore it.
@@ -428,14 +450,15 @@ Again first indicate the data frame and then the variable you want to select to
You may also want to filter values based on defined conditions. You may want to filter observations greater than a certain threshold or only areas within a certain region. For example, you may want to select areas with a percentage of good health population over 50%:
```{r}
-ndf2 <- census %>% filter(per_ghealth < 0.5)
+ndf2 <- census %>%
+ filter( per_ghealth < 0.5 )
```
You can use more than one variables to set conditions. Use "`,`" to add a condition.
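+
+For example, a sketch combining two conditions (both variables exist in `census`):
+
+```{r}
+#| eval: false
+census %>%
+  filter(per_ghealth < 0.5, pop16_74 > 1000)
+```
+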
### Joining Data Drames
-When working with spatial data, we often need to join data. To this end, you need a common unique `id variable`. Let's say, we want to add a data frame containing census data on households for Liverpool, and join the new attributes to one of the existing data frames in the workspace. First we will read the data frame we want to join (ie. `census_data2.csv`).
+When working with spatial data, we often need to join data. To this end, you need a common unique `id variable`. Let's say, we want to add a data frame containing census data on households for Liverpool, and join the new attributes to one of the existing data frames in the workspace. First we will read the data frame we want to join (i.e. `census_data2.csv`).
```{r}
# read data
@@ -448,38 +471,48 @@ The variable `geo_code` in this data frame corresponds to the `code` in the exis
```{r}
# join data frames
-join_dfs <- merge(census, census2, by.x="code", by.y="geo_code", all.x = TRUE)
+join_dfs <- merge( census, # df1
+ census2, # df2
+ by.x="code", by.y="geo_code", # common ids
+ all.x = TRUE)
# check data
head(join_dfs)
```
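+
+An equivalent join can be written with `dplyr` (loaded with `tidyverse`); a sketch, assuming the same key columns as the `merge()` call above:
+
+```{r}
+#| eval: false
+join_dfs2 <- census %>%
+  left_join(census2, by = c("code" = "geo_code"))
+```
+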
### Saving Data
-It may also be convinient to save your R projects. They contains all the objects that you have created in your workspace by using the `save.image( )` function:
+It may also be convenient to save your R workspace, which contains all the objects that you have created, by using the `save.image( )` function:
-```{r, eval=FALSE}
+```{r}
+#| eval: false
save.image("week1_envs453.RData")
```
This creates a file labelled "week1_envs453.RData" in your working directory. You can load this at a later stage using the `load( )` function.
-```{r,eval=FALSE}
+```{r}
+#| eval: false
load("week1_envs453.RData")
```
Alternatively you can save / export your data into a `csv` file. The first argument in the function is the object name, and the second: the name of the csv we want to create.
-```{r, eval=FALSE}
+```{r}
+#| eval: false
write.csv(join_dfs, "join_censusdfs.csv")
```
## Using Spatial Data Frames
-A core area of this module is learning to work with spatial data in R. R has various purposedly designed **packages** for manipulation of spatial data and spatial analysis techniques. Various R packages exist in CRAN eg. `spatial`, `sgeostat`, `splancs`, `maptools`, `tmap`, `rgdal`, `spand` and more recent development of `sf` - see @lovelace2019 for a great description and historical context for some of these packages.
+A core area of the module is learning to work with spatial data in R. R has various purposely designed `packages` for the manipulation of spatial data and spatial analysis techniques. Various packages exist in CRAN, including `sf` [@sf2018; @R-sf], `stars` [@R-stars], `terra`, `s2` [@R-s2], `lwgeom` [@R-lwgeom], `gstat` [@pebesma2004; @R-gstat], `spdep` [@R-spdep], `spatialreg` [@R-spatialreg], `spatstat` [@baddeley2015spatial; @R-spatstat], `tmap` [@tmap2018; @R-tmap], `mapview` [@R-mapview] and more. A key package in this ecosystem is `sf` [@pebesma2023spatial]. The `sf` package provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in industry and open source projects, to build upon more modern versions of the open source geospatial software stack, and to allow the integration of R spatial software with the `tidyverse` [@tidyverse2019], particularly `ggplot2`, `dplyr` and `tidyr`. Hence, this book relies heavily on `sf` for the manipulation and analysis of spatial data.
-During this session, we will use `sf`.
+::: column-margin
+::: callout-note
+@lovelace2024 provide a helpful overview and evolution of R spatial package ecosystem.
+:::
+:::
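+
+As a minimal sketch of the `sf` data model described above (the coordinates are made up for illustration), an `sf` object is simply a data frame with a geometry list-column:
+
+```{r}
+#| eval: false
+pts <- st_sf(
+  name = c("a", "b"),
+  geometry = st_sfc(st_point(c(-2.98, 53.40)), st_point(c(-2.97, 53.41)), crs = 4326)
+)
+pts              # prints as a data frame with a geometry column
+st_geometry(pts) # the geometry list-column on its own
+```
+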
-We first need to import our spatial data. We will use a shapefile containing data at Output Area (OA) level for Liverpool. These data illustrates the hierarchical structure of spatial data.
+To read our spatial data, we use the `st_read` function. We read a shapefile containing data at the Output Area (OA) level for Liverpool. These data illustrate the hierarchical structure of spatial data.
### Read Spatial Data
@@ -489,7 +522,8 @@ oa_shp <- st_read("data/census/Liverpool_OA.shp")
Examine the input data. A spatial data frame stores a range of attributes derived from a shapefile including the **geometry** of features (e.g. polygon shape and location), **attributes** for each feature (stored in the .dbf), [projection](https://en.wikipedia.org/wiki/Map_projection) and coordinates of the shapefile's bounding box - for details, execute:
-```{r, eval=FALSE}
+```{r}
+#| eval: false
?st_read
```
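+
+A quick look at the pieces mentioned above, using standard `sf` helpers (a sketch):
+
+```{r}
+#| eval: false
+st_crs(oa_shp)      # projection / coordinate reference system
+st_bbox(oa_shp)     # bounding box
+st_geometry(oa_shp) # the geometry list-column
+```
+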
@@ -504,15 +538,19 @@ str(oa_shp)
head(oa_shp)
```
-**TASK:**
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
-- What are the geographical hierarchy in these data?
-- What is the smallest geography?
-- What is the largest geography?
+- What is the geographical hierarchy in these data?\
+- What is the smallest geography?\
+- What is the largest geography?\
+:::
+:::
### Basic Mapping
-Again, many functions exist in CRAN for creating maps:
+Many functions exist in CRAN for creating maps:
- `plot` to create static maps
- `tmap` to create static and interactive maps
@@ -521,28 +559,83 @@ Again, many functions exist in CRAN for creating maps:
- `ggplot2` to create data visualisations, including static maps
- `shiny` to create web applications, including maps
-Here this notebook demonstrates the use of `plot` and `tmap`. First `plot` is used to map the spatial distribution of non-British-born population in Liverpool. First we only map the geometries on the right,
+In this book, we will make use of `plot`, `tmap` and `ggplot`. Normally you use `plot` for a quick inspection of the data, and `tmap` and `ggplot` for publication-quality visualisations. Below, `plot` is first used to map the spatial distribution of the non-British-born population in Liverpool, starting with the geometries only.
+
+**Using `plot`**
-#### Using `plot`
+We can use the base `plot` function to display the boundaries of OAs in Liverpool.
-```{r, fig.margin = TRUE, fig.cap = 'OAs of Livepool'}
+```{r}
# mapping geometry
plot(st_geometry(oa_shp))
```
-and then:
+To visualise a column in the spatial data frame, we can run:
-```{r, fig.cap = 'Spatial distribution of ethnic groups, Liverpool'}
+```{r}
# map attributes, adding intervals
-plot(oa_shp["Ethnic"], key.pos = 4, axes = TRUE, key.width = lcm(1.3), key.length = 1.,
- breaks = "jenks", lwd = 0.1, border = 'grey')
+plot(oa_shp["Ethnic"], # variable to visualise
+ key.pos = 4,
+ axes = TRUE,
+ key.width = lcm(1.3),
+ key.length = 1.,
+ breaks = "jenks", # algorithm to categorise the data
+ lwd = 0.1,
+ border = 'grey') # boundary colour
```
-**TASK:**
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
+
+What is the key pattern emerging from this map?
+:::
+:::
-- What is the key pattern emerging from this map?
+Let us now explore `ggplot` and `tmap`.
-#### Using `tmap`
+**Using `ggplot`**
+
+We can visualise spatial data frames using `ggplot` fairly easily. `ggplot` is a generic set of functions which was not specifically designed for spatial mapping, but it is flexible enough to produce great spatial data visualisations.
+
+Following the grammar of `ggplot`, we plot spatial data by drawing layers. `ggplot` has a basic structure of three components:
+
+- The data i.e. `ggplot( data = *data frame*)`.
+- Geometries i.e. `geom_xxx( )`.
+- Aesthetic mapping i.e. `aes(x=*variable*, y=*variable*)`
+
+We can put these three components together using `+`. This is similar to the way we apply the pipe operator in the `tidyverse`. The aesthetic mapping component can be added in both the `ggplot` and `geom_xxx` functions.
+
+To map our data, we can then run:
+
+```{r}
+ggplot(data = oa_shp, aes( fill = Ethnic) ) + # add data frame and variable to map
+ geom_sf(colour = "gray60", # colour line
+ size = 0.1) # line size
+```
+
+We can change the colour scheme by using a different palette, such as those in the [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) package. The power of `viridis` is that it uses colour scales that are visually pleasing, colourblind friendly, print well in grey scale, and can be used for both categorical and continuous data. For categorical data you can also use [ColorBrewer](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) (a sketch using `RColorBrewer` follows the viridis map below).
+
+So let us (1) change the colour palette to viridis, (2) remove the colour of the boundaries, and (3) replace the theme with the theme we will be using for the book. Let's read the theme first and then implement these changes.
+
+The `ggplot` themes for the book are in a file called `data-visualisation_theme.R` in the `style` folder. We read this file by running:
+
+```{r}
+source("./style/data-visualisation_theme.R")
+```
+
+We can now implement our changes:
+
+```{r}
+ggplot(data = oa_shp, aes( fill = Ethnic) ) + # add data frame and variable to map
+ geom_sf(colour = "transparent") + # colour line
+ scale_fill_viridis( option = "viridis" ) + # add viridis colour scheme
+ theme_map_tufte()
+```
+
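+For a categorical variable, the `RColorBrewer` palettes mentioned above work in the same way. A sketch, assuming the theme file has been sourced as above (`my_factor` is a hypothetical factor column -- replace it with one from your own data):
+
+```{r}
+#| eval: false
+ggplot(data = oa_shp, aes(fill = my_factor)) +
+  geom_sf(colour = "transparent") +
+  scale_fill_brewer(palette = "Set2") +
+  theme_map_tufte()
+```
+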
+To master `ggplot`, see @wickham2009.
+
+**Using `tmap`**
Similar to `ggplot2`, `tmap` is based on the idea of a 'grammar of graphics' which involves a separation between the input data and aesthetics (i.e. the way data are visualised). Each data set can be mapped in various different ways, including location as defined by its geometry, colour and other features. The basic building block is `tm_shape()` (which defines input data), followed by one or more layer elements such as `tm_fill()` and `tm_dots()`.
@@ -560,22 +653,34 @@ map_oa = tm_shape(oa_shp) +
map_oa
```
-Note that the operation `+` is used to add new layers. You can set style themes by `tm_style`. To visualise the existing styles use `tmap_style_catalogue()`, and you can also evaluate the code chunk below if you would like to create an interactive map.
+Note that the `+` operator, as in `ggplot`, is used to add new layers. You can set style themes with `tm_style`. To visualise the existing styles use `tmap_style_catalogue()`. An advantage of `tmap` is that you can easily create an interactive map by switching modes with `tmap_mode`.
-```{r, eval=FALSE}
+```{r}
tmap_mode("view")
map_oa
```
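+
+As a sketch, you can also apply one of `tmap`'s built-in style themes to the map object created above:
+
+```{r}
+#| eval: false
+map_oa + tm_style("col_blind")
+```
+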
-**TASK:**
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
-- Try mapping other variables in the spatial data frame. Where do population aged 60 and over concentrate?
+Try mapping other variables in the spatial data frame. Where does the population aged 60 and over concentrate?
+:::
+:::
### Comparing geographies
-If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see lecture notes. To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.
+If you recall, one of the key issues of working with spatial data is the modifiable areal unit problem (MAUP) - see @spatial_data. To get a sense of the effects of the MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.
+
+::: column-margin
+::: callout-note
+The first line of the code chunk includes `tmap_mode("plot")`, which tells R that we want a static map. `tmap_mode` works like a switch between interactive and non-interactive mapping.
+:::
+:::
```{r}
+tmap_mode("plot")
+
# read data at the msoa level
msoa_shp <- st_read("data/census/Liverpool_MSOA.shp")
@@ -593,23 +698,12 @@ map_msoa = tm_shape(msoa_shp) +
tmap_arrange(map_msoa, map_oa)
```
-**TASK:**
-
-- What differences do you see between OAs and MSOAs?
-- Can you identify areas of spatial clustering? Where are they?
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
-## Useful Functions
+What differences do you see between OAs and MSOAs?
-| Function | Description |
-|--------------------------------|----------------------------------------|
-| read.csv() | read csv files into data frames |
-| str() | inspect data structure |
-| mutate() | create a new variable |
-| filter() | filter observations based on variable values |
-| %\>% | pipe operator - chain operations |
-| select() | select variables |
-| merge() | join dat frames |
-| st_read | read spatial data (ie. shapefiles) |
-| plot() | create a map based a spatial data set |
-| tm_shape(), tm_fill(), tm_borders() | create a map using tmap functions |
-| tm_arrange | display multiple maps in a single "metaplot" |
+Can you identify areas of spatial clustering? Where are they?
+:::
+:::
diff --git a/04-points.qmd b/04-points.qmd
index 3f35fbc..e8166d1 100644
--- a/04-points.qmd
+++ b/04-points.qmd
@@ -10,23 +10,25 @@ This chapter is based on the following references, which are great follow-up's o
We will rely on the following libraries in this section, all of them included in @sec-dependencies:
-```{r results='hide', , warning=FALSE, message=FALSE}
-# For pretty table
-library(knitr)
-# All things geodata
+```{r}
+#| include: false
+source("./style/data-visualisation_theme.R")
+```
+
+```{r}
+#| warning: false
+# data manipulation, transformation and visualisation
+library(tidyverse)
+# spatial data manipulation
library(sf)
library(sp)
-# Pretty graphics
-library(ggplot2)
+# data visualisation
library(gridExtra)
-# Thematic maps
-library(tmap)
-# Pretty maps
-library(ggmap)
-# For all your interpolation needs
+# basemap
+library(basemapR)
+# interpolation
library(gstat)
-# For data manipulation
-library(plyr)
+library(hexbin)
```
Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with `setwd()`. Please replace in the following line the path to the folder where you have placed this file -and where the `house_transactions` folder with the data lives.
@@ -51,21 +53,19 @@ The rest of this session will focus on two main elements of the table: the spati
```{r fig.margin=TRUE, fig.cap="Raw AirBnb prices in San Diego"}
# Create the histogram
-hist <- qplot(data=db,x=price)
-hist
+qplot( data = db, x = price)
```
This basically shows there is a lot of values concentrated around the lower end of the distribution but a few very large ones. A usual transformation to *shrink* these differences is to take logarithms. The original table already contains an additional column with the logarithm of each price (`log_price`).
-```{r fig.margin=TRUE, fig.cap="Log of AirBnb price in San Diego"}
+```{r}
# Create the histogram
-hist <- qplot(data=db, x=log_price)
-hist
+qplot( data = db, x = log_price )
```
To obtain the spatial distribution of these houses, we need to focus on the `geometry` column. The easiest, quickest (and also "dirtiest") way to get a sense of what the data look like over space is using `plot`:
-```{r fig.margin=TRUE, fig.cap="Spatial distribution of AirBnb in San Diego"}
+```{r}
plot(st_geometry(db))
```
@@ -78,26 +78,29 @@ The two-dimensional sister of histograms are binning maps: we divide each of the
```{r}
# Squared binning
# Set up plot
-sqbin <- ggplot() +
+sqbin <- ggplot( ) +
# Add 2D binning with the XY coordinates as
# a dataframe
geom_bin2d(
- data=as.data.frame(st_coordinates(db)),
- aes(x=X, y=Y)
- )
+ data = as.data.frame( st_coordinates( db ) ),
+ aes( x = X, y = Y)
+ ) +
+ # set theme
+ theme_plot_tufte()
# Hex binning
# Set up plot
hexbin <- ggplot() +
# Add hex binning with the XY coordinates as
# a dataframe
geom_hex(
- data=as.data.frame(st_coordinates(db)),
- aes(x=X, y=Y)
+ data = as.data.frame( st_coordinates( db ) ),
+ aes( x = X, y = Y)
) +
# Use viridis for color encoding (recommended)
- scale_fill_continuous(type = "viridis")
+ scale_fill_continuous( type = "viridis" ) +
+ theme_plot_tufte()
# Bind in subplots
-grid.arrange(sqbin, hexbin, ncol=2)
+grid.arrange( sqbin, hexbin, ncol = 2 )
```
## KDE
@@ -108,7 +111,7 @@ Kernel Density Estimation (KDE) is a technique that creates a *continuous* repre
KDE over a single dimension is essentially a contiguous version of a histogram. We can see that by overlaying a KDE on top of the histogram of logs that we have created before:
-```{r fig.fullwidth=TRUE, fig.cap="Histogram and KDE of the log of AirBnb prices in San Diego"}
+```{r}
# Create the base
base <- ggplot(db, aes(x=log_price))
# Histogram
@@ -116,7 +119,8 @@ hist <- base +
geom_histogram(bins=50, aes(y=..density..))
# Overlay density plot
kde <- hist +
- geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666")
+ geom_density(fill="#FF6666", alpha=0.5, colour="#FF6666") +
+ theme_plot_tufte()
kde
```
@@ -128,44 +132,35 @@ Geography, at the end of the day, is usually represented as a two-dimensional sp
To create a spatial KDE in R, we can use general tooling for non-spatial points, such as the `stat_density2d_filled` method:
-```{r fig.margin=TRUE, fig.cap="KDE of AirBnb properties in San Diego"}
+```{r}
# Create the KDE surface
kde <- ggplot(data = db) +
- stat_density2d_filled(alpha = 0.5,
+ stat_density2d_filled(alpha = 1,
data = as.data.frame(st_coordinates(db)),
aes(x = X, y = Y),
- n = 16
+ n = 100
) +
# Tweak the color gradient
scale_color_viridis_c() +
# White theme
- theme_bw()
-# Tip! Add invisible points to improve proportions
-kde + geom_sf(alpha=0)
+ theme_plot_tufte()
+kde
```
-This approach generates a surface that represents the density of dots, that is an estimation of the probability of finding a house transaction at a given coordinate. However, without any further information, they are hard to interpret and link with previous knowledge of the area. To bring such context to the figure, we can plot an underlying basemap, using a cloud provider such as Google Maps or, as in this case, OpenStreetMap. To do it, we will leverage the library `ggmap`, which is designed to play nicely with the `ggplot2` family (hence the seemingly counterintuitive example above). Before we can plot them with the online map, we need to reproject them though.
-
-```{r fig.fullwidth=TRUE, fig.cap="KDE of AirBnb properties in San Diego", warning=FALSE, message=FALSE}
-# Reproject coordinates
-lon_lat <- st_transform(db, crs = 4326) %>%
- st_coordinates() %>%
- as.data.frame()
-# Basemap
-qmplot(
- X,
- Y,
- data = lon_lat,
- geom="blank"
-) +
- # KDE
- stat_density2d_filled(alpha = 0.5,
- data = lon_lat,
+This approach generates a surface that represents the density of dots, that is, an estimation of the probability of finding a house transaction at a given coordinate. However, without any further information, it is hard to interpret and link with previous knowledge of the area. To bring such context to the figure, we can plot an underlying basemap, using a cloud provider such as Google Maps or, as in this case, OpenStreetMap. To do it, we will leverage the library `basemapR`, which is designed to play nicely with the `ggplot2` family. All we need to request the tiles is the bounding box of our data.
+
+```{r}
+
+bbox_db <- st_bbox(db)
+ggplot() +
+ base_map(bbox_db, increase_zoom = 2, basemap = "positron") +
+ #geom_sf(data = db, fill = NA) +
+ stat_density2d_filled(alpha = 0.7,
+ data = as.data.frame(st_coordinates(db)),
aes(x = X, y = Y),
- n = 16
- ) +
- # Tweak the color gradient
- scale_color_viridis_c()
+ n = 100
+ )
+
```
## Spatial Interpolation
diff --git a/05-flows.qmd b/05-flows.qmd
index 561361f..1158dff 100644
--- a/05-flows.qmd
+++ b/05-flows.qmd
@@ -13,7 +13,9 @@ Content is based on the following references, which are great follow-up's on the
We will rely on the following libraries in this section, all of them included in @sec-dependencies:
-```{r results='hide', message = FALSE, warning=FALSE}
+```{r}
+#| message: false
+#| warning: false
# Data management
library(tidyverse)
# Spatial Data management
@@ -23,20 +25,14 @@ library(sp)
library(ggplot2)
# Thematic maps
library(tmap)
-# Pretty maps
-library(ggmap)
+# Add basemaps
+library(basemapR)
# Simulation methods
library(arm)
```
In this chapter we will show a slightly different way of managing spatial data in R. Although most of the functionality will be similar to that seen in previous chapters, we will not rely on the "`sf` stack" and we will instead show how to read and manipulate data using the more traditional `sp` stack. Although this approach is being slowly phased out, it is still important to be aware of its existence and its differences with more modern approaches.
-Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with `setwd()`. Please replace in the following line the path to the folder where you have placed this file -and where the `sf_bikes` folder with the data lives.
-
-```{r}
-setwd('.')
-```
-
## Data
In this note, we will use data from the city of San Francisco representing bike trips on their public bike share system. The original source is the SF Open Data portal ([link](http://www.bayareabikeshare.com/open-data)) and the dataset comprises both the location of each station in the Bay Area as well as information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015 and the following year. Since this note is about modeling and not data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to created from the original files in the SF Data Portal is also available on the `flows_prep.ipynb` notebook [\[url\]](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), also in the same folder.
@@ -45,8 +41,6 @@ Let us then directly load the file with all the information necessary:
```{r}
db <- st_read('./data/sf_bikes/flows.geojson')
-#rownames(db@data) <- db$flow_id
-#db@data$flow_id <- NULL
```
Note how the interface is slightly different since we are reading a `GeoJSON` file instead of a shapefile.
@@ -63,13 +57,13 @@ where `orig` and `dest` are the station IDs of the origin and destination, `stre
The easiest way to get a quick preview of what the data looks like spatially is to make a simple plot:
-```{r fig.margin=TRUE, fig.cap="Potential routes"}
+```{r}
plot(db$geometry)
```
Equally, if we want to visualize a single route, we can simply subset the table. For example, to get the shape of the trip from station `39` to station `48`, we can:
-```{r fig.margin=TRUE, fig.cap="Trip from station 39 to 48"}
+```{r}
db %>%
dplyr::filter(orig == 39 & dest == 48) %>%
ggplot(data = .) +
@@ -82,52 +76,52 @@ db %>%
or, for the most popular route, we can:
-```{r fig.margin=TRUE, fig.cap="Most popular trip"}
+```{r}
most_pop <- db %>%
dplyr::filter(trips15 == max(trips15))
```
These however do not reveal a lot: there is no geographical context (*why are there so many routes along the NE?*) and no sense of how volumes of bikers are allocated along different routes. Let us fix those two.
-The easiest way to bring in geographical context is by overlaying the routes on top of a background map of tiles downloaded from the internet. Let us download this using `ggmap`:
+The easiest way to bring in geographical context is by overlaying the routes on top of a background map of tiles downloaded from the internet. Let us download this using `basemapR`:
-```{r, message = FALSE, warning=FALSE}
+```{r}
+# create a bounding box
bbox_db <- st_bbox(db)
-names(bbox_db) <- c("left", "bottom", "right", "top")
+# download a basemap using ggplot and basemapR
+ggplot() +
+ base_map(bbox_db, increase_zoom = 2, basemap = "positron") +
+ geom_sf(data = db, fill = NA, colour = "transparent")
-SanFran <- get_stamenmap(
- bbox_db,
- zoom = 14,
- maptype = "toner-lite"
- )
-
-```
-
-and make sure it looks like we intend it to look:
-
-```{r fig.margin=TRUE}
-ggmap(SanFran)
```
Now to combine tiles and routes, we need to pull out the coordinates that make up each line. For the route example above, this would be:
-```{r fig.margin=TRUE}
+```{r}
xys1 <- as.data.frame(st_coordinates(most_pop))
```
-Now we can plot the route[^05-flows-1] (note we also dim down the background to focus the attention on flows):
+Now we can plot the route (note we also dim down the background to focus the attention on flows):
-[^05-flows-1]: **EXERCISE**: *can you plot the route for the largest climb?*
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
-```{r fig.margin=TRUE}
-ggmap(SanFran, darken=0.5) +
- geom_path(
- aes(x=X, y=Y),
- data=xys1,
- size=1,
- color=rgb(0.996078431372549, 0.7019607843137254, 0.03137254901960784),
- lineend='round'
- )
+Can you plot the route for the largest climb?
+:::
+:::
+
+```{r}
+#| warning: false
+ggplot() +
+ base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
+ geom_sf( data = db, fill = NA, colour = "transparent") +
+ geom_path( data = xys1,
+ aes( x = X, y = Y ),
+ #size = 1,
+ color = "green",
+ lineend ='round'
+ )
```
Now we can plot all of the lines by using a short `for` loop to build up the table:
@@ -157,44 +151,57 @@ for(x in 1:nrow(db)){
Now we can go on and plot all of them:
-```{r fig.margin=TRUE}
-ggmap(SanFran, darken=0.75) +
- geom_path(
- aes(x=lon, y=lat, group=id),
- data=lines,
- size=0.1,
- color=rgb(0.996078431372549, 0.7019607843137254, 0.03137254901960784),
- lineend='round'
- )
+```{r}
+#| warning: false
+ggplot() +
+ # call basemap
+ base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
+ geom_sf(data = db, fill = NA, colour = "transparent") +
+ # add data
+ geom_path( data = lines,
+ aes(x=lon, y=lat,
+ group=id
+ ),
+ size = 1,
+ color = "green",
+ lineend = 'round')
+
```
Finally, we can get a sense of the distribution of the flows by associating a color gradient to each flow based on its number of trips:
-```{r fig.fullwidth=TRUE}
-ggmap(SanFran, darken=0.75) +
- geom_path(
- aes(x=lon, y=lat, group=id, colour=trips),
- data=lines,
- size=log1p(lines$trips / max(lines$trips)),
- lineend='round'
- ) +
+```{r}
+#| message: false
+#| warning: false
+ggplot() +
+ # call basemap
+ base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
+ geom_sf(data = db, fill = NA, colour = "transparent") +
+ # add flow data
+ geom_path( data = lines,
+ aes(x = lon, y = lat, group = id, colour = trips ),
+ size=log1p(lines$trips / max(lines$trips)),
+ lineend = 'round'
+ ) +
+ # create a colour palette
scale_colour_gradient(
- low='grey', high='#07eda0'
+ low='#440154FF', high='#FDE725FF'
) +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
+
```
Note how we transform the size so it's a proportion of the largest trip and then it is compressed with a logarithm.
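+
+As a quick numeric illustration of that compression (a sketch using the `lines` table built above):
+
+```{r}
+#| eval: false
+# line widths used above: trips as a proportion of the largest flow, compressed with log1p
+summary(log1p(lines$trips / max(lines$trips)))
+```
+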
## Modelling flows
-Now we have an idea of the spatial distribution of flows, we can begin to think about modeling them. The core idea in this section is to fit a model that can capture the particular characteristics of our variable of interest (the volume of trips) using a set of predictors that describe the nature of a given flow. We will start from the simplest model and then progressively build complexity until we get to a satisfying point. Along the way, we will be exploring each model using concepts from @gelman2006data such as predictive performance checks[^05-flows-2] (PPC)
+Now we have an idea of the spatial distribution of flows, we can begin to think about modeling them. The core idea in this section is to fit a model that can capture the particular characteristics of our variable of interest (the volume of trips) using a set of predictors that describe the nature of a given flow. We will start from the simplest model and then progressively build complexity until we get to a satisfying point. Along the way, we will be exploring each model using concepts from @gelman2006data such as predictive performance checks[^05-flows-1] (PPC)
-[^05-flows-2]: For a more elaborate introduction to PPC, have a look at Chapters 7 and 8.
+[^05-flows-1]: For a more elaborate introduction to PPC, have a look at Chapters 7 and 8.
Before we start running regressions, let us first standardize the predictors so we can interpret the intercept as the average flow when all the predictors take the average value, and so we can interpret the model coefficients as changes in standard deviation units:
@@ -222,14 +229,14 @@ summary(m1)
To explore how good this model is, we will be comparing the predictions the model makes about the number of trips each flow should have with the actual number of trips. A first approach is to simply plot the distribution of both variables:
-```{r fig.margin=TRUE}
+```{r}
plot(
- density(m1$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
+ density( m1$fitted.values ),
+ xlim = c(-100, max( db_std$trips15 )),
main=''
)
lines(
- density(db_std$trips15),
+ density( db_std$trips15 ),
col='red',
main=''
)
@@ -240,6 +247,7 @@ legend(
lwd=1
)
title(main="Predictive check, point estimates - Baseline model")
+
```
The plot makes pretty obvious that our initial model captures very few aspects of the distribution we want to explain. However, we should not get too attached to this plot just yet. What it is showing is the distribution of predicted *point* estimates from our model. Since our model is not deterministic but inferential, there is a certain degree of uncertainty attached to its predictions, and that is completely absent from this plot.
@@ -257,62 +265,62 @@ Technically speaking, to do this, we need to build a mechanism to obtain a possi
```{r}
generate_draw <- function(m){
# Set up predictors matrix
- x <- model.matrix(m)
+ x <- model.matrix( m )
# Obtain draws of parameters (inferential uncertainty)
- sim_bs <- sim(m, 1)
+ sim_bs <- sim( m, 1)
# Predicted value
mu <- x %*% sim_bs@coef[1, ]
# Draw
- n <- length(mu)
- y_hat <- rnorm(n, mu, sim_bs@sigma[1])
+ n <- length( mu )
+ y_hat <- rnorm( n, mu, sim_bs@sigma[1])
return(y_hat)
}
```
This function takes a model `m` and the set of covariates `x` used and returns a random realization of predictions from the model. To get a sense of how this works, we can get and plot a realization of the model, compared to the expected one and the actual values:
-```{r fig.margin=TRUE}
+```{r}
new_y <- generate_draw(m1)
plot(
- density(m1$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m1$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m1$fitted.values ),
+ xlim = c(-100, max( db_std$trips15 )),
+ ylim = c(0, max(c(
+ max( density( m1$fitted.values)$y ),
+ max( density( db_std$trips15)$y )
)
)
),
- col='black',
- main=''
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
lines(
- density(new_y),
- col='green',
- main=''
+ density( new_y ),
+ col = 'green',
+ main = ''
)
legend(
'topright',
c('Predicted', 'Actual', 'Simulated'),
- col=c('black', 'red', 'green'),
- lwd=1
+ col = c( 'black', 'red', 'green' ),
+ lwd = 1
)
```
Once we have this "draw engine", we can set it to work as many times as we want using a simple `for` loop. In fact, we can directly plot these lines as compared to the expected one and the trip count:
-```{r fig.fullwidth=TRUE}
+```{r}
plot(
- density(m1$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m1$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m1$fitted.values ),
+ xlim = c(-100, max( db_std$trips15 )),
+ ylim = c(0, max(c(
+ max( density( m1$fitted.values)$y ),
+ max( density( db_std$trips15)$y )
)
)
),
@@ -321,45 +329,45 @@ plot(
)
# Loop for realizations
for(i in 1:250){
- tmp_y <- generate_draw(m1)
- lines(density(tmp_y),
- col='grey',
- lwd=0.1
+ tmp_y <- generate_draw( m1 )
+ lines( density( tmp_y ),
+ col = 'grey',
+ lwd = 0.1
)
}
#
lines(
- density(m1$fitted.values),
- col='black',
- main=''
+ density( m1$fitted.values ),
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
legend(
'topright',
c('Actual', 'Predicted', 'Simulated (n=250)'),
- col=c('red', 'black', 'grey'),
- lwd=1
+ col = c('red', 'black', 'grey'),
+ lwd = 1
)
title(main="Predictive check - Baseline model")
```
-The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realizations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model[^05-flows-3] is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution.
+The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realizations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to the combination of two different reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, it may be that the amount of variation we are not able to account for in the model[^05-flows-2] is so large that the degree of uncertainty contained in the error term of the model is very large, hence resulting in such a flat predictive distribution.
-[^05-flows-3]: The $R^2$ of our model is around 2%
+[^05-flows-2]: The $R^2$ of our model is around 2%
-It is important to keep in mind that the issues discussed in the paragraph above relate only to the uncertainty behind our model, not to the point predictions derived from them, which are a mechanistic result of the minimization of the squared residuals and hence are not subject to probability or inference. That allows them in this case to provide a fitted distribution much more accurate apparently (black line above). However, the lesson to take from this model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-4], our capabilities to infer about the more general underlying process are fairly limited.
+It is important to keep in mind that the issues discussed in the paragraph above relate only to the uncertainty behind our model, not to the point predictions derived from them, which are a mechanistic result of the minimization of the squared residuals and hence are not subject to probability or inference. That allows them in this case to provide a fitted distribution much more accurate apparently (black line above). However, the lesson to take from this model is that, even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our capabilities to infer about the more general underlying process are fairly limited.
-[^05-flows-4]: which they are not really, in light of the comparison between the black and red lines.
+[^05-flows-3]: which they are not really, in light of the comparison between the black and red lines.
**Improving the model**
-The bad news from the previous section is that our initial model is not great at explaining bike trips. The good news is there are several ways in which we can improve this. In this section we will cover three main extensions that exemplify three different routes you can take when enriching and augmenting models in general, and spatial interaction ones in particular[^05-flows-5]. These three routes are aligned around the following principles:
+The bad news from the previous section is that our initial model is not great at explaining bike trips. The good news is there are several ways in which we can improve this. In this section we will cover three main extensions that exemplify three different routes you can take when enriching and augmenting models in general, and spatial interaction ones in particular[^05-flows-4]. These three routes are aligned around the following principles:
-[^05-flows-5]: These principles are general and can be applied to pretty much any modeling exercise you run into. The specific approaches we take in this note relate to spatial interaction models
+[^05-flows-4]: These principles are general and can be applied to pretty much any modeling exercise you run into. The specific approaches we take in this note relate to spatial interaction models
1. Use better approximations to model your dependent variable.
2. Recognize the structure of your data.
@@ -367,9 +375,9 @@ The bad news from the previous section is that our initial model is not great at
- **Use better approximations to model your dependent variable**
-Standard OLS regression assumes that the error term and, since the predictors are deterministic, the dependent variable are distributed following a normal (gaussian) distribution. This is usually a good approximation for several phenomena of interest, but maybe not the best one for trips along routes: for one, we know trips cannot be negative, which the normal distribution does not account for[^05-flows-6]; more subtly, their distribution is not really symmetric but skewed with a very long tail on the right. This is common in variables that represent counts and that is why usually it is more appropriate to fit a model that relies on a distribution different from the normal.
+Standard OLS regression assumes that the error term and, since the predictors are deterministic, the dependent variable follow a normal (Gaussian) distribution. This is usually a good approximation for several phenomena of interest, but maybe not the best one for trips along routes: for one, we know trips cannot be negative, which the normal distribution does not account for[^05-flows-5]; more subtly, their distribution is not really symmetric but skewed, with a very long tail on the right. This is common in variables that represent counts, which is why it is usually more appropriate to fit a model that relies on a distribution other than the normal.
-[^05-flows-6]: For an illustration of this, consider the amount of probability mass to the left of zero in the predictive checks above.
+[^05-flows-5]: For an illustration of this, consider the amount of probability mass to the left of zero in the predictive checks above.
One of the most common distributions for these cases is the Poisson, which can be incorporated through a generalized linear model (or GLM). The underlying assumption here is that instead of $T_{ij} \sim N(\mu_{ij}, \sigma)$, our model now follows:
@@ -389,31 +397,31 @@ m2 <- glm(
Now let's see how much better, if any, this approach is. To get a quick overview, we can simply plot the point predictions:
-```{r fig.margin=TRUE}
+```{r}
plot(
- density(m2$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m2$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m2$fitted.values ),
+ xlim = c( -100, max( db_std$trips15 )),
+ ylim = c( 0, max(c(
+ max( density( m2$fitted.values)$y ),
+ max( density(db_std$trips15)$y )
)
)
),
- col='black',
- main=''
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
- col=c('black', 'red'),
- lwd=1
+ col = c('black', 'red'),
+ lwd = 1
)
-title(main="Predictive check, point estimates - Poisson model")
+title(main = "Predictive check, point estimates - Poisson model")
```
To incorporate uncertainty into these predictions, we need to tweak our `generate_draw` function so it accommodates the fact that our model is no longer linear.
@@ -421,16 +429,16 @@ To incorporate uncertainty to these predictions, we need to tweak our `generate_
```{r}
generate_draw_poi <- function(m){
# Set up predictors matrix
- x <- model.matrix(m)
+ x <- model.matrix( m )
# Obtain draws of parameters (inferential uncertainty)
- sim_bs <- sim(m, 1)
+ sim_bs <- sim( m, 1 )
# Predicted value
xb <- x %*% sim_bs@coef[1, ]
#xb <- x %*% m$coefficients
# Transform using the link function
- mu <- exp(xb)
+ mu <- exp( xb )
# Obtain a random realization
- y_hat <- rpois(n=length(mu), lambda=mu)
+ y_hat <- rpois( n = length( mu ), lambda = mu)
return(y_hat)
}
```
@@ -439,44 +447,44 @@ And then we can examine both point predictions an uncertainty around them:
```{r fig.fullwidth=TRUE}
plot(
- density(m2$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m2$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m2$fitted.values ),
+ xlim = c(-100, max( db_std$trips15 )),
+ ylim = c(0, max(c(
+ max( density( m2$fitted.values)$y ),
+ max( density( db_std$trips15)$y )
)
)
),
- col='white',
- main=''
+ col = 'white',
+ main = ''
)
# Loop for realizations
for(i in 1:250){
- tmp_y <- generate_draw_poi(m2)
+ tmp_y <- generate_draw_poi( m2 )
lines(
- density(tmp_y),
- col='grey',
- lwd=0.1
+ density( tmp_y ),
+ col = 'grey',
+ lwd = 0.1
)
}
#
lines(
- density(m2$fitted.values),
- col='black',
- main=''
+ density( m2$fitted.values ),
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
legend(
'topright',
c('Predicted', 'Actual', 'Simulated (n=250)'),
- col=c('black', 'red', 'grey'),
- lwd=1
+ col = c('black', 'red', 'grey'),
+ lwd = 1
)
-title(main="Predictive check - Poisson model")
+title( main = "Predictive check - Poisson model")
```
Voila! Although the curve is still a bit off, centered too much to the right of the actual data, our predictive simulation leaves the fitted values right in the middle. This speaks to a better fit of the model to the distribution that the original data actually follow.
@@ -491,47 +499,47 @@ $$
T_{ij} = X_{ij}\beta + \delta_i + \delta_j + \epsilon_{ij}
$$
-where $\delta_i$ and $\delta_j$ are origin and destination station fixed effects[^05-flows-7], and the rest is as above. This strategy accounts for all the unobserved heterogeneity associated with the location of the station. Technically speaking, we simply need to introduce `orig` and `dest` in the the model:
+where $\delta_i$ and $\delta_j$ are origin and destination station fixed effects[^05-flows-6], and the rest is as above. This strategy accounts for all the unobserved heterogeneity associated with the location of the station. Technically speaking, we simply need to introduce `orig` and `dest` in the model:
-[^05-flows-7]: In this session, $\delta_i$ and $\delta_j$ are estimated as independent variables so their estimates are similar to interpret to those in $\beta$. An alternative approach could be to model them as random effects in a multilevel framework.
+[^05-flows-6]: In this session, $\delta_i$ and $\delta_j$ are estimated as independent variables, so their estimates are interpreted similarly to those in $\beta$. An alternative approach could be to model them as random effects in a multilevel framework.
```{r}
m3 <- glm(
'trips15 ~ straight_dist + total_up + total_down + orig + dest',
- data=db_std,
- family=poisson
+ data = db_std,
+ family = poisson
)
```
-And with our new model, we can have a look at how well it does at predicting the overall number of trips[^05-flows-8]:
+And with our new model, we can have a look at how well it does at predicting the overall number of trips[^05-flows-7]:
-[^05-flows-8]: Although, theoretically, we could also include simulations of the model in the plot to get a better sense of the uncertainty behind our model, in practice this seems troublesome. The problems most likely arise from the fact that many of the origin and destination binary variable coefficients are estimated with a great deal of uncertainty. This causes some of the simulation to generate extreme values that, when passed through the exponential term of the Poisson link function, cause problems. If anything, this is testimony of how a simple fixed effect model can sometimes lack accuracy and generate very uncertain estimates. A potential extension to work around these problems could be to fit a multilevel model with two specific levels beyond the trip-level: one for origin and another one for destination stations.
+[^05-flows-7]: Although, theoretically, we could also include simulations of the model in the plot to get a better sense of the uncertainty behind our model, in practice this seems troublesome. The problems most likely arise from the fact that many of the origin and destination binary variable coefficients are estimated with a great deal of uncertainty. This causes some of the simulations to generate extreme values that, when passed through the exponential term of the Poisson link function, cause problems. If anything, this is testimony to how a simple fixed-effect model can sometimes lack accuracy and generate very uncertain estimates. A potential extension to work around these problems could be to fit a multilevel model with two specific levels beyond the trip level: one for origin and another for destination stations.
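+The multilevel route suggested at the end of the footnote is not developed in this chapter, but a minimal sketch of what it could look like, assuming `db_std` as above and using `lme4` (which appears in later chapters), is:
+
+```{r}
+#| eval: false
+# Sketch only: Poisson model with random intercepts for origin and
+# destination stations instead of a long list of fixed-effect dummies
+library(lme4)
+m_ml <- glmer(
+  trips15 ~ straight_dist + total_up + total_down + (1 | orig) + (1 | dest),
+  data = db_std,
+  family = poisson
+)
+```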
-```{r fig.fullwidth=TRUE}
+```{r}
plot(
- density(m3$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m3$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m3$fitted.values ),
+ xlim = c(-100, max( db_std$trips15 )),
+ ylim = c(0, max(c(
+ max( density(m3$fitted.values)$y ),
+ max( density(db_std$trips15)$y )
)
)
),
- col='black',
- main=''
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
- col=c('black', 'red'),
- lwd=1
+ col = c('black', 'red'),
+ lwd = 1
)
-title(main="Predictive check - Orig/dest FE Poisson model")
+title( main = "Predictive check - Orig/dest FE Poisson model")
```
That looks significantly better, doesn't it? In fact, our model now better accounts for the long tail where a few routes take a lot of trips. This is likely because the distribution of trips is far from random across stations and our origin and destination fixed effects do a decent job of accounting for that structure. However, our model is still notably underpredicting less popular routes and overpredicting routes with an above-average number of trips. Maybe we should think about moving beyond a simple linear model.
@@ -545,38 +553,38 @@ As an exampe of this approach, we can replace the straight distance measurements
```{r}
m4 <- glm(
'trips15 ~ street_dist + total_up + total_down + orig + dest',
- data=db_std,
- family=poisson
+ data = db_std,
+ family = poisson
)
```
And we can similarly get a sense of our predictive fitting with:
-```{r fig.fullwidth=TRUE}
+```{r}
plot(
- density(m4$fitted.values),
- xlim=c(-100, max(db_std$trips15)),
- ylim=c(0, max(c(
- max(density(m4$fitted.values)$y),
- max(density(db_std$trips15)$y)
+ density( m4$fitted.values ),
+ xlim = c(-100, max( db_std$trips15)),
+ ylim = c(0, max(c(
+ max( density(m4$fitted.values)$y ),
+ max( density(db_std$trips15)$y )
)
)
),
- col='black',
- main=''
+ col = 'black',
+ main = ''
)
lines(
- density(db_std$trips15),
- col='red',
- main=''
+ density( db_std$trips15 ),
+ col = 'red',
+ main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
- col=c('black', 'red'),
- lwd=1
+ col = c('black', 'red'),
+ lwd = 1
)
-title(main="Predictive check - Orig/dest FE Poisson model")
+title( main = "Predictive check - Orig/dest FE Poisson model")
```
Hard to tell any noticeable difference, right? To see if there is any, we can have a look at the estimates obtained:
@@ -591,17 +599,17 @@ And compare this to that of the straight distances in the previous model:
summary(m3)$coefficients['straight_dist', ]
```
-As we can see, the differences exist but they are not massive. Let's use this example to learn how to interpret coefficients in a Poisson model[^05-flows-9]. Effectively, these estimates can be understood as multiplicative effects. Since our model fits
+As we can see, the differences exist but they are not massive. Let's use this example to learn how to interpret coefficients in a Poisson model[^05-flows-8]. Effectively, these estimates can be understood as multiplicative effects. Since our model fits
-[^05-flows-9]: See section 6.2 of @gelman2006data for a similar treatment of these.
+[^05-flows-8]: See section 6.2 of @gelman2006data for a similar treatment of these.
$$
T_{ij} \sim Poisson (\exp^{X_{ij}\beta})
$$
-we need to transform $\beta$ through an exponential in order to get a sense of the effect of distance on the number of trips. This means that for the street distance, our original estimate is $\beta_{street} = -0.0996$, but this needs to be translated through the exponential into $e^{-0.0996} = 0.906$. In other words, since distance is expressed in standard deviations[^05-flows-10], we can expect a 10% decrease in the number of trips for an increase of one standard deviation (about 1Km) in the distance between the stations. This can be compared with $e^{-0.0782} = 0.925$ for the straight distances, or a reduction of about 8% the number of trips for every increase of a standard deviation (about 720m).
+we need to transform $\beta$ through an exponential in order to get a sense of the effect of distance on the number of trips. This means that for the street distance, our original estimate is $\beta_{street} = -0.0996$, which needs to be translated through the exponential into $e^{-0.0996} = 0.906$. In other words, since distance is expressed in standard deviations[^05-flows-9], we can expect a 10% decrease in the number of trips for an increase of one standard deviation (about 1 km) in the distance between the stations. This can be compared with $e^{-0.0782} = 0.925$ for the straight distances, or a reduction of about 8% in the number of trips for every increase of one standard deviation (about 720 m).
-[^05-flows-10]: Remember the transformation at the very beginning.
+[^05-flows-9]: Remember the transformation at the very beginning.
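+As a quick check, these multiplicative effects can be recovered directly from the fitted objects (a sketch, assuming `m3` and `m4` as estimated above):
+
+```{r}
+#| eval: false
+# exponentiate the distance coefficients to read them as multiplicative effects
+exp(coef(m4)['street_dist'])    # approx. 0.906, i.e. ~10% fewer trips per sd
+exp(coef(m3)['straight_dist'])  # approx. 0.925, i.e. ~8% fewer trips per sd
+```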
## Predicting flows
@@ -635,9 +643,15 @@ rmse_m4 <- rmse(db_std$trips16, m4$fitted.values)
rmse_m4
```
-That means that, on average, predictions in our best model `m4` are 256 trips off. Is this good? Bad? Worse? It's hard to say but, being practical, what we can say is whether this better than our alternative. Let us have a look at the RMSE of the other models as well as that of simply plugging in last year's counts:[^05-flows-11]
+That means that, on average, predictions from our best model `m4` are 256 trips off. Is this good? Bad? It's hard to say but, being practical, what we can say is whether this is better than our alternatives. Let us have a look at the RMSE of the other models, as well as that of simply plugging in last year's counts:
+
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task**
-[^05-flows-11]: **EXERCISE**: can you create a single plot that displays the distribution of the predicted values of the five different ways to predict trips in 2016 and the actual counts of trips?
+Can you create a single plot that displays the distribution of the predicted values of the five different ways to predict trips in 2016 and the actual counts of trips?
+:::
+:::
```{r}
rmses <- data.frame(
diff --git a/06-spatial-econometrics.qmd b/06-spatial-econometrics.qmd
index 05564b7..8945b92 100644
--- a/06-spatial-econometrics.qmd
+++ b/06-spatial-econometrics.qmd
@@ -5,39 +5,25 @@ This chapter is based on the following references, which are good follow-up's on
- [Chapter 11](https://geographicdata.science/book/notebooks/11_regression.html) of the GDS Book, by @reyABwolf.
- [Session III](http://darribas.org/sdar_mini/notes/Class_03.html) of @arribas2014spatial. Check the "Related readings" section on the session page for more in-depth discussions.
- @anselin2005spatial, freely available to download \[[`pdf`](https://dces.wisc.edu/wp-content/uploads/sites/128/2013/08/W14_Anselin2007.pdf)\].
-- The second part of this tutorial assumes you have reviewed [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course.
+- The second part of this tutorial assumes you have reviewed [the Spatial Weights Section](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course.
## Dependencies
We will rely on the following libraries in this section, all of them included in @sec-dependencies:
-```{r results='hide', message = FALSE}
-# Layout
-library(tufte)
-# For pretty table
-library(knitr)
-# For string parsing
-library(stringr)
+```{r}
+#| warning: false
+#| message: false
+# Data management
+library(tidyverse)
# Spatial Data management
-library(rgdal)
-# Pretty graphics
-library(ggplot2)
-# Pretty maps
-library(ggmap)
+library(sf)
# For all your interpolation needs
library(gstat)
-# For data manipulation
-library(dplyr)
# Spatial regression
library(spdep)
```
-Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with `setwd()`. Please replace in the following line the path to the folder where you have placed this file and where the `house_transactions` folder with the data lives.
-
-```{r}
-setwd('.')
-```
-
## Data
To explore ideas in spatial regression, we will the set of Airbnb properties for San Diego (US), borrowed from the "Geographic Data Science with Python" book (see [here](https://geographicdata.science/book/data/airbnb/regression_cleaning.html) for more info on the dataset source). This covers the point location of properties advertised on the Airbnb website in the San Diego region.
@@ -80,9 +66,15 @@ $$
\log(P) = \alpha + \beta_1 Acc + \beta_2 Bath + \beta_3 Bedr + \beta_4 Beds + \epsilon
$$ where each term can be interpreted in terms of vectors instead of scalars (with the exception of the parameters $(\alpha, \beta_{1, 2, 3, 4})$, which *are* scalars). Note we are using the logarithm of the price, since this allows us to interpret the coefficients as, roughly, the percentage change induced by a unit increase in the corresponding explanatory variable.
-Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $\beta_k$ coefficients in the equation above is as the degree of correlation between the explanatory variable $k$ and the dependent variable, *keeping all the other explanatory variables constant*. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables --also called confounding factors[^06-spatial-econometrics-1]. Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we *control* for those other variables.
+Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $\beta_k$ coefficients in the equation above is as the degree of correlation between the explanatory variable $k$ and the dependent variable, *keeping all the other explanatory variables constant*. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables --also called confounding factors (see the example in the margin). Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we *control* for those other variables.
+
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Example**
-[^06-spatial-econometrics-1]: **EXAMPLE** Assume that new houses tend to be built more often in areas with low deprivation. If that is the case, then $NEW$ and $IMD$ will be correlated with each other (as well as with the price of a house, as we are hypothesizing in this case). If we calculate a simple correlation between $P$ and $IMD$, the coefficient will represent the degree of association between both variables, but it will also include some of the association between $IMD$ and $NEW$. That is, part of the obtained correlation coefficient will be due not to the fact that higher prices tend to be found in areas with low IMD, but to the fact that new houses tend to be more expensive. This is because (in this example) new houses tend to be built in areas with low deprivation and simple bivariate correlation cannot account for that.
+Assume that new houses tend to be built more often in areas with low deprivation. If that is the case, then $NEW$ and $IMD$ will be correlated with each other (as well as with the price of a house, as we are hypothesizing in this case). If we calculate a simple correlation between $P$ and $IMD$, the coefficient will represent the degree of association between both variables, but it will also include some of the association between $IMD$ and $NEW$. That is, part of the obtained correlation coefficient will be due not to the fact that higher prices tend to be found in areas with low IMD, but to the fact that new houses tend to be more expensive. This is because (in this example) new houses tend to be built in areas with low deprivation and simple bivariate correlation cannot account for that.
+:::
+:::
Practically speaking, running linear regressions in `R` is straightforward. For example, to fit the model specified in the equation above, we only need one line of code:
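+A minimal sketch of that call, using the `db` table and the variable names that appear later in the chapter (`log_price`, `accommodates`, `bathrooms`, `bedrooms`, `beds`), would be:
+
+```{r}
+#| eval: false
+# Sketch: baseline (non-spatial) hedonic model of log price
+m1 <- lm(log_price ~ accommodates + bathrooms + bedrooms + beds, data = db)
+```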
@@ -98,9 +90,9 @@ In order to inspect the results of the model, the quickest way is to call `summa
summary(m1)
```
-A full detailed explanation of the output is beyond the scope of the chapter, but we will highlight the relevant bits for our main purpose. This is concentrated on the `Coefficients` section, which gives us the estimates for the $\beta_k$ coefficients in our model. These estimates are the raw equivalent of the correlation coefficient between each explanatory variable and the dependent one, once the "polluting" effect of the other variables included in the model has been accounted for[^06-spatial-econometrics-2]. Results are as expected for the most part: houses tend to be significantly more expensive if they accommodate more people (an extra person increases the price by `r round(m1$coefficients[["accommodates"]], 3) * 100`%, approximately), have more bathrooms (`r round(m1$coefficients[["bathrooms"]], 3) * 100`%), or bedrooms (`r round(m1$coefficients[["bedrooms"]], 3) * 100`%). Perhaps counter intuitively, an extra bed available seems to decrease the price by about `r round(m1$coefficients[["beds"]], 3) * 100`%. However, keep in mind that this is the case, *everything else equal*. Hence, more beds per room and bathroom (ie. a more crowded house) is a bit cheaper.
+A fully detailed explanation of the output is beyond the scope of the chapter, but we will highlight the relevant bits for our main purpose. This is concentrated on the `Coefficients` section, which gives us the estimates for the $\beta_k$ coefficients in our model. These estimates are the raw equivalent of the correlation coefficient between each explanatory variable and the dependent one, once the "polluting" effect of the other variables included in the model has been accounted for[^06-spatial-econometrics-1]. Results are as expected for the most part: houses tend to be significantly more expensive if they accommodate more people (an extra person increases the price by `r round(m1$coefficients[["accommodates"]], 3) * 100`%, approximately), have more bathrooms (`r round(m1$coefficients[["bathrooms"]], 3) * 100`%), or bedrooms (`r round(m1$coefficients[["bedrooms"]], 3) * 100`%). Perhaps counter-intuitively, an extra bed available seems to decrease the price by about `r round(m1$coefficients[["beds"]], 3) * 100`%. However, keep in mind that this is the case *everything else equal*; hence, more beds per room and bathroom (i.e. a more crowded house) is a bit cheaper.
-[^06-spatial-econometrics-2]: Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of *all* potentially confounding factors.
+[^06-spatial-econometrics-1]: Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of *all* potentially confounding factors.
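+Because the dependent variable is in logs, reading `beta * 100` as a percentage is only an approximation; the exact percentage change implied by each coefficient can be computed directly (a sketch, assuming `m1` as fitted above):
+
+```{r}
+#| eval: false
+# exact % change in price for a one-unit increase in each predictor
+# (the intercept is dropped as it is not a slope)
+(exp(coef(m1)[-1]) - 1) * 100
+```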
## Spatial regression: a (very) first dip
@@ -174,9 +166,7 @@ Remember that the interpretation of a $\beta_k$ coefficient is the effect of var
**Spatial regimes**
-At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behaviour. In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically. The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also any other explanatory variable. This implies that the equation we will be estimating is: $$
-\log(P_i) = \alpha_r + \beta_{1r} Acc_i + \beta_{2r} Bath_i + \beta_{3r} Bedr_i + \beta_{4r} Beds_i + \epsilon_i
-$$
+At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behaviour. In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically. The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also any other explanatory variable. This implies that the equation we will be estimating is: $$\log(P_i) = \alpha_r + \beta_{1r} Acc_i + \beta_{2r} Bath_i + \beta_{3r} Bedr_i + \beta_{4r} Beds_i + \epsilon_i$$
where we are not only allowing the constant term to vary by region ($\alpha_r$), but also every other parameter ($\beta_{kr}$).
@@ -191,7 +181,7 @@ Then, the estimation leverages the capabilities in model description of R formul
```{r}
# `:` notation implies interaction variables
m3 <- lm(
- 'log_price ~ 0 + (accommodates + bathrooms + bedrooms + beds):(neighborhood)',
+ 'log_price ~ (one + accommodates + bathrooms + bedrooms + beds):(neighborhood)',
db
)
summary(m3)
@@ -205,9 +195,9 @@ As we have just discussed, SH is about effects of phenomena that are *explicitly
**Spatial Weights**
-There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. Common to all of them however is the way space is formally encapsulated: through *spatial weights matrices (*$W$)[^06-spatial-econometrics-3] These are $NxN$ matrices with zero diagonals and every $w_{ij}$ cell with a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$, otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterium we use to define neighborhood relations. These matrices also tend to be row-standardized so the sum of each row equals to one.
+There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication [see @anselin2003spatial for a good overview]. Common to all of them, however, is the way space is formally encapsulated: through *spatial weights matrices* ($W$)[^06-spatial-econometrics-2]. These are $N \times N$ matrices with zero diagonals, where every $w_{ij}$ cell holds a value that represents the degree of spatial connectivity/interaction between observations $i$ and $j$. If they are not connected at all, $w_{ij}=0$; otherwise $w_{ij}>0$ and we call $i$ and $j$ neighbors. The exact value in the latter case depends on the criterion we use to define neighborhood relations. These matrices also tend to be row-standardized, so that each row sums to one.
-[^06-spatial-econometrics-3]: If you need to refresh your knowledge on spatial weight matrices, check [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course; [Chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) of @reyABwolf; or the [Spatial Weights](https://fcorowe.github.io/intro-gds/03-spatial_weights.html) Section of @rowe2022a.
+[^06-spatial-econometrics-2]: If you need to refresh your knowledge on spatial weight matrices, check [Block E](https://darribas.org/gds_course/content/bE/concepts_E.html) of @darribas_gds_course; [Chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) of @reyABwolf; or the [Spatial Weights](https://fcorowe.github.io/intro-gds/03-spatial_weights.html) Section of @rowe2022a.
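+How $W$ is built in practice depends on the neighborhood criterion chosen. As an illustrative sketch only (the exact choice may differ), a k-nearest-neighbor, row-standardized matrix for the Airbnb points could be created with `spdep`, assuming `db` is an `sf` object of property locations:
+
+```{r}
+#| eval: false
+# Sketch: 10 nearest neighbors, row-standardized weights
+knn_nb <- knn2nb(knearneigh(st_coordinates(db), k = 10))
+w <- nb2listw(knn_nb, style = "W")  # "W" makes each row sum to one
+```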
A related concept to spatial weight matrices is that of *spatial lag*. This is an operator that multiplies a given variable $y$ by a spatial weight matrix:
@@ -307,8 +297,9 @@ The core idea is that once you have estimates for the way in which the explanato
Conceptually, predicting in linear regression models involves using the estimates of the parameters to obtain a value for the dependent variable:
$$
-\bar{\log(P_i)} = \bar{\alpha} + \bar{\beta_1} Acc_i^* + \bar{\beta_2 Bath_i^*} + \bar{\beta_3} Bedr_i^* + \bar{\beta_4} Beds_i^*
-$$ where $\bar{\log{P_i}}$ is our predicted value, and we include the $\bar{}$ sign to note that it is our estimate obtained from fitting the model. We use the $^*$ sign to note that those can be new values for the explanatory variables, not necessarily those used to fit the model.
+\log(\bar{P_i}) = \bar{\alpha} + \bar{\beta_1} Acc_i^* + \bar{\beta_2} Bath_i^* + \bar{\beta_3} Bedr_i^* + \bar{\beta_4} Beds_i^*
+$$
+where $\log(\bar{P_i})$ is our predicted value, and we include the bar sign to note that it is our estimate obtained from fitting the model. We use the $^*$ sign to note that those can be new values for the explanatory variables, not necessarily those used to fit the model.
Technically speaking, prediction in linear models is relatively streamlined in R. Suppose we are given data for a new house which is to be put on the AirBnb platform. We know it accommodates four people, and has two bedrooms, three beds, and one bathroom. We also know that the surrounding properties have, on average, 1.5 bathrooms. Let us record the data first:
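+A minimal sketch of that step for the baseline model (column names as used in `m1`; the average bathrooms of surrounding properties would only enter a spatial specification):
+
+```{r}
+#| eval: false
+# Sketch: characteristics of the new property
+new_house <- data.frame(
+  accommodates = 4,
+  bedrooms = 2,
+  beds = 3,
+  bathrooms = 1
+)
+# point prediction of log price from the baseline model
+predict(m1, newdata = new_house)
+```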
diff --git a/07-multilevel-01.qmd b/07-multilevel-01.qmd
index b9ecb44..872851c 100644
--- a/07-multilevel-01.qmd
+++ b/07-multilevel-01.qmd
@@ -7,31 +7,32 @@ This chapter provides an introduction to multi-level data structures and multi-l
## Dependencies
-This chapter uses the following libraries which are listed in the \[Dependency list\] Section of Chapter 1:
+We will use the following dependencies:
-```{r, message = FALSE}
+```{r}
+#| include: false
+source("./style/data-visualisation_theme.R")
+```
+
+```{r}
+#| message: false
+#| warning: false
# Data manipulation, transformation and visualisation
library(tidyverse)
# Nice tables
library(kableExtra)
-# Simple features (a standardised way to encode vector data ie. points, lines, polygons)
+# Spatial data manipulation
library(sf)
-# Spatial objects conversion
-library(sp)
# Thematic maps
library(tmap)
# Colour palettes
-library(RColorBrewer)
-# More colour palettes
-library(viridis) # nice colour schemes
+library(viridis)
# Fitting multilevel models
library(lme4)
# Tools for extracting information generated by lme4
library(merTools)
# Exportable regression tables
library(jtools)
-library(stargazer)
-library(sjPlot)
```
## Data
@@ -42,9 +43,7 @@ For this chapter, we will data for Liverpool from England's 2011 Census. The ori
Let us read the data:
-```{r, results = 'hide'}
-# clean workspace
-rm(list=ls())
+```{r}
# read data
oa_shp <- st_read("data/mlm/OA.shp")
```
@@ -52,6 +51,7 @@ oa_shp <- st_read("data/mlm/OA.shp")
We can now attach and visualise the structure of the data.
```{r}
+#| message: false
# attach data frame
attach(oa_shp)
@@ -112,8 +112,8 @@ We first need to want to understand our dependent variable: its density ditribut
```{r}
ggplot(data = oa_shp) +
-geom_density(alpha=0.8, colour="black", fill="lightblue", aes(x = unemp)) +
- theme_classic()
+ geom_density(alpha=0.8, colour="black", fill="lightblue", aes(x = unemp)) +
+ theme_plot_tufte()
```
```{r}
@@ -186,11 +186,17 @@ model2 <- lmer(eq2, data = oa_shp)
summary(model2)
```
-We can estimate a three-level model by adding `(1 | msoa_cd)` to allow the intercept to also vary by MSOAs and account for the nesting structure of LSOAs within MSOAs.
+We can estimate a three-level model by replacing `(1 | lsoa_cd)` with `(1 | msoa_cd/lsoa_cd)` to allow the intercept to also vary by MSOA and to account for the nesting of LSOAs within MSOAs. In multilevel modelling, these models are formally known as *nested random effects* models, and they differ from a separate class known as *crossed random effects* models.
+
+::: column-margin
+::: callout-note
+A crossed random effects model in our example would be expressed as follows:
+
+`unemp ~ 1 + (1 | lsoa_cd) + (1 | msoa_cd)`
+:::
+:::
```{r}
# specify a model equation
-eq3 <- unemp ~ 1 + (1 | lsoa_cd) + (1 | msoa_cd)
+eq3 <- unemp ~ 1 + (1 | msoa_cd/lsoa_cd)
model3 <- lmer(eq3, data = oa_shp)
# estimates
@@ -230,21 +236,27 @@ We start by examining the fixed effects or estimated model averaging over LSOAs
fixef(model3)
```
-Th estimated intercept indicates that the overall mean taken across LSOAs and MSOAs is estimated as `0.115288` and is statistically significant at `5%` significance.
+The estimated intercept indicates that the overall mean taken across LSOAs and MSOAs is estimated as `0.1152881`.
> Random effects
-The set of random effects contains three estimates of variance and standard deviation and refer to the variance components discussed above. The `lsoa_cd`, `msoa_cd` and `Residual` estimates indicate that the extent of estimated LSOA-, MSOA- and individual-level variance is `0.0007603`, `0.0020735` and `0.0025723`, respectively.
+The set of random effects contains three estimates of variance and standard deviation, which refer to the variance components discussed above. The `lsoa_cd:msoa_cd`, `msoa_cd` and `Residual` estimates indicate that the extent of estimated LSOA-, MSOA- and individual-level variance is `0.0007603`, `0.0020735` and `0.0025723`, respectively.
### Variance Partition Coefficient (VPC)
-The purpose of multilevel models is to partition variance in the outcome between the different groupings in the data. We thus often want to know the percentage of variation in the dependent variable accounted by differences across groups i.e. what proportion of the total variance is attributable to variation within-groups, or how much is found between-groups. The statistic to obtain this is termed the variance partition coefficient (VPC), or intraclass correlation.[^07-multilevel-01-2] For our case, the VPC at the LSOA level indicates that 14% of the variation in percentage of unemployed resident population across OAs can be explained by differences across LSOAs. What is the VPC at the MSOA level?
+The purpose of multilevel models is to partition variance in the outcome between the different groupings in the data. We thus often want to know the percentage of variation in the dependent variable accounted for by differences across groups, i.e. what proportion of the total variance is attributable to variation within groups, and how much is found between groups. The statistic to obtain this is termed the variance partition coefficient (VPC), or intraclass correlation.[^07-multilevel-01-2] For our case, the VPC at the MSOA level indicates that 38% of the variation in the percentage of unemployed resident population across OAs can be explained by differences across MSOAs.
[^07-multilevel-01-2]: The VPC is equal to the intra-class correlation coefficient which is the correlation between the observations of the dependent variable selected randomly from the same group. For instance, if the VPC is 0.1, we would say that 10% of the variation is between groups and 90% within. The correlation between randomly chosen pairs of observations belonging to the same group is 0.1.
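+Writing the three variance components reported above as $\sigma^2_{lsoa}$, $\sigma^2_{msoa}$ and $\sigma^2_{e}$, the MSOA-level figure quoted above is:
+
+$$
+VPC_{msoa} = \frac{\sigma^2_{msoa}}{\sigma^2_{lsoa} + \sigma^2_{msoa} + \sigma^2_{e}} = \frac{0.0020735}{0.0007603 + 0.0020735 + 0.0025723} \approx 0.38
+$$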
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task** What is the VPC at the LSOA level?
+:::
+:::
+
```{r}
-vpc_lsoa <- 0.0007603 / (0.0007603 + 0.0020735 + 0.0025723)
-vpc_lsoa * 100
+vpc_msoa <- 0.0020735 / (0.0007603 + 0.0020735 + 0.0025723)
+vpc_msoa * 100
```
You can also obtain the VPC by executing:
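+For example, a sketch that reads it straight off the model's estimated variance components with `lme4`'s `VarCorr()`:
+
+```{r}
+#| eval: false
+# share of the total variance attributable to each level
+vc <- as.data.frame(VarCorr(model3))
+vc$vcov / sum(vc$vcov)
+```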
@@ -274,7 +286,13 @@ coef_m3 <- coef(model3)
head(coef_m3$lsoa_cd,5)
```
-The results indicate that the estimated regression line is $y = 0.09915456$ for LSOA `E01006512`; $y = 0.09889615$ for LSOA `E01006513` and so forth. Try getting the estimated model within each MSOA.
+The results indicate that the estimated regression line is $y = 0.09915456$ for LSOA `E01006512` within MSOA `E02001377`; $y = 0.09889615$ for LSOA `E01006513` within MSOA `E02006932` and so forth.
+
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task** Try getting the estimated model within each MSOA.
+:::
+:::
*Random effects*
@@ -291,7 +309,8 @@ We can also obtain group-level errors (*random effects*) by using a simulation a
```{r}
# obtain estimates
-merTools::REsim(model3) %>% head(10)
+merTools::REsim(model3) %>%
+ head(10)
```
The results contain the estimated mean, median and standard deviation for the intercept within each group (e.g. LSOA). The mean estimates are similar to those obtained from `ranef` with some small differences due to rounding.
@@ -303,7 +322,13 @@ To gain an undertanding of the general pattern of the *random effects*, we can u
plotREsim(REsim(model3))
```
-Focusing on the plot on the right, we see MSOAs whose mean proportion of unemployed population, assuming no explanatory variables, is lower than average. On the right-hand side of the plot, you will see MSOAs whose mean proportion is higher than average. The MSOAs with the smallest residuals include the districts of Allerton and Hunt Cross, Church, Childwall, Wavertree and Woolton. What districts do we have at the other extreme?
+Focusing on the right-hand panel, we first see MSOAs whose mean proportion of unemployed population, assuming no explanatory variables, is lower than average. These are the dots below the horizontal red line. Towards the right-hand side of the panel, you will see MSOAs whose mean proportion is higher than average. The MSOAs with the smallest residuals include the districts of Allerton and Hunt Cross, Church, Childwall, Wavertree and Woolton.
+
+::: column-margin
+::: {.callout-tip appearance="simple" icon="false"}
+**Task** What districts do we have at the other extreme? Have a go at identifying them.
+:::
+:::
```{r}
re <- REsim(model3)
@@ -382,10 +407,10 @@ ranef_m4 <- ranef(model4)
head(ranef_m4$msoa_cd, 5)
```
-yields an estimated intercept for MSOA `E02001347` which is `0.017474815` lower than the average with a regression line: `(0.04681959 - 0.017474815) + 0.29588110x` `=` `0.02934478 + 0.29588110x`. You can confirm this by looking at the estimated model within each MSOA by executing (remove the `#` sign):
+yields an estimated intercept for MSOA `E02001347` which is `0.017474815` lower than the average, with a regression line: `(0.04681959 - 0.017474815) + 0.29588110x` `=` `0.02934478 + 0.29588110x`. You can confirm this by looking at the first rows of the estimated models within each MSOA, which you can obtain by executing:
```{r}
-#coef(model4)
+coef(model4) %>% head(n = 5 )
```
*Fixed effect correlations*
@@ -407,6 +432,8 @@ $$\beta_{0j} = \beta_{0} + \gamma_{1}m_{j} + u_{0j}$$
where $x_{ij}$ is the OA-level proportion of population suffering long-term illness and $m_{j}$ is the MSOA-level proportion of male population. We first need to create this group-level predictor:
```{r}
+#| message: false
+#| warning: false
# detach OA shp and attach MSOA shp
detach(oa_shp)
attach(msoa_shp)
@@ -430,6 +457,7 @@ head(oa_shp[1:10, c("msoa_cd", "oa_cd", "unemp", "pr_male")])
We can now estimate our model:
```{r}
+#| message: false
detach(msoa_shp)
attach(oa_shp)
diff --git a/08-multilevel-02.qmd b/08-multilevel-02.qmd
index a60bf0c..36d0e96 100644
--- a/08-multilevel-02.qmd
+++ b/08-multilevel-02.qmd
@@ -24,17 +24,13 @@ library(sp)
# Thematic maps
library(tmap)
# Colour palettes
-library(RColorBrewer)
-# More colour palettes
-library(viridis) # nice colour schemes
+library(viridis)
# Fitting multilevel models
library(lme4)
# Tools for extracting information generated by lme4
library(merTools)
# Exportable regression tables
library(jtools)
-library(stargazer)
-library(sjPlot)
```
## Data
diff --git a/09-gwr.qmd b/09-gwr.qmd
index 907a3aa..f0a4d06 100644
--- a/09-gwr.qmd
+++ b/09-gwr.qmd
@@ -33,8 +33,6 @@ library(spgwr)
library(corrplot)
# Exportable regression tables
library(jtools)
-library(stargazer)
-library(sjPlot)
# Assess multicollinearity
library(car)
```
diff --git a/_quarto.yml b/_quarto.yml
index 93f0df9..00ba1ea 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -33,16 +33,28 @@ bibliography: references.bib
format:
html:
- toc: true
- theme: [default, theme.scss]
- css: style.css
- code-copy: true
+ theme:
+ light: [default, style/custom.scss]
+ fontsize: "16px"
+ linestretch: 1.6
+ mainfont: "Roboto"
+ monofont: "Fira Mono"
+ smooth-scroll: true
+ toc-depth: 3
grid:
- sidebar-width: 200px
+ sidebar-width: 300px
body-width: 900px
margin-width: 300px
- pdf:
- documentclass: scrreprt
+ code-link: true
+ code-tools:
+ toggle: true
+ code-fold: false
+ highlight-style: printing
+ code-block-bg: true
+ code-overflow: wrap
+ reference-location: margin
+ mermaid:
+ theme: neutral
editor: visual
diff --git a/docs/01-overview.html b/docs/01-overview.html
index 8ee24ac..689e258 100644
--- a/docs/01-overview.html
+++ b/docs/01-overview.html
@@ -1,12 +1,9 @@