-
Notifications
You must be signed in to change notification settings - Fork 19
/
Copy path03-data-wrangling.qmd
709 lines (481 loc) · 29.8 KB
/
03-data-wrangling.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
# Data Wrangling {#sec-chp3}
In this chapter, we will cover the fundamentals of the concepts and functions that you will need to know to navigate this book. We will introduce key concepts and functions relating to what computational notebooks are and how they work. We will also cover basic R functions and data types, including the use of factors. Additionally, we will offer a basic understanding of the manipulation and mapping of spatial data frames using commonly used libraries such as `tidyverse`, `sf`, `ggplot` and `tmap`.
If you are already familiar with R, R computational notebooks and data types, you may want to jump to Section [Read Data](#sec_readdata) and start from there. This section describes how to read and manipulate data using `sf` and `tidyverse` functions, including `mutate()`, `%>%` (known as pipe operator), `select()`, `filter()` and specific packages and functions how to manipulate spatial data.
The chapter is based on:
- @grolemund_wickham_2019_book, this book illustrates key libraries, including tidyverse, and functions for data manipulation in R
- @Xie_et_al_2019_book, excellent introduction to R markdown!
- @envs450_2018, some examples from the first lecture of ENVS450 are used to explain the various types of random variables.
- @lovelace2019, a really good book on handling spatial data and historical background of the evolution of R packages for spatial data analysis.
## Dependencies
This chapter uses the libraries below. Ensure they are installed on your machine[^03-data-wrangling-1] before you progress.
[^03-data-wrangling-1]: You can install package `mypackage` by running the command `install.packages("mypackage")` on the R prompt or through the `Tools --> Install Packages...` menu in RStudio.
```{r}
#| include = FALSE
rm(list=ls())
```
```{r}
#| warning = FALSE
# data manipulation, transformation and visualisation
library(tidyverse)
# nice tables
library(kableExtra)
# spatial data manipulation
library(sf)
# thematic mapping
library(tmap)
# colour palettes
library(RColorBrewer)
library(viridis)
```
## Introducing R
R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It has gained widespread use in academia and industry. R offers a wider array of functionality than a traditional statistics package, such as SPSS and is composed of core (base) functionality, and is expandable through libraries hosted on [The Comprehensive R Archive Network (CRAN)](https://cran.r-project.org). CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.
Commands are sent to R using either the terminal / command line or the R Console which is installed with R on either Windows or OS X. On Linux, there is no equivalent of the console, however, third party solutions exist. On your own machine, R can be installed from [here](https://www.r-project.org/).
Normally RStudio is used to implement R coding. RStudio is an integrated development environment (IDE) for R and provides a more user-friendly front-end to R than the front-end provided with R.
To run R or RStudio, just double click on the R or RStudio icon. Throughout this module, we will be using RStudio:
![Fig. 1. RStudio features.](figs/ch2/rstudio_features.png)
If you would like to know more about the various features of RStudio, watch this [video](https://rstudio.com/products/rstudio/)
## Setting the working directory
Before we start any analysis, ensure to set the path to the directory where we are working. We can easily do that with `setwd()`. Please replace in the following line the path to the folder where you have placed this file -and where the `data` folder lives.
```{r}
#| eval: false
setwd('../data/sar.csv')
setwd('.')
```
::: column-margin
::: callout-note
It is good practice to not include spaces when naming folders and files. Use *underscores* or *dots*.
:::
:::
You can check your current working directory by typing:
```{r}
getwd()
```
## R Scripts and Computational Notebooks
An *R script* is a series of commands that you can execute at one time and help you save time. So you do not repeat the same steps every time you want to execute the same process with different datasets. An R script is just a plain text file with R commands in it.
::: column-margin
::: callout-note
To get familiar with good practices in writing your code in R, we recommend the [Chapter Workflow: basics](https://r4ds.hadley.nz/workflow-basics.html) and [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts) from the R in Data Science book by @wickham2023r.
:::
:::
https://r4ds.hadley.nz/workflow-basics.html
To create an R script in RStudio, you need to
- Open a new script file: *File* \> *New File* \> *R Script*
- Write some code on your new script window by typing eg. `mtcars`
- Run the script. Click anywhere on the line of code, then hit *Ctrl + Enter* (Windows) or *Cmd + Enter* (Mac) to run the command or select the code chunk and click *run* on the right-top corner of your script window. If do that, you should get:
```{r}
mtcars
```
- Save the script: *File* \> *Save As*, select your required destination folder, and enter any filename that you like, provided that it ends with the file extension *.R*
An *R Notebook* or a *Quarto Document* are a Markdown options with descriptive text and code chunks that can be executed independently and interactively, with output visible immediately beneath a code chunk - see @Xie_et_al_2019_book. A *Quarto Document* is an improved version of the original *R Notebook*. *Quarto Document* requires a package called [Quarto](https://quarto.org). Quarto does not have a dependency or requirement for R. Quarto is multilingual, beginning with R, Python, Javascript, and Julia. The concept is that Quarto will work even for languages that do not yet exist. This book was original written in *R Notebook* but later transitioned into *Quarto Documents*.
To create an R Notebook, you need to:
- Open a new script file: *File* \> *New File* \> *R Notebook*
![Fig. 2. YAML metadata for notebooks.](figs/ch2/rnotebook_yaml.png)
- Insert code chunks, either:
1) use the *Insert* command on the editor toolbar;
2) use the keyboard shortcut *Ctrl + Alt + I* or *Cmd + Option + I* (Mac); or,
3) type the chunk delimiters ```` ```{r} ```` and ```` ``` ````
In a chunk code you can produce text output, tables, graphics and write code! You can control these outputs via chunk options which are provided inside the curly brackets e.g.:
```{r, include=FALSE}
hist(mtcars$mpg)
```
![Fig. 3. Code chunk example. Details on the various options: https://rmarkdown.rstudio.com/lesson-3.html](figs/ch2/codechunk.png)
- Execute code: hit *"Run Current Chunk"*, *Ctrl + Shift + Enter* or *Cmd + Shift + Enter* (Mac)
- Save an R notebook: *File* \> *Save As*. A notebook has a `*.Rmd` extension and when it is saved a `*.nb.html` file is automatically created. The latter is a self-contained HTML file which contains both a rendered copy of the notebook with all current chunk outputs and a copy of the \*.Rmd file itself.
Rstudio also offers a *Preview* option on the toolbar which can be used to create pdf, html and word versions of the notebook. To do this, choose from the drop-down list menu `knit to ...`
To create a *Quarto Document*, you need to:
- Open a new script file: *File* \> *New File* \> *Quarto Document*
Quarto Documents work in the same way as R Notebooks with small variations. You find a comprehensive guide on the [Quarto website](https://quarto.org/docs/guide/).
## Getting Help
You can use `help` or `?` to ask for details for a specific function:
```{r}
#| eval: false
help(sqrt) #or
?sqrt
```
And using `example` provides examples for said function:
```{r}
example(sqrt)
```
## Variables and objects
An *object* is a data structure having attributes and methods. In fact, everything in R is an object!
A *variable* is a type of data object. Data objects also include list, vector, matrices and text.
- Creating a data object
In R a variable can be created by using the symbol `<-` to assign a value to a variable name. The variable name is entered on the left `<-` and the value on the right. Note: Data objects can be given any name, provided that they start with a letter of the alphabet, and include only letters of the alphabet, numbers and the characters `.` and `_`. Hence AgeGroup, Age_Group and Age.Group are all valid names for an R data object. Note also that R is case-sensitive, to agegroup and AgeGroup would be treated as different data objects.
To save the value *28* to a variable (data object) labelled *age*, run the code:
```{r}
age <- 28
```
- Inspecting a data object
To inspect the contents of the data object *age* run the following line of code:
```{r}
age
```
Find out what kind (class) of data object *age* is using:
```{r}
class(age)
```
Inspect the structure of the *age* data object:
```{r}
str(age)
```
- The *vector* data object
What if we have more than one response? We can use the `c( )` function to combine multiple values into one data vector object:
```{r}
age <- c(28, 36, 25, 24, 32)
age
class(age) #Still numeric..
str(age) #..but now a vector (set) of 5 separate values
```
Note that on each line in the code above any text following the `#` character is ignored by R when executing the code. Instead, text following a `#` can be used to add comments to the code to make clear what the code is doing. Two marks of good code are a clear layout and clear commentary on the code.
### Basic Data Types
There are a number of data types. Four are the most common. In R, **numeric** is the default type for numbers. It stores all numbers as floating-point numbers (numbers with decimals). This is because most statistical calculations deal with numbers with up to two decimals.
- Numeric
```{r}
num <- 4.5 # Decimal values
class(num)
```
- Integer
```{r}
int <- as.integer(4) # Natural numbers. Note integers are also numerics.
class(int)
```
- Character
```{r}
cha <- "are you enjoying this?" # text or string. You can also type `as.character("are you enjoying this?")`
class(cha)
```
- Logical
```{r}
log <- 2 < 1 # assigns TRUE or FALSE. In this case, FALSE as 2 is greater than 1
log
class(log)
```
### Random Variables
In statistics, we differentiate between data to capture:
- *Qualitative attributes* categorise objects eg.gender, marital status. To measure these attributes, we use *Categorical* data which can be divided into:
- *Nominal* data in categories that have no inherent order eg. gender
- *Ordinal* data in categories that have an inherent order eg. income bands
- *Quantitative attributes*:
- *Discrete* data: count objects of a certain category eg. number of kids, cars
- *Continuous* data: precise numeric measures eg. weight, income, length.
In R these three types of random variables are represented by the following types of R data object:
```{r, echo=FALSE}
text_tbl <- data.frame(
variables = c("nominal", "ordinal", "discrete", "continuous"),
objects = c("factor", "ordered factor", "numeric", "numeric")
)
kable(text_tbl)
```
We have already encountered the R data type *numeric*. The next section introduces the *factor* data type.
#### Factor
**What is a factor?**
A factor variable assigns a numeric code to each possible category (*level*) in a variable. Behind the scenes, R stores the variable using these numeric codes to save space and speed up computing. For example, compare the size of a list of 10,000 *males* and *females* to a list of 10,000 1s and 0s. At the same time R also saves the category names associated with each numeric code (level). These are used for display purposes.
For example, the variable *gender*, converted to a factor, would be stored as a series of 1s and 2s, where 1 = female and 2 = male; but would be displayed in all outputs using their category labels of *female* and *male*.
**Creating a factor**
To convert a numeric or character vector into a factor use the `factor( )` function. For instance:
```{r}
gender <- c("female","male","male","female","female") # create a gender variable
gender <- factor(gender) # replace character vector with a factor version
gender
class(gender)
str(gender)
```
Now *gender* is a factor and is stored as a series of 1s and 2s, with 1s representing females and 2s representing males. The function `levels( )` lists the levels (categories) associated with a given factor variable:
```{r}
levels(gender)
```
The categories are reported in the order that they have been numbered (starting from 1). Hence from the output we can infer that females are coded as 1, and males as 2.
## Data Frames
R stores different types of data using different types of data structure. Data are normally stored as a *data.frame*. A data frames contain one row per observation (e.g. wards) and one column per attribute (eg. population and health).
We create three variables wards, population (`pop`) and people with good health (`ghealth`). We use 2011 census data counts for total population and good health for wards in Liverpool.
```{r,results='hide'}
wards <- c("Allerton and Hunts Cross","Anfield","Belle Vale","Central","Childwall","Church","Clubmoor","County","Cressington","Croxteth","Everton","Fazakerley","Greenbank","Kensington and Fairfield","Kirkdale","Knotty Ash","Mossley Hill","Norris Green","Old Swan","Picton","Princes Park","Riverside","St Michael's","Speke-Garston","Tuebrook and Stoneycroft","Warbreck","Wavertree","West Derby","Woolton","Yew Tree")
pop <- c(14853,14510,15004,20340,13908,13974,15272,14045,14503,
14561,14782,16786,16132,15377,16115,13312,13816,15047,
16461,17009,17104,18422,12991,20300,16489,16481,14772,
14382,12921,16746)
ghealth <- c(7274,6124,6129,11925,7219,7461,6403,5930,7094,6992,
5517,7879,8990,6495,6662,5981,7322,6529,7192,7953,
7636,9001,6450,8973,7302,7521,7268,7013,6025,7717)
```
Note that `pop` and `ghealth` and `wards` contains characters.
### Creating A Data Frame
We can create a data frame and examine its structure:
```{r}
df <- data.frame(wards, pop, ghealth)
df # or use view(data)
str(df) # or use glimpse(data)
```
### Referencing Data Frames
To refer to particular parts of a dataframe - say, a particular column (an area attribute), or a subset of respondents. Hence it is worth spending some time understanding how to reference dataframes.
The relevant R function, `[ ]`, has the format `[row,col]` or, more generally, `[set of rows, set of cols]`.
Run the following commands to get a feel of how to extract different slices of the data:
```{r}
#| eval: false
df # whole data.frame
df[1, 1] # contents of first row and column
df[2, 2:3] # contents of the second row, second and third columns
df[1, ] # first row, ALL columns [the default if no columns specified]
df[ ,1:2] # ALL rows; first and second columns
df[c(1,3,5), ] # rows 1,3,5; ALL columns
df[ , 2] # ALL rows; second column (by default results containing only
#one column are converted back into a vector)
df[ , 2, drop=FALSE] # ALL rows; second column (returned as a data.frame)
```
In the above, note that we have used two other R functions:
- `1:3` The colon operator tells R to produce a list of numbers including the named start and end points.
- `c(1,3,5)` Tells R to combine the contents within the brackets into one list of objects
Run both of these fuctions on their own to get a better understanding of what they do.
Three other methods for referencing the contents of a data.frame make direct use of the variable names within the data.frame, which tends to make for easier to read/understand code:
```{r}
#| eval: false
df[,"pop"] # variable name in quotes inside the square brackets
df$pop # variable name prefixed with $ and appended to the data.frame name
# or you can use attach
attach(df)
pop # but be careful if you already have an age variable in your local workspace
```
Want to check the variables available, use the `names( )`:
```{r}
names(df)
```
## Read Data {#sec_readdata}
Ensure your memory is clear
```{r}
rm(list=ls()) # rm for targeted deletion / ls for listing all existing objects
```
::: column-margin
::: callout-note
When opening a file, ensure the correct directory set up pointing to your data. It may differ from your existing working directory.
:::
:::
There are many commands to read / load data onto R. The command to use will depend upon the format they have been saved. Normally they are saved in *csv* format from Excel or other software packages. So we use either:
- `df <- read.table("path/file_name.csv", header = FALSE, sep =",")`
- `df <- read("path/file_name.csv", header = FALSE)`
- `df <- read.csv2("path/file_name.csv", header = FALSE)`
To read files in other formats, refer to this useful [DataCamp tutorial](https://www.datacamp.com/community/tutorials/r-data-import-tutorial?utm_source=adwords_ppc&utm_campaignid=1655852085&utm_adgroupid=61045434382&utm_device=c&utm_keyword=%2Bread%20%2Bdata%20%2Br&utm_matchtype=b&utm_network=g&utm_adpostion=1t1&utm_creative=318880582308&utm_targetid=kwd-309793905111&utm_loc_interest_ms=&utm_loc_physical_ms=9046551&gclid=CjwKCAiA3uDwBRBFEiwA1VsajJO0QK0Jg7VipIt8_t82qQrnUliI0syAlh8CIxnE76Rb0kh3FbiehxoCzCgQAvD_BwE#csv)
```{r}
census <- read.csv("data/census/census_data.csv")
head(census)
```
### Quickly inspect the data
Using the following questions to lead the inspection: What class? What R data types? What data types?
```{r}
#| eval: false
# 1
class(census)
# 2 & 3
str(census)
```
Just interested in the variable names:
```{r}
names(census)
```
or want to view the data:
`View(census)`
## Manipulation Data
### Adding New Variables
Usually you want to add / create new variables to your data frame using existing variables eg. computing percentages by dividing two variables. There are many ways in which you can do this i.e. referecing a data frame as we have done above, or using `$` (e.g. `census$pop`). For this module, we'll use `tidyverse`:
```{r}
census <- census %>%
mutate( per_ghealth = ghealth / pop )
```
Note we used a *pipe operator* `%>%`, which helps make the code more efficient and readable - more details, see @grolemund_wickham_2019_book. When using the pipe operator, recall to first indicate the data frame before `%>%`.
Note also the use a variable name before the `=` sign in brackets to indicate the name of the new variable after `mutate`.
### Selecting Variables
Usually you want to select a subset of variables for your analysis as storing to large data sets in your R memory can reduce the processing speed of your machine. A selection of data can be achieved by using the `select` function:
```{r}
ndf <- census %>%
select( ward, pop16_74, per_ghealth )
```
Again first indicate the data frame and then the variable you want to select to build a new data frame. Note the code chunk above has created a new data frame called `ndf`. Explore it.
### Filtering Data
You may also want to filter values based on defined conditions. You may want to filter observations greater than a certain threshold or only areas within a certain region. For example, you may want to select areas with a percentage of good health population over 50%:
```{r}
ndf2 <- census %>%
filter( per_ghealth < 0.5 )
```
You can use more than one variables to set conditions. Use "`,`" to add a condition.
### Joining Data Drames
When working with spatial data, we often need to join data. To this end, you need a common unique `id variable`. Let's say, we want to add a data frame containing census data on households for Liverpool, and join the new attributes to one of the existing data frames in the workspace. First we will read the data frame we want to join (i.e. `census_data2.csv`).
```{r}
# read data
census2 <- read.csv("data/census/census_data2.csv")
# visualise data structure
str(census2)
```
The variable `geo_code` in this data frame corresponds to the `code` in the existing data frame and they are unique so they can be automatically matched by using the `merge()` function. The `merge()` function uses two arguments: `x` and `y`. The former refers to data frame 1 and the latter to data frame 2. Both of these two data frames must have a `id` variable containing the same information. Note they can have different names. Another key argument to include is `all.x=TRUE` which tells the function to keep all the records in `x`, but only those in `y` that match in case there are discrepancies in the `id` variable.
```{r}
# join data frames
join_dfs <- merge( census, # df1
census2, # df2
by.x="code", by.y="geo_code", # common ids
all.x = TRUE)
# check data
head(join_dfs)
```
### Saving Data
It may also be convenient to save your R projects. They contains all the objects that you have created in your workspace by using the `save.image( )` function:
```{r}
#| eval: false
save.image("week1_envs453.RData")
```
This creates a file labelled "week1_envs453.RData" in your working directory. You can load this at a later stage using the `load( )` function.
```{r}
#| eval: false
load("week1_envs453.RData")
```
Alternatively you can save / export your data into a `csv` file. The first argument in the function is the object name, and the second: the name of the csv we want to create.
```{r}
#| eval: false
write.csv(join_dfs, "join_censusdfs.csv")
```
## Using Spatial Data Frames
A core area of the module is learning to work with spatial data in R. R has various purposedly designed `packages` for manipulation of spatial data and spatial analysis techniques. Various packages exist in CRAN, including `sf` [@sf2018; @R-sf], `stars` [@R-stars], `terra`, `s2` [@R-s2], `lwgeom` [@R-lwgeom], `gstat` [@pebesma2004; @R-gstat], `spdep` [@R-spdep], `spatialreg` [@R-spatialreg], `spatstat` [@baddeley2015spatial; @R-spatstat], `tmap` [@tmap2018; @R-tmap], `mapview` [@R-mapview] and more. A key package is this ecosystem is `sf` [@pebesma2023spatial]. R package `sf` provides a table format for simple features, where feature geometries are stored in a list-column. It appeared in 2016 and was developed to move spatial data analysis in R closer to standards-based approaches seen in the industry and open source projects, to build upon more modern versions of open source geospatial software stack and allow for integration of R spatial software with the `tidyverse` [@tidyverse2019], particularly `ggplot2`, `dplyr`, and `tidyr`. Hence, this book relies heavely on `sf` for the manipulation and analysis of the data.
::: column-margin
::: callout-note
@lovelace2024 provide a helpful overview and evolution of R spatial package ecosystem.
:::
:::
To read our spatial data, we use the `st_read` function. We read a shapefile containing data at Output Area (OA) level for Liverpool. These data illustrates the hierarchical structure of spatial data.
### Read Spatial Data
```{r}
oa_shp <- st_read("data/census/Liverpool_OA.shp")
```
Examine the input data. A spatial data frame stores a range of attributes derived from a shapefile including the **geometry** of features (e.g. polygon shape and location), **attributes** for each feature (stored in the .dbf), [projection](https://en.wikipedia.org/wiki/Map_projection) and coordinates of the shapefile's bounding box - for details, execute:
```{r}
#| eval: false
?st_read
```
You can employ the usual functions to visualise the content of the created data frame:
```{r}
# visualise variable names
names(oa_shp)
# data structure
str(oa_shp)
# see first few observations
head(oa_shp)
```
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
- What are the geographical hierarchy in these data?\
- What is the smallest geography?\
- What is the largest geography?\
:::
:::
### Basic Mapping
Many functions exist in CRAN for creating maps:
- `plot` to create static maps
- `tmap` to create static and interactive maps
- `leaflet` to create interactive maps
- `mapview` to create interactive maps
- `ggplot2` to create data visualisations, including static maps
- `shiny` to create web applications, including maps
In this book, we will make use of `plot`, `tmap` and `ggplot`. Normally you use `plot` to get a quick inspection of the data and `tmap` and `ggplot` to get publication quality data visualisations. First `plot` is used to map the spatial distribution of non-British-born population in Liverpool. First we only map the geometries on the right.
**Using `plot`**
We can use the base `plot` function to display the boundaries of OAs in Liverpool.
```{r}
# mapping geometry
plot(st_geometry(oa_shp))
```
To visualise a column in the spatial data frame, we can run:
```{r}
# map attributes, adding intervals
plot(oa_shp["Ethnic"], # variable to visualise
key.pos = 4,
axes = TRUE,
key.width = lcm(1.3),
key.length = 1.,
breaks = "jenks", # algorithm to categorise the data
lwd = 0.1,
border = 'grey') # boundary colour
```
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
What is the key pattern emerging from this map?
:::
:::
Let us now explore `ggplot` or `tmap`.
**Using `ggplot`**
We can visualise spatial data frames using `ggplot` fairly easily. `ggplot` is a generic sets of functions which was not specifically designed for spatial mapping, but it is fairly flexible that allows producing great spatial data visualisations.
Following the grammar of `ggplot`, we plot spatial data drawing layers. `ggplot` has a basic structure of three components:
- The data i.e. `ggplot( data = *data frame*)`.
- Geometries i.e. `geom_xxx( )`.
- Aesthetic mapping i.e. `aes(x=*variable*, y=*variable*)`
We can put these three components together using `+`. This is similar to the ways we apply the pipe operator in for tidyverse. The latter component of aesthetic mapping can be added in both the `ggplot` and `geom_xxx` functions.
To map our data, we can then run:
```{r}
ggplot(data = oa_shp, aes( fill = Ethnic) ) + # add data frame and variable to map
geom_sf(colour = "gray60", # colour line
size = 0.1) # line size
```
We can change the colour palette by using a different colour palette. We can use the [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) package. The power of `viridis` is that it uses color scales that visually pleasing, colorblind friendly, print well in gray scale, and can be used for both categorical and continuous data. For categorical data you can also use [ColourBrewer](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3).
So let us (1) change the colour pallete to viridis, (2) remove the colour of the boundaries, and (3) replace the theme with the theme we will be using for the book. Let's read the theme first and then implement these changes.
The `ggplot` themes for the book are in a file called `data-visualisation_theme.R` in the `style` folder. We read this file by running:
```{r}
source("./style/data-visualisation_theme.R")
```
We can now implement our changes:
```{r}
ggplot(data = oa_shp, aes( fill = Ethnic) ) + # add data frame and variable to map
geom_sf(colour = "transparent") + # colour line
scale_fill_viridis( option = "viridis" ) + # add viridis colour scheme
theme_map_tufte()
```
To master `ggplot`, see @wickham2009.
**Using `tmap`**
Similar to `ggplot2`, `tmap` is based on the idea of a 'grammar of graphics' which involves a separation between the input data and aesthetics (i.e. the way data are visualised). Each data set can be mapped in various different ways, including location as defined by its geometry, colour and other features. The basic building block is `tm_shape()` (which defines input data), followed by one or more layer elements such as `tm_fill()` and `tm_dots()`.
```{r}
# ensure geometry is valid
oa_shp = sf::st_make_valid(oa_shp)
# map
legend_title = expression("% ethnic pop.")
map_oa = tm_shape(oa_shp) +
tm_fill(col = "Ethnic", title = legend_title, palette = magma(256), style = "cont") + # add fill
tm_borders(col = "white", lwd = .01) + # add borders
tm_compass(type = "arrow", position = c("right", "top") , size = 4) + # add compass
tm_scale_bar(breaks = c(0,1,2), text.size = 0.5, position = c("center", "bottom")) # add scale bar
map_oa
```
Note that the operation `+`, as for `ggplot` is used to add new layers. You can set style themes by `tm_style`. To visualise the existing styles use `tmap_style_catalogue()`. An advantage of `tmap` is that you can easily create an interactive map by running `tmap_mode`.
```{r}
tmap_mode("view")
map_oa
```
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
Try mapping other variables in the spatial data frame. Where do population aged 60 and over concentrate?
:::
:::
### Comparing geographies
If you recall, one of the key issues of working with spatial data is the modifiable area unit problem (MAUP) - see @spatial_data. To get a sense of the effects of MAUP, we analyse differences in the spatial patterns of the ethnic population in Liverpool between Middle Layer Super Output Areas (MSOAs) and OAs. So we map these geographies together.
::: column-margin
::: callout-note
The first line of the chunk code include `tmap_mode("plot")` which tells R that we want a static map. `tmap_mode` works like a switch to interactive and non-interactive mapping.
:::
:::
```{r}
tmap_mode("plot")
# read data at the msoa level
msoa_shp <- st_read("data/census/Liverpool_MSOA.shp")
# ensure geometry is valid
msoa_shp = sf::st_make_valid(msoa_shp)
# create a map
map_msoa = tm_shape(msoa_shp) +
tm_fill(col = "Ethnic", title = legend_title, palette = magma(256), style = "cont") +
tm_borders(col = "white", lwd = .01) +
tm_compass(type = "arrow", position = c("right", "top") , size = 4) +
tm_scale_bar(breaks = c(0,1,2), text.size = 0.5, position = c("center", "bottom"))
# arrange maps
tmap_arrange(map_msoa, map_oa)
```
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
What differences do you see between OAs and MSOAs?
Can you identify areas of spatial clustering? Where are they?
:::
:::