\newpage
# Advanced Visualizations
## Overview
Before continuing with this chapter, you should be familiar with the [qplot() chapter](https://colauttilab.github.io/RCrashCourse/2_qplot.html), and you should have plenty of practice making graphs with different formatting options using `qplot`.
In this self-tutorial, we look at some more advanced options for visualizations using the `ggplot` function. If you have a good feel for `qplot`, then this will be an easy transition, with `ggplot` adding even more flexibility to our visualizations. This includes everything you will need to make professional-grade figures.
The [ggplot 'cheat sheet'](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf) is a downloadable pdf that provides a good summary and quick-reference guide for advanced graphics. Other useful cheat sheets are available here: https://www.rstudio.com/resources/cheatsheets/
## Getting Started
First, we'll load the `ggplot2` library and set a custom theme as described in the [`qplot` Tutorial](https://colauttilab.github.io/RCrashCourse/2_qplot.html)
```{r}
library(ggplot2)
source("http://bit.ly/theme_pub")
theme_set(theme_pub())
```
The `source` function loads an external file, in this case from the internet. The file is just a script saved as a .R file, with a custom function defining different aspects of the graph (e.g. text size, line width, etc.). You can open the link in a web browser, or download the file and open it in a text editor, to see the script.
The `theme_set()` command sets our custom theme (`theme_pub`) as the default plotting theme. Since the theme is a function in R, we need to include the brackets, even though there is nothing needed inside the function: `theme_pub()`
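If you ever want to tweak the active theme without editing the sourced script, one option (shown here only as a sketch; we don't use it in this chapter) is ggplot2's `theme_update()` function.
```{r, eval=FALSE}
# A sketch only (eval=FALSE so it does not change the theme used in this chapter):
# theme_update() modifies the currently active theme, here shrinking the axis text
theme_update(axis.text = element_text(size=10))
```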
<br>
***
<br>
## Rules of thumb
Before we dig into the code, it's worth reviewing some more general graphical concepts. Standards of practice for published graphs in professional journals can vary depending on format (e.g. print vs online), audience, and historical precedent. Nevertheless, there are a number of useful 'rules of thumb' to keep in mind. These are not hard-and-fast rules, but they are helpful for new researchers who aren't sure how or where to start. In making decisions, always think of your audience and remember that the main goal is to communicate information as clearly and efficiently as possible.
**1. Minimize 'ink'**
In the old days, when most papers were actually printed and mailed to journal subscribers, black ink was expensive and colour printing was even more so. Printing is still expensive, but of course most research articles are now read online, where colour and extra ink carry no additional cost. Nevertheless, the concept of minimizing ink (or pixels) can go a long way toward keeping a graph free from clutter and unnecessary distraction.
**2. Use space wisely**
Empty space is not necessarily bad, but ask yourself if it is necessary and what you want the reader to take away. Consider the next two graphs:
```{r, echo=F}
Y<-rnorm(100)+60
X<-rbinom(100,1,0.5)
Y<-Y+(10*X)
X<-as.factor(X)
qplot(X,Y)
```
```{r, echo=F}
qplot(X, Y) + scale_y_continuous(limits=c(0, 100))
```
In the first example, the Y-axis is scaled to the data. In the second, the Y-axis is scaled from 0 to 100.
What are the benefits/drawbacks of scaling the axes? When might you choose to use one over the other?
**3. Choose a colour palette**
Colour has three basic components
a. **Hue** -- the relative proportion of red vs green vs blue light
b. **Saturation** -- how vivid the colour is
c. **Brightness** -- the amount of white (vs black) in the colour
The abbreviation HSB is often used, or HSL (L = 'Lightness') or HSV (V = 'Value').
In R these can be easily defined with the `rgb()` function. For example:
* `rgb(1,0,0)` -- a saturated red
* `rgb(0.1,0,0)` -- a dark red (low brightness)
* `rgb(1,0.9,0.9)` -- a light red (high brightness, low saturation)
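To see how these definitions map onto hue, saturation, and brightness, here is a quick base-R sketch (not part of the original tutorial) that plots the three reds side by side and converts them to HSV with `rgb2hsv()`.
```{r}
# Plot the three reds defined above and inspect their hue/saturation/value
reds <- c(rgb(1,0,0), rgb(0.1,0,0), rgb(1,0.9,0.9))
barplot(rep(1,3), col=reds, names.arg=c("saturated", "dark", "light"))
rgb2hsv(col2rgb(reds)) # columns are the three colours; rows are h, s, v
```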
Don't underestimate the impact of choosing a good colour palette, especially for presentations. Colour theory can get a bit overwhelming but here are a few good websites to help:
* Quickly generate your own palette using Coolors: https://coolors.co
* Use a colour wheel to find complementary colours using Adobe : https://color.adobe.com/create
**4. Colours have meaning**
What's wrong with this graph?
```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Temperature=Y,Location=X,Temp=Y/3)
qplot(Location, Temperature,colour=Temp, data=D) +
scale_color_gradient(high="blue", low="red")
```
Technically, there is nothing wrong. But we naturally associate colours with particular feelings. In this case, we intuitively associate red with hot and blue with cold, which is the opposite of what is shown in this graph. Be mindful of these associations when choosing a colour palette.
Another important consideration is that not everyone sees colour the same way. About 5% to 10% of the population has colour blindness. In order to make colour graphs readable to everyone, you should make sure to use colours that can still be interpreted when printed in greyscale.
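One low-effort option, shown here as a sketch with hypothetical data (it is not part of the original figures), is ggplot2's built-in viridis scales, which are designed to remain legible in greyscale and under most forms of colour blindness.
```{r}
# Hypothetical data; scale_colour_viridis_c() gives a continuous palette that
# survives greyscale printing and is colour-blind friendly
D2 <- data.frame(x=rnorm(100), y=rnorm(100), z=rnorm(100))
qplot(x, y, colour=z, data=D2) +
  scale_colour_viridis_c()
```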
**5. Maximize contrast**
Colours that are too similar will be hard to distinguish
```{r, echo=F}
X<-rnorm(100)
Y<-X+seq_along(X)
D<-data.frame(Lat=X,Long=Y,Precip=rnorm(100))
qplot(Lat,Long,colour=Precip, data=D) +
scale_color_gradient(high="#56B4E9", low="#56B499")
```
Can you see the gradient of colours?
**6. Keep relevant information**
Make sure to include proper axis **labels** (i.e. names) and **tick marks** (i.e. numbers or categories showing the different values). These labels, along with the figure caption, should act as a stand-alone unit. The reader should be able to understand the figure without having to read through the rest of the paper.
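As a sketch with hypothetical data and labels (not from the original tutorial), `labs()` is one convenient place to set axis labels, including units.
```{r}
# Hypothetical data; labs() sets the axis labels (with units) in one place
weather <- data.frame(Temp=rnorm(50, 20, 5), Precip=rnorm(50, 100, 20))
qplot(Temp, Precip, data=weather) +
  labs(x="Temperature (°C)", y="Precipitation (mm)")
```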
**7. Choose the right graph**
Often the same data can be presented in different ways but some are easier to interpret than others. Think carefully about the story you want to present and the main ideas you want your reader to get from your figures. Look at these two graphs that show the same data.
```{r}
ADat<-data.frame(X=rnorm(100), Trait="A")
BDat<-data.frame(X=5 + rnorm(100) *5 + ADat$X * 5, Trait="B")
PDat<-rbind(ADat,BDat)
qplot(X, fill=Trait, position="dodge", data=PDat)
qplot(ADat$X, BDat$X)
```
The first graph tells a story about the distributions -- the mean and variance of each trait. The second graph tells a story about the correlated relationship between trait A and trait B. One is not necessarily better than the other. It depends on the story you want to tell.
<br>
***
<br>
## Example
Now that we have gone over some basic graphing concepts, let's look at how to build a professional-grade figure. In fact, we'll reconstruct a figure published in a [paper by Colautti & Lau](https://doi.org/10.1111/mec.13162) in the journal Molecular Ecology (2015).
### Setup
#### Import data
Download the selection dataset from Colautti & Lau (2015), published in the journal **Molecular Ecology**: https://doi.org/10.1111/mec.13162
The paper is a meta-analysis and review of evolution occurring during biological invasions. We will recreate Figure 2, which shows the result of a meta-analysis of selection gradients ($\beta$) and selection differentials ($s$). First, we'll just recreate the top panel, and then we'll look at ways to make more advanced multi-panel graphs like this.
The data from the paper are archived on Dryad: https://datadryad.org/stash/dataset/doi:10.5061/dryad.gt678
You could download the zip file, find the file called `Selection_Data.csv`, and save it to your working directory.
Or you can just download directly from this website:
```{r}
SelData<-read.csv(
"https://colauttilab.github.io/RCrashCourse/Selection_Data.csv")
```
We are also going to rename the columns to make them a bit more intuitive and easier to work with in R.
```{r}
names(SelData)<-c("Collector", "Author", "Year", "Journal",
"Vol", "Species", "Native", "N",
"Fitness.measure", "Trait", "s",
"s.SE", "s.P", "B", "B.SE", "B.P")
```
### Inspect
Let's take a quick look at the data
```{r}
head(SelData)
```
It's worth taking some time to look at this to understand how a meta-analysis works. The **Collector** column indicates the paper that the data came from. The **Author** column indicates the author(s) of the original paper that reported the data. The **Year**, **Journal**, and **Vol** columns give information about the publication that the data originally came from.
We can see above the collector `KingsolverDiamond`, which represents a paper from Kingsolver and Diamond that was itself a meta-analysis of natural selection. Most of the studies came from this meta-analysis, but a few of the more recent papers were added by grad students, denoted by initials:
```{r}
unique(SelData$Collector)
```
**Species** is the study species, and **Native** is its status as a binary yes/no variable. **N** is the sample size and **Fitness.measure** is the specific trait that defines fitness. **Trait** is the trait on which selection was measured. Finally, $s$ is the **selection differential** and $\beta$ is the **selection gradient**. Note that these are slopes in units of relative fitness per trait standard deviation. This is explained in more detail below.
### Absolute Value
In this analysis, we are interested in the magnitude but not the direction of natural selection. In other words, we want to treat a slope of -4 the same as a slope of +4 because they have the same magnitude. Therefore, we can replace the $s$ column with $|s|$.
```{r}
SelData$s<-abs(SelData$s)
```
We'll also add a couple of columns with random variables that we can use later to explore additional plotting options.
First, a column of values sampled from a z-distribution -- this is a Gaussian (a.k.a. 'normal') distribution with mean = 0 and sd = 1.
```{r}
SelData$Rpoint<-rnorm(nrow(SelData))
```
Second, a column of 1s and 0s sampled randomly with equal frequency ($p = 0.5$).
```{r}
SelData$Rgroup<-sample(c(0,1), nrow(SelData), replace=T)
```
### Missing values
Check for missing data (denoted NA)
```{r, eval=FALSE}
print(SelData$s)
```
We can subset to remove missing data
```{r}
SelData<-SelData[!is.na(SelData$s),]
```
Recall from the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html) that `!` means 'not' or 'invert'
There is also a convenient `drop_na` function in the `tidyr` package
```{r, eval=F}
library(tidyr)
SelData<-SelData %>%
drop_na(s)
```
<br>
***
<br>
## Measuring Selection
### Don't Panic!
We're going to get a bit technical here. Don't worry if you don't completely understand all of the material below about measuring selection. Keep it for reference in case you decide you want to use it for your own research. For now, just try to understand it as well as you can and focus on the code used to produce the figures.
A method for analyzing phenotypic selection proposed by Lande & Arnold (1983) provides a simple but powerful tool for measuring natural selection. It is just a linear model with **relative fitness** on the y-axis and the **standardized trait value** on the x-axis.
### Relative Fitness
**Fitness** can be measured in many ways, such as survival or lifetime seed or egg production. Check out the list of specific fitness measures used in these studies:
```{r, eval = F}
unique(SelData$Fitness.measure)
```
**Absolute fitness** is just the observed value (e.g. seed set or survival yes/no). **Relative fitness** is just the *absolute* fitness divided by the mean. **Absolute fitness** is usually denoted by the capital letter $W$ and **relative fitness** is usually represented by a lower-case $w$ or omega $\omega$. Expressing this in mathematical terms:
$$ \omega = W_i/\bar W $$
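For example, here is a minimal sketch with made-up fitness values (not from the dataset).
```{r}
# Hypothetical absolute fitness values (e.g. seeds produced per plant)
W <- c(10, 25, 0, 40, 15)
w <- W/mean(W) # relative fitness
w
```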
### Trait Value
A **Trait Value** is literally just the measured trait. Use `unique(SelData$Trait)` to see the list of specific traits that were measured in these studies. The **Standardized Trait Value** is the trait's z-score. See the [Distributions Tutorial](https://colauttilab.github.io/RIntroStats/1_Distributions.html) for more information about z-scores. To calculate the z-score, we take each value, subtract the mean, and then divide by the standard deviation.
$$\frac{x_i-\bar X}{s}$$
Since traits are measured in different units (e.g. days to flower, egg biomass, foraging intensity, aggression), they are hard to compare directly. Standardizing traits to z-scores puts them all on the same scale for comparison. Specifically, the scale is in standard deviations from the mean.
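As a quick sketch with hypothetical trait values (not from the dataset):
```{r}
# Hypothetical trait measurements (e.g. days to flower)
x <- rnorm(20, mean=50, sd=10)
z <- (x - mean(x))/sd(x) # z-scores; equivalent to as.vector(scale(x))
round(c(mean=mean(z), sd=sd(z)), 2) # standardized to mean 0 and sd 1
```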
### $s$ vs $\beta$
Selection differentials ($s$) and selection gradients ($\beta$) measure selection using linear models but represent slightly different measurements. Linear models are covered in the [Linear Models Tutorial](https://colauttilab.github.io/RIntroStats/3_LinearModels.html).
Both models use relative fitness ($\omega$) as the response variable ($Y$).
Selection differentials ($s$) measure selection on only a single trait, ignoring all other traits. In theory, the response to selection is a simple function of the genetic correlation between a trait and fitness. Some of this theory, with examples in R, is covered in the [Population Genetics Tutorials](https://colauttilab.github.io/PopGen/3_PopGenFUN_II.html#Natural_Selection).
Fitness differences among individuals can depend on a lot of things -- genetic variation for the trait itself, but also environmental effects on the trait as well as effects on other traits that are under selection and correlated with the trait of interest.
Selection gradients ($\beta$) measure selection on a trait of interest while also accounting for selection on other correlated traits. This is done via a **multiple regression** -- a linear model with multiple predictors.
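Here is a sketch with simulated data (not the meta-analysis data) showing the difference: the selection differential is the slope of a simple regression of relative fitness on one trait, while the selection gradient is the partial slope from a multiple regression that also includes the correlated trait.
```{r}
# Simulate two correlated traits and relative fitness (hypothetical values)
z1 <- rnorm(100)
z2 <- 0.5*z1 + rnorm(100)
w <- 1 + 0.3*z1 + 0.2*z2 + rnorm(100, sd=0.5)
coef(lm(w ~ z1))      # selection differential (s) for z1
coef(lm(w ~ z1 + z2)) # selection gradients (beta) for z1 and z2
```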
<br>
***
<br>
## `ggplot` vs `qplot`
We can create a similar graph using `qplot` and `ggplot`, but the syntax can be quite different.
### Histogram
Let's start with a simple `qplot`
```{r, error=TRUE}
BarPlot<-qplot(s, data=SelData, fill=Native, geom="bar")
print(BarPlot)
```
Now let's try with `ggplot`. We'll make it into an object that we can work with.
```{r, error=TRUE}
BarPlot<-ggplot(aes(s, fill=Native), data=SelData)
```
### `aes`
Note the use of the aesthetic function `aes()`. This defines the data that we want to use for our `ggplot` graph. We will see how we do this by adding layers to our plot, similar to the way old-timey cartoons were made by layering multiple clear sheets of cellophane with characters painted on them. The `aes` function inside of the `ggplot` function defines the data that will be shared among all of the layers. In addition, we can have separate `aes` functions inside different `geom_` layers that define and restrict the plotting data to that specific layer.
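As a preview of that idea, here is a minimal sketch (separate from the plot we are building): the `aes()` inside `ggplot()` is inherited by every layer, while an `aes()` inside a geom applies only to that layer.
```{r, eval=FALSE}
ggplot(SelData, aes(x=s)) +         # x=s is shared by all layers
  geom_histogram() +                # inherits x=s
  geom_freqpoly(aes(colour=Native)) # adds a layer-specific colour mapping
```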
Let's look at the `ggplot` object so far:
```{r}
print(BarPlot)
```
No data! This is one key difference between `qplot` and `ggplot`. The former has default `geoms` that it applies depending on the type of input data. We can modify geoms with the `geom = "NAME"` parameter in `qplot`, where 'NAME' is the specific name of the geom we want to use. In the above `qplot` example we used `geom = "bar"`.
In `ggplot` we have to explicitly add geoms as overlapping layers using `+ geom_NAME`, where NAME is the specific name of the geom -- the same text that would go in quotation marks for the geom in `qplot`. The advantage of using `ggplot` is that it's easier to create multiple, overlapping layers with different geoms drawn from different data sources.
### Layers
So far, we have only loaded in the data for plotting. We still have to specify which geom(s) we want.
```{r}
BarPlot<- BarPlot + geom_bar()
BarPlot
```
Compare this graph and code to the above `qplot` figure.
Let's explore the components of our BarPlot object:
```{r}
summary(BarPlot)
```
This shows us the columns of **data** available for plotting, the **mapping** for our x and y axes, and our **fill** colours. It also shows some of the functions and parameters used to generate the graph. At the bottom we see parameters for `geom_bar` and `stat_count`. Note that there are more parameters listed than what we explicitly put into the `ggplot()` function. These extra parameters are the **default parameters** for the function.
### `geom_` and `stat_`
If *geoms* are the *geometry* of the shapes in the plot, *stats* are the *statistics* or mathematical functions that create the geoms. In the above case, the bars in `geom_bar` are created by counting the number of observations in each bin. The `stat_count` function is responsible for this calculation, and it is called by default when we use the `geom_bar` function.
Just as we can change the geometry of the plotted shapes with `geom_<NAME>`, we can define different functions for generating those shapes with `stat_<NAME>` (in both cases, `<NAME>` is a specific name like `bar` or `count`). To make things easier on us, there is a default *stat* for each *geom*, so in most cases we can just focus on which geometry we want for our graph and ignore the *stat* (i.e. use the default).
For more information on the parameters and stats of `geom_bar()` or any other geom, use the R help function.
```{r, eval=FALSE}
?geom_bar
```
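To see the pairing in action, here is a small sketch (not part of the original): these two lines draw identical bars, because `stat_count()` is the default stat for `geom_bar()` and `geom_bar()` is the default geom for `stat_count()`.
```{r, eval=FALSE}
ggplot(SelData, aes(x=Native)) + geom_bar()    # uses stat_count() behind the scenes
ggplot(SelData, aes(x=Native)) + stat_count()  # uses geom_bar() behind the scenes
```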
### Bivariate geom
Let's explore a few more plotting options. Here we'll use the random normal values we generated above so that we can make a bivariate plot:
```{r}
BivPlot<-ggplot(data=SelData, aes(x=s, y=Rpoint)) +
geom_point()
print(BivPlot)
```
Notice how the points are all clustered to the left. This looks like a classic log-normal variable, so let's log-transform $s$ (the x-axis).
```{r}
BivPlot<-ggplot(data=SelData, aes(x=log(s+1), y=Rpoint)) +
geom_point()
print(BivPlot)
```
### `geom_smooth`
A really handy feature of `ggplot` is the `geom_smooth` function, which has several options for fitting and plotting a statistical model to the observations.
Here's a simple linear regression slope (lm = linear model):
```{r}
BivPlot +
geom_smooth(method="lm", colour="steelblue", size=2)
```
We can use a grouping variable to add separate regression lines for each group
```{r}
BivPlot +
geom_smooth(method="lm" ,size=2, aes(group=Native, colour=Native))
```
<br>
***
<br>
## Full ggplot
Now that we've done a bit of exploration, let's try to recreate the selection histograms from Colautti & Lau:
1. Create separate data for native vs. introduced species
2. Use a bootstrap to estimate non-parametric mean and 95% confidence intervals
3. Plot all of the components on a single graph
### Separate data
Since this is a relatively simple resampling model, we'll use two separate vectors to keep track of each iteration: one for native and one for non-native species.
```{r}
NatSVals<-SelData$s[SelData$Native=="yes"]
IntSVals<-SelData$s[SelData$Native=="no"]
```
An alternative would be to set up a data frame and keep track of values as separate columns, with a different row for each iteration.
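A sketch of what that alternative might look like (hypothetical object and column names; the rest of this tutorial sticks with the two vectors):
```{r, eval=FALSE}
BootOut <- data.frame(Iteration=1:100, # one row per bootstrap iteration
                      NatMean=NA,      # bootstrapped mean for native species
                      IntMean=NA)      # bootstrapped mean for introduced species
```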
### Bootstrap
The graph includes a bootstrap model to estimate the mean and variance for each group (`Native` = yes vs no). Bootstrapping is covered in the [Bootstrapping and Randomization Tutorial](https://colauttilab.github.io/EcologyTutorials/bootstrap.html). The example below is not the most efficient approach, but it applies the flow control concepts covered in the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html).
#### Data Setup
First we define the number of iterations and set up two objects to hold the data from our bootstrap loops.
```{r}
IterN<-100 # Number of iterations
NatSims<-{} # Dummy objects to hold output
IntSims<-{}
```
#### `for` loop
Here we apply our `for` loop. In each round, we:
1. Sample with replacement and calculate the average
2. Store the average in `NatSims` (native species) or `IntSims` (non-native species)
```{r}
for (i in 1:IterN){
NatSims[i]<-mean(sample(
NatSVals, length(NatSVals), replace=T))
IntSims[i]<-mean(sample(
IntSVals, length(IntSVals), replace=T))
}
```
Note that in the above code we use 'nested' functions: first we `sample`, then we calculate the `mean`.
#### 95% CI
Confidence Intervals (CI) are calculated from the bootstrap output.
First, sort the data from low to high
```{r}
NatSims<-sort(NatSims)
IntSims<-sort(IntSims)
```
Each of the objects contains a number of values equal to our `IterN` variable defined above. Now we identify the lower 2.5% and upper 97.5% values in each vector. For example, with 1000 iterations the lower 2.5% cutoff would be the 25th value in the sorted vector and the upper 97.5% cutoff would be the 975th value.
We use these positions to index the vectors with square brackets, making sure to round to a whole number since we can't have a fractional position.
```{r}
CIs<-c(NatSims[round(IterN*0.025,0)], # Lower 2.5%
NatSims[round(IterN*0.975,0)], # Upper 97.5%
IntSims[round(IterN*0.025,0)], # Lower 2.5%
IntSims[round(IterN*0.975,0)]) # Upper 97.5%
```
Note that the output is a vector of length 4:
```{r}
print(CIs)
```
Note: your numbers should be similar but won't be exact because each random sample will be slightly different.
> What line could we add above to ensure that these numbers were exactly the same for everyone who ran this code?
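As an aside (not the approach used here), the `quantile()` function finds similar cutoffs directly, without the manual sort-and-index step; its default interpolation differs slightly from simple indexing, so the values won't match exactly.
```{r}
quantile(NatSims, probs=c(0.025, 0.975)) # native 95% CI
quantile(IntSims, probs=c(0.025, 0.975)) # introduced 95% CI
```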
### Plot data
We'll put the plotting data (the selection estimates and their native/introduced status) into a single data.frame object to make it easier to incorporate into our `ggplot` functions.
```{r}
HistData<-data.frame(s=SelData$s,Native=SelData$Native)
```
Now we can add layers to the plot. We'll print out each layer as we go, so that we can see what each layer adds to the overall graph. The coding is a bit complex here, so don't worry if it's hard to follow everything. The key thing to understand is how the different geoms contribute to the final plot.
```{r}
p <- ggplot() + theme_classic()
p <- p + geom_freqpoly(data=HistData[HistData$Native=="yes",],
aes(s,y=(..count..)/sum(..count..)),
alpha = 0.6,colour="#1fcebd",size=2)
print(p) # native species histogram
p <- p + geom_freqpoly(data=HistData[HistData$Native=="no",],
aes(s,y=(..count..)/sum(..count..)),
alpha = 0.5,colour="#f53751",size=2)
print(p) # introduced species histogram
p <- p + geom_rect(aes(xmin=CIs[1],xmax=CIs[2],ymin=0,ymax=0.01),
colour="white",fill="#1fcebd88")
print(p) # native species 95% CI bar
p <- p + geom_line(aes(x=mean(NatSims),y=c(0,0.01)),
colour="#1d76bf",size=1)
print(p) # native species bootstrap mean
p <- p + geom_rect(aes(xmin=CIs[3],xmax=CIs[4],ymin=0,ymax=0.01),
colour="white",fill="#f5375188")
print(p) # introduced species 95% CI bar
p <- p + geom_line(aes(x=mean(IntSims),y=c(0,0.01)),
colour="#f53751",size=1)
print(p) # introduced species bootstrap mean
p <- p + ylab("Frequency") +
scale_x_continuous(limits = c(0, 1.5))
print(p) # labels added, truncated x-axis
```
Another important point to note is that in this case we leave the `ggplot()` function empty because each geom uses different data. The data for each geom are defined by separate `aes` functions inside each geom.
<br>
***
<br>
## Multi-graph
In the [qplot Tutorial](https://colauttilab.github.io/RCrashCourse/2_qplot.html) we saw a quick and easy way to make multiple graphs for different groups using the `facets` parameter.
### `facets`
We can also use facets with `ggplot`, using a slightly different syntax that gives us a couple of options:
1. `facet_grid` lets us define a grid and set the vertical and horizontal variables
2. `facet_wrap` is a convenient option if we only have one categorical variable but many categories
NOTE: one slightly tricky part of facets with `ggplot` is that they use `vars()` instead of the `aes()` function to indicate which categorical variables from the original dataset should be used to subset the graphs.
Returning to the `BivPlot` example above:
```{r}
BivPlot<-ggplot(data=SelData, aes(x=log(s+1), y=Rpoint)) +
geom_point()
BivPlot + facet_grid(rows=vars(Native), cols=vars(Collector))
```
```{r}
BivPlot<-ggplot(data=SelData, aes(x=log(s+1), y=Rpoint)) +
geom_point()
BivPlot + facet_wrap(vars(Year))
```
### `gridExtra` package
Facets produce graphs that all have the same dimensions and the same x- and y-axes. We might call these 'homogeneous' plots because they use a homogeneous format. For publications, though, we might want to include 'heterogeneous' plots with different axes and different sizes. The `gridExtra` package provides options for this.
Install with `install.packages("gridExtra")` if you haven't installed it already, then load the library
```{r}
library(gridExtra)
```
#### `grid.arrange()`
Use this to combine heterogeneous ggplot objects into a single multi-panel plot.
Note that this will print graphs down rows, then across columns, from top left to bottom right. You can use `nrow` and `ncol` to control the layout in a grid format.
```{r, warning=F, message=F}
HistPlot<-p # Make a more meaningful name
grid.arrange(HistPlot,BivPlot,BarPlot,ncol=1)
grid.arrange(HistPlot,BivPlot,BarPlot,nrow=2)
```
> Note: You might get some warnings about missing values or the default binwidth. You will also see some odd differences in text size across the graphs. Normally you would want to fix these for a final published figure, but here we are just focused on showing what is possible with the layouts.
### `grid` package
What if we want to have graphs of different sizes? Or what if we want one figure to be inside another? We can make some even more advanced graphs using the `grid` package. This is part of the base installation of R so you don't need to use `install.packages()` this time.
```{r}
library(grid)
```
First, we set up the plotting area with `grid.newpage`
```{r, warning=F, message=F}
grid.newpage() # Open a new page on grid device
```
To insert a new graph on top of (or inside) the current graph, we use `pushViewport` to set up an imaginary plotting grid. In this case, imagine breaking up the plotting space into 3 rows by 2 columns.
```{r, warning=F, message=F}
pushViewport(viewport(layout = grid.layout(3, 2)))
```
Finally, we print each `ggplot` object, specifying which grid cell(s) to plot in.
Add the first figure in row 3 and across columns 1:2
```{r, warning=F, message=F}
print(HistPlot, vp = viewport(layout.pos.row = 3,
layout.pos.col = 1:2))
```
Add the next figure across rows 1 and 2 of column 1
```{r, warning=F, message=F}
print(BivPlot, vp = viewport(layout.pos.row = 1:2,
layout.pos.col = 1))
```
Add the final figure across rows 1 and 2 of column 2
```{r, warning=F, message=F}
print(BarPlot, vp = viewport(layout.pos.row = 1:2,
layout.pos.col = 2))
```
#### Inset
We can also use `pushViewport` to set up a grid for plotting on top of an existing graph or image. This can be used to generate a figure with an inset.
First generate the 'background' plot. Note that you could alternatively load an image here to place in the background.
```{r, eval=F}
HistPlot
```
Next, overlay an invisible 4x4 grid layout (the number of cells determines the size and location of the inset graph)
```{r, eval=F}
pushViewport(viewport(layout = grid.layout(4, 4)))
```
Finally, add the graph. In this case we want it only in the top two rows and the right-most two columns -- i.e. the top-right corner.
```{r, eval=F}
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 3:4))
```
The final product:
```{r, warning=F, message=F}
HistPlot
pushViewport(viewport(layout = grid.layout(4, 4)))
print(BivPlot, vp = viewport(layout.pos.row = 1:2, layout.pos.col = 3:4))
```
<br>
***
<br>
## Further Reading
The 2009 book *ggplot2: Elegant Graphics for Data Analysis* by Hadley Wickham is the definitive guide to all things ggplot.
A physical copy is published by Springer
http://link.springer.com/book/10.1007%2F978-0-387-98141-3
And there is a free ebook version: https://ggplot2-book.org/