-
Notifications
You must be signed in to change notification settings - Fork 51
/
Copy pathPop_Structure.Rmd
299 lines (243 loc) · 11.3 KB
/
Pop_Structure.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
```{r,message=FALSE,echo=FALSE}
html <- TRUE
library("knitcitations")
library("knitr")
cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
bib <- read.bibtex("bibtexlib.bib")
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE,
fig.width = 10, fig.height = 6, cache = TRUE)
if (html) opts_chunk$set(out.width = "700px", dpi = 300)
# use this to set knitr options:
# http://yihui.name/knitr/options #chunk_options
```
---
title: 'Population structure: $G_{ST}$, genetic distance, and clustering'
subtitle: '*ZN Kamvar, SE Everhart and NJ Grünwald*'
---
In this chapter we explore various ways of assessing if population are
structured (e.g., differentiated). You can think of population structure as
identifying clusters or groups of more closely related individuals resulting
from reduced gene flow among these groups. Populations can be studied to
determine if they are structured by using, for example, population
differentiation summary statistics (e.g. $G_{ST}$), clustering or [minimum
spanning networks](Minimum_Spanning_Networks.html). Note, that this chapter will
utilize many data sets due to the unique features offered by each. Let's first
look at an example of population differentiation based on $G_{ST}$.
$G_{ST}$ an example with *Felis catus* data.
----
Assessing genetic diversity almost always starts with an analysis of a parameter
such as $G_{ST}$. There are [lengthy debates](http://www.molecularecologist.com/2011/03/should-i-use-fst-gst-or-d-2/)
as to what measure of differentiation is better `r citep(bib['meirmans2011assessing'])`.
Instead of going into that lengthy debate, it would be more worthwhile to point
you into the direction of a package dedicated to Modern Methods of
Differentiation called *mmod*. We will use the data set *nancycats* containing
17 colonies of cats collected from Nancy, France. As cats tend to stay within
small groups, we expect to see some population differentiation. In terms of
these diversity measures, an index of $G_{ST} = 0$ indicates no differentiation,
whereas $G_{ST} = 1$ indicates that populations are segregating for differing
alleles.
Let's load the package and the example data set:
```{r, loadstuff}
library("mmod")
data("nancycats")
nancycats
```
Now we will use Hendrick's standardized $G_{ST}$ to assess population structure
among these populations `r citep(bib['hedrick2005standardized'])`.
```{r, nancycats}
Gst_Hedrick(nancycats)
```
What does this output tell us?
Next we will look at genetic distance to find related groups of individuals.
Genetic Distance
----
If we wanted to analyze the relationship between individuals or populations, we
would use genetic distance measures which calculate the "distance" between
samples based on their genetic profile. These distances can be visualized with
heatmaps, dendrograms, or minimum spanning networks. In the package *poppr*,
there are several distances available:
| Distance | Function | Marker type | Can handle missing data |
| ---- | ---- | ---- | ---- | ---- |
| Bruvo's distance | `bruvo.dist` | microsatellite | yes |
| Edwards' distance | `edwards.dist` | any | no |
| Nei's distance | `nei.dist` | any | no |
| Provesti's distance | `provesti.dist` | any | yes |
| Reynolds' distance | `reynolds.dist` | any | no |
| Rogers' distance | `rogers.dist` | any | no |
| Provesti's distance | `bitwise.dist` | SNP | yes |
One common way to visualize a genetic distance is with a dendrogram. For this
example, we will use the *microbov* data set `r citep(bib["laloe2007consensus"])`.
This contains information on 704 cattle from both Africa and France over several
different breeds. We can create a dendrogram over all 704 samples, but that
would be difficult to visualize. For our purposes, let's take ten random samples
and calculate Provesti's distance, which will return the fraction of the number
of differences between samples:
```{r, cattlefirst}
library("poppr")
library("ape") # To visualize the tree using the "nj" function
library("magrittr")
data(microbov)
set.seed(10)
ten_samples <- sample(nInd(microbov), 10)
mic10 <- microbov[ten_samples]
(micdist <- provesti.dist(mic10))
```
The above represents the pairwise distances between these 10 samples. We will
use this distance matrix to create a neighbor-joining tree.
```{r, cattleplot, fig.width = 5, fig.height = 7}
theTree <- micdist %>%
nj() %>% # calculate neighbor-joining tree
ladderize() # organize branches by clade
plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.
```
Notice that the sample names start with either "AF" or "FR". This indicates
their country of origin and we are seeing that the populations cluster
correspondingly. Of course, a tree is a hypothesis and one way of generating
support is to bootstrap loci. This can be achieved with the *poppr* function
`aboot`.
```{r, cattleboot, fig.width = 5, fig.height = 7}
set.seed(999)
aboot(mic10, dist = provesti.dist, sample = 200, tree = "nj", cutoff = 50, quiet = TRUE)
```
The bootstrap value of 100 on the node separating the French and African samples
gives support that the country of origin is a factor in how these breeds are
structured. If we wanted to analyze all of the breeds against one another, it
would be better to create a bootstrapped dendrogram based on a genetic distance.
To do this, we will add 3 stratifications to the microbov data set: Country,
Breed, and Species. We will then set the population to Country by Breed, convert
the data to a genpop object and then create a tree using `aboot` with Nei's
genetic distance.
```{r, cattlepopboot, fig.width = 5, fig.height = 7, cache = TRUE}
# Setting up the data
strata(microbov) <- data.frame(other(microbov))
microbov
nameStrata(microbov) <- ~Country/Breed/Species
# Analysis
set.seed(999)
microbov %>%
genind2genpop(pop = ~Country/Breed) %>%
aboot(cutoff = 50, quiet = TRUE, sample = 1000, distance = nei.dist)
```
Now we can see that, in all 1,000 bootstrapped trees, the African and French
samples were each in separate clades. Of course, dendrograms are only one type
of analysis you can use genetic distances for. Below is a table describing some
of the different analyses for which you can utilize genetic distance:
| Analysis | Function | Package | Note |
| ---- | ----- | ---- | ---- |
| Bootstrapped dendrograms | `aboot` | *poppr* | |
| Mantel Test | `mantel.randtest` | *ade4* | To be used with geographic distance matrix |
| Principle Coordinates Analysis | `cmdscale` | *stats* | |
| [DAPC](DAPC.html) | `dapc` | *adegenet* | Convert to matrix with `as.matrix` |
| [Minimum Spanning Networks](Minimum_Spanning_Networks.html) | `poppr.msn` | *poppr* | requires a distance matrix; cannot handle genpop |
K-means hierarchical clustering
----
A recent study reported that the origin of the potato late blight pathogen
*Phytophthora infestans* lies in Mexico as opposed to South America `r citep(bib['goss2014irish'])`.
We saw in the previous chapter that South American populations showed signatures
of clonal reproduction while Mexican populations showed no evidence rejecting
the null hypothesis of random mating. In this section, we will use K-means
clustering in combination with bootstrapped dendrograms to see how well this
pattern holds up. Clonal populations should have short terminal branch lengths
and should cluster according to those branches. Panmictic populations will show
no clear pattern. Let's look at the data:
```{r}
library("poppr")
data("Pinf")
Pinf
```
First, we will perform a cluster analysis:
```{r, eval = FALSE}
MX <- popsub(Pinf, "North America")
MXclust <- find.clusters(MX)
```
![MX_PCA](Pop_Structure_files/figure-html/MXPCA.png)
```{r, echo = FALSE}
cat("Choose the number PCs to retain (>=1): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("50")
```
PC stands for principal components, which are unit-less transformations of your data
that explain the variance observed. For the purposes of `find.clusters`, we
can keep as many as we want.
![MX_CLUSTER](Pop_Structure_files/figure-html/MXCLUST.png)
```{r, echo = FALSE}
cat("Choose the number clusters (>=2): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("3")
```
BIC stands for "Bayesian Information Criterion". The lower the BIC value, the
better. On the x axis are the number of clusters. We see that there is a bend at
3 clusters, indicating that the data clusters optimally into three groups.
Note that the function `find.clusters()` includes the parameter `max.n.clust` which specifies the maximum number of clusters to explore.
The default value is `round(sample_size/10)`.
If you would like to explore a higher level of groups relative to your sample size you should consider reparameterizing this value.
```{r, echo = FALSE}
MX <- popsub(Pinf, "North America")
MXclust <- find.clusters(MX, n.pca = 50, n.clust = 3)
```
And now we can see the cluster assignments:
```{r}
MXclust
```
We will go through the same procedure for the South American population.
```{r, eval = FALSE}
SA <- popsub(Pinf, "South America")
SAclust <- find.clusters(SA)
```
![SA_PCA](Pop_Structure_files/figure-html/SAPCA.png)
```{r, echo = FALSE}
cat("Choose the number PCs to retain (>=1): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("30")
```
![SA_CLUSTER](Pop_Structure_files/figure-html/SACLUST.png)
```{r, echo = FALSE}
cat("Choose the number clusters (>=2): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("4")
```
Notice here that there is no local minimum in the curve.
This indicates that there [might not be enough information in the data set to properly cluster](http://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-June/000303.html), or that you might have more clusters than the default value (in which case you may wish to include n.max.clust as an argument to the find.clusters function, and specify a higher value than the default).
We will go ahead by choosing the highest number of clusters:
```{r, echo = FALSE}
SA <- popsub(Pinf, "South America")
SAclust <- find.clusters(SA, n.pca = 30, n.clust = 4)
```
### Trees
Now we will build trees. We are using Bruvo's distance since polyploids bias
calculation of other distances:
```{r, trees, fig.width = 5, fig.height = 7}
pinfreps <- c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)
MXtree <- bruvo.boot(MX, replen = pinfreps, cutoff = 50, quiet = TRUE)
SAtree <- bruvo.boot(SA, replen = pinfreps, cutoff = 50, quiet = TRUE)
```
We see very long terminal branches in the MX tree. Let's see how the groups we
found with the clustering algorithm match up:
```{r, grouptree, fig.width = 5, fig.height = 7}
library("ape")
cols <- rainbow(4)
plot.phylo(MXtree, cex = 0.8, font = 2, adj = 0, tip.color = cols[MXclust$grp],
label.offset = 0.0125)
nodelabels(MXtree$node.label, adj = c(1.3, -0.5), frame = "n", cex = 0.8,
font = 3, xpd = TRUE)
axisPhylo(3)
```
You can see that the assigned clusters don't necessarily group with the
dendrogram clusters. Let's see what happens when we view this with the South
American population:
```{r, grouptreeSA, fig.width = 5, fig.height = 7}
plot.phylo(SAtree, cex = 0.8, font = 2, adj = 0, tip.color = cols[SAclust$grp],
label.offset = 0.0125)
nodelabels(SAtree$node.label, adj = c(1.3, -0.5), frame = "n", cex = 0.8,
font = 3, xpd = TRUE)
axisPhylo(3)
```
Everything clusters together nicely, further supporting a non-panmictic
population.
References
----------