Pop_Structure.Rmd

```{r,message=FALSE,echo=FALSE}
  html <- TRUE
  library("knitcitations")
  library("knitr")
  cite_options(citation_format = "pandoc", max.names = 3, style = "html", hyperlink = "to.doc")
  bib <- read.bibtex("bibtexlib.bib")
  opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE,
                 fig.width = 10, fig.height = 6, cache = TRUE)
  if (html) opts_chunk$set(out.width = "700px", dpi = 300)
  # use this to set knitr options:
  # http://yihui.name/knitr/options #chunk_options
```
---
title: 'Population structure: $G_{ST}$, genetic distance, and clustering'
subtitle: '*ZN Kamvar, SE Everhart and NJ Gr&uuml;nwald*'
---

In this chapter we explore various ways of assessing if population are
structured (e.g., differentiated). You can think of population structure as
identifying clusters or groups of more closely related individuals resulting
from reduced gene flow among these groups. Populations can be studied to
determine if they are structured by using, for example, population
differentiation summary statistics (e.g. $G_{ST}$), clustering or [minimum
spanning networks](Minimum_Spanning_Networks.html). Note, that this chapter will
utilize many data sets due to the unique features offered by each. Let's first
look at an example of population differentiation based on $G_{ST}$.

$G_{ST}$ an example with *Felis catus* data.
----

Assessing genetic diversity almost always starts with an analysis of a parameter
such as $G_{ST}$. There are [lengthy debates](http://www.molecularecologist.com/2011/03/should-i-use-fst-gst-or-d-2/)
as to what measure of differentiation is better `r citep(bib['meirmans2011assessing'])`.
Instead of going into that lengthy debate, it would be more worthwhile to point
you into the direction of a package dedicated to Modern Methods of
Differentiation called *mmod*. We will use the data set *nancycats* containing
17 colonies of cats collected from Nancy, France. As cats tend to stay within
small groups, we expect to see some population differentiation. In terms of
these diversity measures, an index of $G_{ST} = 0$ indicates no differentiation,
whereas $G_{ST} = 1$ indicates that populations are segregating for differing
alleles.

Let's load the package and the example data set:

```{r, loadstuff}
library("mmod")
data("nancycats")
nancycats
```

Now we will use Hendrick's standardized $G_{ST}$ to assess population structure
among these populations `r citep(bib['hedrick2005standardized'])`.

```{r, nancycats}
Gst_Hedrick(nancycats)
```
What does this output tell us?

Next we will look at genetic distance to find related groups of individuals.

Genetic Distance
----

If we wanted to analyze the relationship between individuals or populations, we
would use genetic distance measures which calculate the "distance" between
samples based on their genetic profile. These distances can be visualized with
heatmaps, dendrograms, or minimum spanning networks. In the package *poppr*,
there are several distances available:

 | Distance | Function | Marker type | Can handle missing data |
 | ---- | ---- | ---- | ---- | ---- |
 | Bruvo's distance | `bruvo.dist` | microsatellite | yes |
 | Edwards' distance | `edwards.dist` | any | no |
 | Nei's distance | `nei.dist` | any | no |
 | Provesti's distance | `provesti.dist` | any | yes |
 | Reynolds' distance | `reynolds.dist` | any | no |
 | Rogers' distance | `rogers.dist` | any | no |
 | Provesti's distance | `bitwise.dist` | SNP | yes |

One common way to visualize a genetic distance is with a dendrogram. For this
example, we will use the *microbov* data set `r citep(bib["laloe2007consensus"])`.
This contains information on 704 cattle from both Africa and France over several
different breeds. We can create a dendrogram over all 704 samples, but that
would be difficult to visualize. For our purposes, let's take ten random samples
and calculate Provesti's distance, which will return the fraction of the number
of differences between samples:


```{r, cattlefirst}
library("poppr")
library("ape") # To visualize the tree using the "nj" function
library("magrittr")
data(microbov)
set.seed(10)
ten_samples <- sample(nInd(microbov), 10)
mic10       <- microbov[ten_samples]
(micdist    <- provesti.dist(mic10))
```

The above represents the pairwise distances between these 10 samples. We will
use this distance matrix to create a neighbor-joining tree.

```{r, cattleplot, fig.width = 5, fig.height = 7}
theTree <- micdist %>%
  nj() %>%    # calculate neighbor-joining tree
  ladderize() # organize branches by clade
plot(theTree)
add.scale.bar(length = 0.05) # add a scale bar showing 5% difference.
```

Notice that the sample names start with either "AF" or "FR". This indicates
their country of origin and we are seeing that the populations cluster
correspondingly. Of course, a tree is a hypothesis and one way of generating
support is to bootstrap loci. This can be achieved with the *poppr* function
`aboot`.

```{r, cattleboot, fig.width = 5, fig.height = 7}
set.seed(999)
aboot(mic10, dist = provesti.dist, sample = 200, tree = "nj", cutoff = 50, quiet = TRUE)
```

The bootstrap value of 100 on the node separating the French and African samples
gives support that the country of origin is a factor in how these breeds are
structured. If we wanted to analyze all of the breeds against one another, it
would be better to create a bootstrapped dendrogram based on a genetic distance.
To do this, we will add 3 stratifications to the microbov data set: Country,
Breed, and Species. We will then set the population to Country by Breed, convert
the data to a genpop object and then create a tree using `aboot` with Nei's
genetic distance.

```{r, cattlepopboot, fig.width = 5, fig.height = 7, cache = TRUE}
# Setting up the data
strata(microbov) <- data.frame(other(microbov))
microbov
nameStrata(microbov) <- ~Country/Breed/Species

# Analysis
set.seed(999)
microbov %>%
  genind2genpop(pop = ~Country/Breed) %>%
  aboot(cutoff = 50, quiet = TRUE, sample = 1000, distance = nei.dist)
```

Now we can see that, in all 1,000 bootstrapped trees, the African and French
samples were each in separate clades. Of course, dendrograms are only one type
of analysis you can use genetic distances for. Below is a table describing some
of the different analyses for which you can utilize genetic distance:

| Analysis | Function | Package | Note |
| ---- | ----- | ---- | ---- |
| Bootstrapped dendrograms | `aboot` | *poppr* |  |
| Mantel Test | `mantel.randtest` | *ade4* | To be used with geographic distance matrix |
| Principle Coordinates Analysis | `cmdscale` | *stats* |  |
| [DAPC](DAPC.html) | `dapc` | *adegenet* | Convert to matrix with `as.matrix` |
| [Minimum Spanning Networks](Minimum_Spanning_Networks.html) | `poppr.msn` | *poppr* | requires a distance matrix; cannot handle genpop |


K-means hierarchical clustering
----

A recent study reported that the origin of the potato late blight pathogen
*Phytophthora infestans* lies in Mexico as opposed to South America `r citep(bib['goss2014irish'])`.
We saw in the previous chapter that South American populations showed signatures
of clonal reproduction while Mexican populations showed no evidence rejecting
the null hypothesis of random mating. In this section, we will use K-means
clustering in combination with bootstrapped dendrograms to see how well this
pattern holds up. Clonal populations should have short terminal branch lengths
and should cluster according to those branches. Panmictic populations will show
no clear pattern. Let's look at the data:

```{r}
library("poppr")
data("Pinf")
Pinf
```

First, we will perform a cluster analysis:

```{r, eval = FALSE}
MX <- popsub(Pinf, "North America")
MXclust <- find.clusters(MX)
```
![MX_PCA](Pop_Structure_files/figure-html/MXPCA.png)

```{r, echo = FALSE}
cat("Choose the number PCs to retain (>=1): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("50")
```

PC stands for principal components, which are unit-less transformations of your data
that explain the variance observed. For the purposes of `find.clusters`, we
can keep as many as we want.

![MX_CLUSTER](Pop_Structure_files/figure-html/MXCLUST.png)

```{r, echo = FALSE}
cat("Choose the number clusters (>=2): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("3")
```

BIC stands for "Bayesian Information Criterion". The lower the BIC value, the
better. On the x axis are the number of clusters. We see that there is a bend at
3 clusters, indicating that the data clusters optimally into three groups.

Note that the function `find.clusters()` includes the parameter `max.n.clust` which specifies the maximum number of clusters to explore.
The default value is `round(sample_size/10)`.
If you would like to explore a higher level of groups relative to your sample size you should consider reparameterizing this value.


```{r, echo = FALSE}
MX <- popsub(Pinf, "North America")
MXclust <- find.clusters(MX, n.pca = 50, n.clust = 3)
```

And now we can see the cluster assignments:

```{r}
MXclust
```

We will go through the same procedure for the South American population.

```{r, eval = FALSE}
SA <- popsub(Pinf, "South America")
SAclust <- find.clusters(SA)
```
![SA_PCA](Pop_Structure_files/figure-html/SAPCA.png)

```{r, echo = FALSE}
cat("Choose the number PCs to retain (>=1): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("30")
```

![SA_CLUSTER](Pop_Structure_files/figure-html/SACLUST.png)

```{r, echo = FALSE}
cat("Choose the number clusters (>=2): \n")
```
```{r, echo = FALSE, message=TRUE, comment= ">"}
message("4")
```

Notice here that there is no local minimum in the curve.
This indicates that there [might not be enough information in the data set to properly cluster](http://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-June/000303.html), or that you might have more clusters than the default value (in which case you may wish to include n.max.clust as an argument to the find.clusters function, and specify a higher value than the default).
We will go ahead by choosing the highest number of clusters:


```{r, echo = FALSE}
SA <- popsub(Pinf, "South America")
SAclust <- find.clusters(SA, n.pca = 30, n.clust = 4)
```

### Trees

Now we will build trees. We are using Bruvo's distance since polyploids bias
calculation of other distances:

```{r, trees, fig.width = 5, fig.height = 7}
pinfreps <- c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)
MXtree <- bruvo.boot(MX, replen = pinfreps, cutoff = 50, quiet = TRUE)
SAtree <- bruvo.boot(SA, replen = pinfreps, cutoff = 50, quiet = TRUE)
```

We see very long terminal branches in the MX tree. Let's see how the groups we
found with the clustering algorithm match up:

```{r, grouptree, fig.width = 5, fig.height = 7}
library("ape")
cols <- rainbow(4)
plot.phylo(MXtree, cex = 0.8, font = 2, adj = 0, tip.color = cols[MXclust$grp],
           label.offset = 0.0125)
nodelabels(MXtree$node.label, adj = c(1.3, -0.5), frame = "n", cex = 0.8,
           font = 3, xpd = TRUE)
axisPhylo(3)
```

You can see that the assigned clusters don't necessarily group with the
dendrogram clusters. Let's see what happens when we view this with the South
American population:

```{r, grouptreeSA, fig.width = 5, fig.height = 7}
plot.phylo(SAtree, cex = 0.8, font = 2, adj = 0, tip.color = cols[SAclust$grp],
           label.offset = 0.0125)
nodelabels(SAtree$node.label, adj = c(1.3, -0.5), frame = "n", cex = 0.8,
           font = 3, xpd = TRUE)
axisPhylo(3)
```

Everything clusters together nicely, further supporting a non-panmictic
population.

References
----------