-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathDMM.Rmd
executable file
·115 lines (91 loc) · 3.62 KB
/
DMM.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
---
title: "Dirichlet Multinomial Mixtures"
author: "Leo Lahti, Sudarshan Shetty et al."
bibliography:
- bibliography.bib
output:
BiocStyle::html_document:
number_sections: no
toc: yes
toc_depth: 4
toc_float: true
self_contained: true
thumbnails: true
lightbox: true
gallery: true
use_bookdown: false
highlight: haddock
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{microbiome tutorial - DMM}
%\usepackage[utf8]{inputenc}
%\VignetteEncoding{UTF-8}
-->
## Community typing with Dirichlet Multinomial Mixtures
[Dirichlet Multinomial Mixtures (DMM)](https://doi.org/10.1371/journal.pone.0030126) (Quince et al. 2012) is a probabilistic method for community typing (or clustering) of microbial community profiling data. It is an infinite mixture model, which means that the method can infer the optimal number of community types. Note that the number of community types is likely to grow with data size.
Let us load example data.
```{r DMM0, fig.width=6, fig.height=5, warning=FALSE, message=FALSE}
library(microbiome)
library(DirichletMultinomial)
library(reshape2)
library(magrittr)
library(dplyr)
# Load example data
data(dietswap)
pseq <- dietswap
# To speed up, only consider the core taxa
# that are prevalent at 0.1% relative abundance in 50% of the samples
# (note that this is not strictly correct as information is
# being discarded; one alternative would be to aggregate rare taxa)
pseq.comp <- microbiome::transform(pseq, "compositional")
taxa <- core_members(pseq.comp, detection = 0.1/100, prevalence = 50/100)
pseq <- prune_taxa(taxa, pseq)
# Pick the OTU count matrix
# and convert it into samples x taxa format
dat <- abundances(pseq)
count <- as.matrix(t(dat))
```
Fit the DMM model. Let us set the maximum allowed number of community types to 3 to speed up the example.
```{r DMM, fig.width=6, fig.height=5, warning=FALSE, message=FALSE, eval=TRUE}
fit <- lapply(1:3, dmn, count = count, verbose=TRUE)
```
Check model fit with different number of mixture components using standard information criteria
```{r DMMplot, fig.width=6, fig.height=5, warning=FALSE, message=FALSE, eval=TRUE}
lplc <- base::sapply(fit, DirichletMultinomial::laplace) # AIC / BIC / Laplace
aic <- base::sapply(fit, DirichletMultinomial::AIC) # AIC / BIC / Laplace
bic <- base::sapply(fit, DirichletMultinomial::BIC) # AIC / BIC / Laplace
#plot(lplc, type="b", xlab="Number of Dirichlet Components", ylab="Model Fit")
#lines(aic, type="b", lty = 2)
#lines(bic, type="b", lty = 3)
```
Pick the optimal model
```{r DMM3, fig.width=6, fig.height=5, warning=FALSE, message=FALSE,error=FALSE, eval=TRUE}
best <- fit[[which.min(unlist(lplc))]]
```
Mixture parameters pi and theta
```{r DMM4, fig.width=6, fig.height=5, warning=FALSE, message=FALSE, eval=TRUE}
mixturewt(best)
```
Sample-component assignments
```{r DMM5, fig.width=6, fig.height=5, warning=FALSE, message=FALSE, eval=TRUE}
ass <- apply(mixture(best), 1, which.max)
```
Contribution of each taxonomic group to each component
```{r DMM6, fig.width=9, fig.heigth=6, out.width="400px", warning=FALSE, message=FALSE}
for (k in seq(ncol(fitted(best)))) {
d <- melt(fitted(best))
colnames(d) <- c("OTU", "cluster", "value")
d <- subset(d, cluster == k) %>%
# Arrange OTUs by assignment strength
arrange(value) %>%
mutate(OTU = factor(OTU, levels = unique(OTU))) %>%
# Only show the most important drivers
filter(abs(value) > quantile(abs(value), 0.8))
p <- ggplot(d, aes(x = OTU, y = value)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = paste("Top drivers: community type", k))
print(p)
}
```