forked from microbiome/tutorials
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathall.Rmd
executable file
·451 lines (324 loc) · 13 KB
/
all.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "Differential abundance testing"
bibliography:
- bibliography.bib
output:
BiocStyle::html_document:
number_sections: no
toc: yes
toc_depth: 4
toc_float: true
self_contained: true
thumbnails: true
lightbox: true
gallery: true
use_bookdown: false
highlight: haddock
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{microbiome tutorial - univariate}
%\usepackage[utf8]{inputenc}
%\VignetteEncoding{UTF-8}
%\setkeys{Gin}{width=\linewidth,height=\textheight,keepaspectratio}
-->
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# Handle citations
require(knitcitations)
require(bookdown)
# cleanbib()
# options("citation_format" = "pandoc")
bib <- read.bibtex("bibliography.bib")
#opts_chunk$set(fig.width=4, fig.height=3, par=TRUE, out.width='2in', fig.pos='H')
library(knitr)
library(ggplot2)
library(gridExtra)
knitr::opts_chunk$set(fig.path = "figure/", dev="CairoPNG")
theme_set(theme_bw(20))
```
# Differential abundance testing: univariate data
This section covers basic univariate tests for two-group comparison,
covering t-test, Wilcoxon test, and multiple testing.
The following example compares the abundance of a selected bug between
two conditions. Let us assume that the data is already properly
normalized.
Let us load example data
```{r univariate1, warning=FALSE, message=FALSE}
library(microbiome)
data(dietswap)
d <- dietswap
# Pick microbial abundances for a given taxonomic group
taxa <- "Dialister"
# Construct a data.frame with the selected
# taxonomic group and grouping
df <- data.frame(Abundance = abundances(d)[taxa,],
Group = meta(d)$nationality)
```
Compare the groups visually using a boxplot (left). However, we
observe that the abundances are in absolute scale and therefore the
comparison is not clear. Let us try the log10 transformation. Now,
the data contains many zeros and taking log10 will yield infinite
values. Hence we choose the commonly used, although somewhat
problematic, log10(1+x) transformation (right).
```{r univariate_boxplot, warning=FALSE, message=FALSE}
p1 <- ggplot(df, aes(x = Group, y = Abundance)) +
geom_boxplot() +
labs(title = "Absolute abundances", y = "Abundance\n (read count)")+ theme(plot.title = element_text(size=18))
# Let us add the log10(1+x) version:
df$Log10_Abundance <- log10(1 + df$Abundance)
p2 <- ggplot(df, aes(x = Group, y = Log10_Abundance)) +
geom_boxplot() +
labs(title = "Log10 abundances", y = "Abundance\n (log10(1+x) read count)")+ theme(plot.title = element_text(size=18))
library(patchwork)
print(p1 + p2)
```
The groups seem to differ. Let us test the difference statistically.
First, let us perform t-test, which is based on Gaussian assumptions.
Each group is expected to follow Gaussian distribution.
Significance p-value with t-test:
```{r univariate2b, warning=FALSE, message=FALSE}
print(t.test(Log10_Abundance ~ Group, data = df)$p.value)
```
Now let us investigate the Gaussian assumption in more
detail. Boxplots may not show deviations from Gaussian assumptions
very clearly Let us try another visualization; the density plot.
```{r univariate_densityplot, warning=FALSE, message=FALSE}
p <- ggplot(df, aes(fill = Group, x = Log10_Abundance)) +
geom_density(alpha = 0.5)
print(p)
```
Apparently, the data is not Gaussian distributed. In such cases, a common
procedure is to use non-parametric tests. These do not make
assumptions of the data distribution but instead compare the ordering
of the samples.
So, let us look at the significance p-value with Wilcoxon test (log10 data):
```{r univariate_wilcoxon, warning=FALSE, message=FALSE}
print(wilcox.test(Log10_Abundance ~ Group, data = df)$p.value)
```
But since the test is non-parametric, we can as well use the original absolute abundances since the
log transformation does not change sample ordering on which the Wilcoxon test is based.
Let us verify that the absolute abundances yield the same p-value for Wilcoxon test:
```{r univariate_wilcoxon2, warning=FALSE, message=FALSE}
print(wilcox.test(Abundance ~ Group, data = df)$p.value)
```
Let us compare how much the results would differ in the whole data
between t-test and Wilcoxon test. To remove non-varying taxa that
would demand extra scripting, let us for demonstration purposes now
focus on core taxa that are observed in more than 20% of the samples
with more than 3 reads.
```{r univariate4, warning=FALSE, message=FALSE}
# Core taxa to be tested
test.taxa <- core_members(d, prevalence = 20/100, detection = 3)
# Calculate p-values with the two different methods for each taxonomic unit
pvalue.ttest <- c()
pvalue.wilcoxon <- c()
for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
df <- data.frame(Abundance = abundances(d)[taxa,],
Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
pvalue.ttest[[taxa]] <- t.test(Log10_Abundance ~ Group, data = df)$p.value
pvalue.wilcoxon[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
}
# Arrange the results in a data.frame
pvalues <- data.frame(taxon = test.taxa,
pvalue.ttest = pvalue.ttest,
pvalue.wilcoxon = pvalue.wilcoxon)
# Note that multiple testing occurs.
# We must correct the p-values.
# let us apply the standard Benjamini-Hochberg False Discovery Rate (FDR)
# correction
pvalues$pvalue.ttest.adjusted <- p.adjust(pvalue.ttest)
#pvalues$pvalue.ttest.adjusted <- p.adjust(pvalues$pvalue.ttest)
pvalues$pvalue.wilcoxon.adjusted <- p.adjust(pvalue.wilcoxon)
```
Compare the p-value histograms between raw and adjusteed p-values.
```{r univariate5, warning=FALSE, message=FALSE}
library(reshape2)
library(tidyverse)
pvalues$pvalue.wilcoxon<- as.numeric(pvalue.wilcoxon)
p1 <- ggplot(pvalues, aes(x = pvalue.wilcoxon)) + geom_histogram(bins = 50, binwidth = .03)+
labs(title = "Raw p-values") +
ylim(c(0, 80))
p2 <- ggplot(pvalues, aes(x = pvalue.wilcoxon.adjusted)) +
geom_histogram(bins = 50, binwidth = .03) +
labs(title = "Adjusted p-values") +
ylim(c(0, 80))
print(p1 + p2)
```
Now compare these adjusted p-values between t-test and Wilcoxon test. Let us
also highlight the p = 0.05 intervals.
```{r univariate6, warning=FALSE, message=FALSE}
p <- ggplot(data = pvalues,
aes(x = pvalue.ttest.adjusted,
y = pvalue.wilcoxon.adjusted)) +
geom_text(aes(label = taxon)) +
geom_abline(aes(intercept = 0, slope = 1)) +
geom_hline(aes(yintercept = 0.05), shape = 2) +
geom_vline(aes(xintercept = 0.05), shape = 2)
print(p)
```
# Linear models: the role of covariates
This section provides a brief hands-on introduction to the practical motivations
and use linear (and generalized linear) models.
Let us compare two groups with a linear model. We use Log10 abundances
since this is closer to the Gaussian assumptions than the absolute
count data. We can fit a linear model with Gaussian variation as follows:
```{r lm, warning=FALSE, message=FALSE}
res <- glm(Log10_Abundance ~ Group, data = df, family = "gaussian")
```
Let us investigate model coefficients
```{r lm_coefs, warning=FALSE, message=FALSE}
kable(summary(res)$coefficients, digits = 5)
```
The intercept equals to the mean in the first group:
```{r lm_coefs3, warning=FALSE, message=FALSE}
print(mean(subset(df, Group == "AAM")$Log10_Abundance))
```
The group term equals to the difference between group means:
```{r lm_coefs3b, warning=FALSE, message=FALSE}
print(mean(subset(df, Group == "AFR")$Log10_Abundance) -
mean(subset(df, Group == "AAM")$Log10_Abundance))
```
Note that the linear model (default) significance equals to t-test assuming equal variances.
```{r lm_signif, warning=FALSE, message=FALSE}
print(t.test(Log10_Abundance ~ Group, data = df, var.equal=TRUE)$p.value)
```
An important advantage of linear models, compared to plain t-test is
that they allow incorporating additional variables, such as potential
confounders (age, BMI, gender..):
```{r lm_glm, warning=FALSE, message=FALSE}
# Add a covariate:
df$sex <- meta(d)$sex
# Fit the model:
res <- glm(Log10_Abundance ~ Group + sex, data = df, family = "gaussian")
```
## Generalized linear models: a very brief overview
Let us briefly discuss the ideas underlying generalized linear models
before proceeding to the next section.
The Generalized linear model (GLM) allows a richer family of
probability distributions to describe the data. Intuitively speaking,
GLMs allow the modeling of nonlinear, nonsymmetric, and nongaussian
associations. GLMs consist of three elements:
- A probability distribution (from exponential family)
- A linear predictor η = Xβ .
- A link function g such that $E(Y) = μ = g^{−1}(η)$.
We use Poisson with (its natural) log-link. Fit abundance (read
counts) assuming that the data is Poisson distributed, and the
logarithm of its mean, or expectation, is obtained with a linear
model.
```{r lm_glm2, warning=FALSE, message=FALSE}
res <- glm(Abundance ~ Group, data = df, family = "poisson")
```
Investigate the model output:
```{r lm_glm2b, warning=FALSE, message=FALSE}
kable(summary(res)$coefficients, digits = 5)
```
# Advanced models of differential abundance
GLMs are the basis for advanced testing of differential abundance in
sequencing data. This is necessary, as the sequencing data sets
deviate from symmetric, continuous, Gaussian assumptions in many ways.
Sequencing data consists of discrete counts:
```{r pooled2, warning=FALSE, message=FALSE}
print(abundances(d)[1:5,1:3])
```
The data is sparse:
```{r pooled3, warning=FALSE, message=print}
hist(log10(1 + abundances(d)), 100)
```
Long tails of rare taxa:
```{r tail, warning=FALSE, message=print}
medians <- apply(abundances(d),1,median)/1e3
library(reshape2)
A <- melt(otu_tibble(d))
A$FeatureID <- factor(A$FeatureID, levels = rev(names(sort(medians))))
p <- ggplot(A, aes(x = FeatureID, y = value)) +
geom_boxplot() +
labs(y = "Abundance (reads)", x = "Taxonomic Group") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size=6))
print(p)
```
Overdispersion (variance exceeds the mean):
```{r pooled_overdispersion, warning=FALSE, message=FALSE}
means <- apply(abundances(d),1,mean)
variances <- apply(abundances(d),1,var)
# Calculate mean and variance over samples for each taxon
library(reshape2)
library(dplyr)
df <- melt(otu_tibble(d))
names(df) <- c("Taxon", "Sample", "Reads")
df <- df %>% group_by(Taxon) %>%
summarise(mean = mean(Reads),
variance = var(Reads))
# Illustrate overdispersion
library(scales)
p <- ggplot(df, aes(x = mean, y = variance)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
scale_x_log10(labels = scales::scientific) +
scale_y_log10(labels = scales::scientific) +
labs(title = "Overdispersion (variance > mean)")
print(p)
```
## DESeq2: differential abundance testing for sequencing data
DESeq2 analysis can accommodate those particular assumptions about
sequencing data.
```{r pooled_deseq, warning=FALSE, message=FALSE}
# Start by converting phyloseq object to deseq2 format
library(DESeq2)
ds2 <- phyloseq_to_deseq2(d, ~ group + nationality)
# Run DESeq2 analysis (all taxa at once!)
dds <- DESeq(ds2)
# Investigate results
res <- results(dds)
deseq.results <- as.data.frame(res)
df <- deseq.results
df$taxon <- rownames(df)
df <- df %>% arrange(log2FoldChange, padj)
# Print the results; flitered and sorted by pvalue and effectsize
library(knitr)
df <- df %>% filter(pvalue < 0.05 & log2FoldChange > 1.5) %>%
arrange(pvalue, log2FoldChange)
kable(df, digits = 5)
```
For comparison purposes, assess significances and effect sizes based on Wilcoxon test.
```{r pooled_topw, warning=FALSE, message=FALSE}
test.taxa <- taxa(d)
pvalue.wilcoxon <- c()
foldchange <- c()
for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
df <- data.frame(Abundance = abundances(d)[taxa,],
Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
# Calculate pvalue and effect size (difference beween log means)
pvalue.wilcoxon[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
foldchange[[taxa]] <- coef(lm(Log10_Abundance ~ Group, data = df))[[2]]
}
# Correct p-values for multiple testing
pvalue.wilcoxon.adjusted <- p.adjust(pvalue.wilcoxon)
```
```{r pooled_pcomp, warning=FALSE, message=FALSE}
par(mfrow = c(1,2))
plot(deseq.results$padj, pvalue.wilcoxon.adjusted,
xlab = "DESeq2 adjusted p-value",
ylab = "Wilcoxon adjusted p-value",
main = "P-value comparison")
abline(v = 0.05, h = 0.05, lty = 2)
plot(deseq.results$log2FoldChange, foldchange,
xlab = "DESeq2",
ylab = "Linear model",
main = "Effect size comparison")
abline(0,1)
```
## Visualizing DESEQ2 results
```{r fig.height=6, fig.width=8}
deseq.results %>%
tibble::rownames_to_column("Taxon") %>%
filter(padj <= 0.05 & abs(log2FoldChange) >= 1.5) %>%
ggplot(aes(log2FoldChange, reorder(Taxon, log2FoldChange))) +
geom_col() +
labs(x="log2FoldChange", y="Taxon")
```