-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
399 lines (299 loc) · 13 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
dpi = 150
)
```
# phylotax
<!-- badges: start -->
[![R build status](https://github.com/brendanf/phylotax/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/brendanf/phylotax/actions)
[![Codecov test coverage](https://codecov.io/gh/brendanf/phylotax/branch/master/graph/badge.svg)](https://codecov.io/gh/brendanf/phylotax?branch=master)
<!-- badges: end -->
## Installation
Install the development version from [GitHub](https://github.com/) with:
```r
# install.packages("devtools")
devtools::install_github("brendanf/phylotax")
```
## Overview
The PHYLOTAX algorithm takes as input taxonomic annotations from one or more
primary taxonomic assignment algorithms, and refines them using a taxonomic tree.
The refinements are of two basic types:
1. Tips of the tree which are unassigned are assigned to a taxon if the tree
supports their inclusion in that taxon.
2. Conflicting assignments between multiple primary algorithms are resolved
using the tree.
The `phylotax` package also includes the wrapper function `taxonomy()` which
assigns taxonomy to sequences using
[DADA2](https://benjjneb.github.io/dada2/assign.html),
[IDTAXA](http://www2.decipher.codes/Classification.html),
or [SINTAX](https://www.drive5.com/usearch/manual/cmd_sintax.html), along the
`taxtable()` function which converts the results to a uniform format.
### Example data
Here is an example of a tree:
```{r, fig.asp = 0.5}
library(phylotax)
plot(example_tree(), show.node.label = TRUE)
```
Here is a set of taxonomic assignments for the tips of the tree, based on
two hypothetical primary assignment algorithms "XTAX" and "YTAX".
The required columns are "label", "rank", and "taxon", which identify individual OTUs,
taxonomic ranks, and taxonomic assignments.
The OTU labels are the same as the tip labels on the tree,
in this case the letters A-F, although some are missing from the taxonomy table
because the algorithms could not place them.
Only assignments at the rank of genus are present in this small example,
and the genera in question are called "Tax1" and "Tax2".
Our example also has a "method" column, which PHYLOTAX uses to identify which
assignments come from the same source.
```{r}
example_taxa()
```
If we sort by the taxon label, we can see that XTAX and YTAX disagree about the
assignment of tips B and D.
Neither algorithm has placed tips A and E, and XTAX also failed to place tip F.
```{r}
dplyr::arrange(example_taxa(), label)
```
### Use PHYLOTAX
Use PHYLOTAX to resolve conflicts and assign additional tips where the tree
supports it.
```{r}
phylotax_out <- phylotax(tree = example_tree(), taxa = example_taxa())
```
PHYLOTAX returns a list of class "`phylotax`" containing the tree,
taxa assignments for tips and internal nodes, as well as tables dividing the
primary assignments into those which were rejected, those which were retained,
and those which were missing from the input tree.
```{r}
phylotax_out$assigned
```
```{r}
phylotax_out$retained
```
```{r}
phylotax_out$rejected
```
PHYLOTAX has used the following logic:
1. It's not possible to decide what the root (node 1) is, because one of its
direct children (tip A) is completely unassigned.
2. It's not possible to decide node 2, because there are differences between
assignments for two of its descendents (tip C and tip F).
3. All of the descendents of node 3 (tip B and tip C) have at least one
assignment of Tax2. PHYLOTAX removes all conflicting assignments (XTAX's
assignment of Tax1 to tip B) and gives its own assignment of Tax2 to node 3
and all its children.
4. All of the descendents of node 4 (tips D, E, and F) either have an assignment
of Tax1 (D and F) or are unassigned (E). Furthermore, both branches coming
from node 4 (tip D and node 5) do have some assignments. PHYLOTAX removes
the conflicting assignments (XTAX's assignment of Tax2 to tip D) and gives its
own assignment of Tax1 to node 4 and all its children.
5. At node 5, there is nothing to do, because PHYLOTAX already assigned it to
Tax1 in step 4.
### Continued analysis
If you are continuing on with analysis using the
[phyloseq](https://joey711.github.io/phyloseq/index.html) package, then you can
easily create a `phyloseq` object from a `phylotax` object and an OTU table.
To demonstrate, we first create a random presence/absence OTU table for 5
samples and the 6 OTUS present in our example dataset.
Each species has a 50% chance to be present in each sample.
```{r}
library(phyloseq)
set.seed(1)
otus <- matrix(rbinom(30, 1, 0.5),nrow = 5, dimnames = list(1:5, LETTERS[1:6]))
otus <- otu_table(otus, taxa_are_rows = FALSE)
```
Now we can easily create the phyloseq object.
```{r}
physeq <- phylotax_to_phyloseq(phylotax_out, otus)
```
And use it for plots or whatever further analysis is needed.
```{r, fig.asp = 0.25}
plot_tree(physeq, color = "genus")
```
## Another example
Now we'll try a slightly longer example.
Some downloaded reference sequences in two database formats, as well as "unknowns",
are included as example data in `phylotax`.
This example goes through the process of aligning, building a tree, performing
taxonomic identification by several algorithms, and then refining the
assignments using the tree.
The data in question are LSU rDNA sequences belonging to the fungal order
Sebacinales, taken from the [RDP](https://rdp.cme.msu.edu/) fungal training set.
At the time of the publication of the RDP training set, much of the order
Sebacinales was grouped into a large, polyphyletic genus, *Sebacina*.
Since the work of Oberwinkler et al.<sup id="a1">[1](#f1)</sup>, the order has
been split in two families, Serendipitaceae and Sebacinaceae, and several
additional genera.
For sequences in the database which are annotated as a particular species,
we can look up the current name and classification for that species.
However, for sequences originally annotated as "*Sebacina* sp.",
"uncultured *Sebacina*", etc., there is way to decide what family/genus it should
belong in now, except by re-identifying the sequences.
This is a very small example, with only 15 reference sequences and 39 "unknown"
sequences, so it will run quickly. The references are the sequences from the
dataset that were annnotated all the way to genus level, while the unknowns are
the ones that weren't.
```{r}
unknowns <- system.file("extdata/unknowns.fasta.gz", package = "phylotax")
unknowns <- Biostrings::readDNAStringSet(unknowns)
```
Some sequences from Dacrymycetes are also included to use as an outgroup in the tree.
```{r}
outgroup <- system.file("extdata/dacrymycetes.fasta.gz", package = "phylotax")
outgroup <- Biostrings::readDNAStringSet(outgroup)
```
## Alignment and phylogeny
We'll do a quick alignment and phylogeny in R using `DECIPHER`.
```{r}
suppressPackageStartupMessages(library(DECIPHER))
aln <- AlignSeqs(c(unknowns, outgroup))
```
Trim the ends where there are a lot of gaps.
```{r}
colgaps <- colSums(as.matrix(aln) == "-")
start <- min(which(cummin(colgaps) == 0))
end <- max(which(rev(cummin(rev(colgaps)) == 0)))
aln <- Biostrings::subseq(aln, start = start, end = end)
```
Now we can make a tree.
```{r}
distmat <- DistanceMatrix(aln, correction = "Jukes-Cantor")
dendro <- IdClusters(distmat, method = "ML", type = "dendrogram", myXStringSet = aln)
```
PHYLOTAX needs the tree to be a `phylo` object from the `ape` package. We can
convert from the `dendrogram` in a two-step process.
```{r}
library(ape)
tree <- as.phylo(as.hclust(dendro))
```
How does it look?
```{r, fig.asp = 1}
plot(tree, cex = 0.5)
```
The tree is already rooted correctly, so we can move on.
## Primary taxonomic assignment
DADA2 and SINTAX require the taxonomy to be presented in a different format in
the sequence headers, so the reference file exists in two versions.
```{r}
dada_ref <- system.file("extdata/sebacinales.dada2.fasta.gz", package = "phylotax")
sintax_ref <- system.file("extdata/sebacinales.sintax.fasta.gz", package = "phylotax")
```
Let's identify the unknown sequences using a few different assignment
algorithms. Different algorithms come from different people, and so they use
different interfaces and give results in different formats. Some are implemented
in R packages, others aren't. However, the `taxonomy()` and `taxtable()`
functions in `phylotax` simplify this process by making a single wrapper that
calls different external packages.
For starters, we'll used the RDP naïve Bayesian classifier, which is also
implemented in the `DADA2` R package.
```{r}
dada2_result <- taxonomy(unknowns, reference = dada_ref, method = "dada2")
```
`DADA2` returns taxonomy as a list of matrices.
```{r}
str(dada2_result)
```
Next, we'll use another R-friendly taxonomic assignment algorithm, IDTAXA from
the `DECIPHER` package. IDTAXA requires that a model be trained using the
reference database, which can then be used over and over again to identify
different sequences. Although using the functions provided in `DECIPHER` lets you
tweak your training, `phylotax` also has a wrapper for this, which takes the
reference database in the same format as SINTAX.
```{r}
idtaxa_model <- train_idtaxa(sintax_ref)
```
Now that the model is trained, we can identify the unknown sequences.
```{r}
idtaxa_result <- taxonomy(unknowns, reference = idtaxa_model, method = "idtaxa")
```
IDTAXA gives its result as a list of lists. Here are the first three elements:
```{r}
str(idtaxa_result[1:3])
```
Finally, we'll use SINTAX, an algorithm which is implemented in USEARCH and its
open-source reimplementation, VSEARCH. For this to work, one of these two must
be installed. I'm using VSEARCH; to use USEARCH give `exec = "usearch"` as an
argument.
```{r}
sintax_result <- taxonomy(unknowns, reference = sintax_ref, method = "sintax")
```
SINTAX produces output as a delimited file, which `taxonomy()` has parsed as a
`tibble`. However this is still not the format we need for PHYLOTAX.
```{r}
str(sintax_result)
```
Fortunately, the `taxtable()` function turns any of these three formats into
the format PHYLOTAX needs:
```{r}
dada2_taxonomy <- taxtable(dada2_result, min_confidence = 0.6, names = names(unknowns))
idtaxa_taxonomy <- taxtable(idtaxa_result, min_confidence = 0.6)
sintax_taxonomy <- taxtable(sintax_result, min_confidence = 0.6)
```
Now they are all in the same format:
```{r}
head(dada2_taxonomy)
head(idtaxa_taxonomy)
head(sintax_taxonomy)
```
However, they differ in how many assignments they each made:
```{r}
nrow(dada2_taxonomy)
nrow(idtaxa_taxonomy)
nrow(sintax_taxonomy)
```
Keep in mind that assigning one sequence from kingdom to genus is a total of six assignments.
Now, we can add a "method" column to each of the taxonomy tables and merge them.
```{r}
dada2_taxonomy$method = "DADA2"
idtaxa_taxonomy$method = "IDTAXA"
sintax_taxonomy$method = "SINTAX"
combined_taxonomy = rbind(dada2_taxonomy, idtaxa_taxonomy, sintax_taxonomy)
```
## PHYLOTAX
The PHYLOTAX algorithm doesn't really make sense in an unrooted tree. In a
future version of this README, I'll include some outgroup sequences.
The `make_taxon_labels()` and `relabel_tree()` functions let us summarize the
state of the assignments before and after PHYLOTAX. The `abbrev=TRUE` argument
makes some abbreviations of common elements of fungal taxa, to make these
easier to read.
```{r, fig.width = 6, fig.height = 6}
pre_labels <- make_taxon_labels(combined_taxonomy, abbrev = TRUE)
pre_tree <- relabel_tree(tree, pre_labels$old, pre_labels$new)
plot.phylo(pre_tree, cex = 0.5)
```
For each rank, we have a number followed by an abbreviated taxon name. The
number is the number of different algorithms that assigned that taxon. When
there's a conflict, we see the alternates in angle brackets.
For an example of interpretation, look at Seq4.
Out of a total 3 possible, it got:
* 3 assignments for kingdom F(ungi)
* 3 assignments for phylum B(asidiomycota)
* 3 assignments for class Agarico(mycetes)
* 3 assignments for order Sebacin(ales)
* 3 assignments for family Sebacin(aceae)
* 1 assignment each for the genera *Sebacina* and *Tremellodendron*.
Now that we have a tree and a set of primary taxonomic assignments, it's time
for PHYLOTAX.
```{r}
p <- phylotax(tree = tree, taxa = combined_taxonomy)
```
We can look at the tree again to see how PHYLOTAX has cleaned up the assignments.
```{r, fig.width = 6, fig.height = 6}
post_labels <- make_taxon_labels(rbind(p$retained, p$assigned), abbrev = TRUE)
post_tree <- relabel_tree(tree, post_labels$old, post_labels$new)
plot.phylo(post_tree, cex = 0.5)
```
Compare this to the original annotations!
```{r, fig.width = 6, fig.height = 6}
old_tree <- relabel_tree(tree, names(sebacinales_oldnames), sebacinales_oldnames)
plot(old_tree, cex = 0.5)
```
<a name="f1">1</a> Oberwinkler, F., Riess, K., Bauer, R. et al. Morphology and molecules: the Sebacinales, a case study. Mycol Progress 13, 445–470 (2014). https://doi.org/10.1007/s11557-014-0983-1 [↩](#a1)