-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
479 lines (395 loc) · 15 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tzara
<!-- badges: start -->
[![R build status](https://github.com/brendanf/tzara/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/brendanf/tzara/actions)
[![Codecov test coverage](https://codecov.io/gh/brendanf/tzara/branch/master/graph/badge.svg)](https://codecov.io/gh/brendanf/tzara?branch=master)
<!-- badges: end -->
To reduce computational complexity, [dada2](https://benjjneb.github.io/dada2/index.html)
only uses non-singletons as seeds for denoising.
For this strategy to work, each true sequence must be represented by at least two identical reads.
Especially with long amplicons, the probability of two reads having exactly the same errors is much lower than the probability of being error-free, so in practice this means that each true sequence must have two error-free reads.
This becomes problematic for rare sequences in long amplicon libraries.
Tzara (named after [Tristan Tzara](https://en.wikipedia.org/wiki/Tristan_Tzara), a central figure in the Dada art movement, who described the ["cut-up" technique](https://en.wikipedia.org/wiki/Cut-up_technique) for poetry) implements an alternative method for dealing with long reads:
- Cut reads of the targeted region into smaller domains using hidden Markov models (via [Hmmer](https://hmmer.org)) or covariance models (via [Infernal](http://eddylab.org/infernal/))
- Use dada2 to separately denoise each sub-region/domain
- Concatenate denoised sequences for the different domains which originated in the same read to get denoised sequences for the full read
There are also methods for deeper chimera checking and attempting to preserve information from reads which dada2 was unable to successfully map to a read.
## Installation
`tzara` is yet available from CRAN or Bioconductor,
but you can install the latest github version using:
``` r
remotes::install_github("brendanf/tzara")
```
## Example
For the example, we will conduct all analysis in a temporary directory.
In your own analysis, use your own project directory.
```{r}
root_dir <- "temp"
```
### Example data
`dada2`, and thus `tzara`,
use statistical methods which require many sequences for meaningful results.
This typically means that an analysis with realistic diversity takes several
hours to run.
For this example, we will use simulated data with only six true unique sequences and
two samples with 1000 reads each.
If you are using this document as a template for your own analysis,
you can skip this section; start with your own demultiplexed sequences,
with primers and adapters removed.
There are three sample sequences, belonging to three diferent genera.
```{r}
library(tzara)
tzara_sample
```
To verify our ability to find variants that differ by only one base pair in ITS2,
we will create a hypothetical variant of each sequence by changing one of the
bases in ITS2.
First, mark the original sequences with an "`A`"
```{r}
originals <- tzara_sample
names(originals) <- paste0(names(originals), "A")
originals
```
Now make the variants; base 480 is in ITS2 of all three.
After checking the current base at that position (none of them are "`G`"),
change it to "`G`".
```{r}
variants <- tzara_sample
names(variants) <- paste0(names(variants), "B")
Biostrings::subseq(variants, 480, 480)
Biostrings::subseq(variants, 480, 480) <- "G"
Biostrings::subseq(variants, 480, 480)
```
We will now simulate some random sequencing errors for these reads,
using the distribution of quality scores from 1000 sequences from the original
data.
Substitution errors for each simulated read are generated with the probability
implied by the quality score at each base.
This is the best possible situation for `dada2`, because the quality scores
are perfectly calibrated, and there are no PCR errors (which do not show up in
quality scores).
```{r}
simulate_errors <- function(true_seq, qual_prof, prefix = NULL) {
n <- length(true_seq)
seq <- strsplit(as.character(true_seq), "")
slen <- vapply(seq, length, 1L)
# concatenate the whole thing as one character vector
seq <- unlist(seq)
# convert to numbers
seq <- match(seq, c("A", "C", "G", "T")) - 1L
# choose a quality profile for each sequence
qual <- lapply(
sample.int(nrow(qual_prof), length(slen), replace = TRUE),
function(x) qual_prof[x,]
)
# choose quality scores for each base
qual <- mapply(sample, size = slen, prob = qual,
MoreArgs = list(x = colnames(qual_prof), replace = TRUE))
qual <- vapply(qual, paste, "", collapse = "")
qual <- Biostrings::PhredQuality(qual)
prob <- as(qual, "NumericList")
prob <- unlist(as.list(prob))
# determine which bases have substitution errors
error <- which(rbinom(prob, size = 1, prob = prob) == 1)
# randomly change the bases
if (length(error) > 0) {
seq[error] <- (seq[error] + sample(1:3, length(error), replace = TRUE)) %% 4
}
# convert back to bases
seq <- c("A", "C", "G", "T")[seq + 1L]
# Split back into individual sequence and scores
groups <- rep(seq_along(slen), times = slen)
seq <- vapply(split(seq, groups), paste, "", collapse = "")
names(seq) <- paste0(prefix, "read", seq_len(n))
names(qual) <- names(seq)
Biostrings::QualityScaledDNAStringSet(seq, qual)
}
```
We will have 900 of each original sequence, and 100 of its single-base variant,
for a total of 3000 sequences.
They are divided randomly into two samples of 1500 reads each.
```{r}
reads <- c(rep(originals, c(900, 450, 225)), rep(variants, c(100, 50, 25)))
reads <- simulate_errors(reads, quality_profile, paste0(names(reads), "_"))
reads <- split(sample(reads), rep(paste0("Sample", 1:2), 875))
```
Save the reads into a "`demux`" directory, as though we had just demultiplexed
and trimmed them.
```{r}
demux_dir <- file.path(root_dir, "demux")
if (!dir.exists(demux_dir)) dir.create(demux_dir, recursive = TRUE)
for (r in names(reads)) {
Biostrings::writeQualityScaledXStringSet(
reads[[r]],
filepath = file.path(demux_dir, paste0(r, ".fastq.gz")),
compress = "gzip"
)
}
```
### Region extraction
Region extraction can be performed using
[`ITSx`](https://microbiology.se/software/itsx/),
which can separate an rDNA sequence into SSU, ITS1, 5.8S, ITS2, and LSU regions
using hidden markov models (HMMs), or by using
[`LSUx`](https://github.com/brendanf/LSUx), which does not,
in its current version, separate SSU from ITS1,
but which also divides LSU into alternating conserved and variable regions.
The example data is from amplicons generated using the primers ITS1 and LR5,
so they include only about 12 bp of SSU, but approximately 850 bp of LSU.
Running `ITSx` would fail to detect such a short fragment of SSU,
and keep LSU in one region, which would be a bit long for `dada2`.
Instead, we will use `LSUx`.
These commands install the R packages `inferrnal` (with two "r"'s required by `LSUx`),
`LSUx`, and `tzara` if they are not already present.
To run this notebook, you will also need to install
[Infernal](http://eddylab.org/software/infernal/)
(with one "r", required by `inferrnal`).
```{r}
if (!requireNamespace("remotes")) install.packages("remotes")
if (!requireNamespace("inferrnal")) remotes::install_github("brendanf/inferrnal")
if (!requireNamespace("LSUx")) remotes::install_github("brendanf/LSUx")
if (!requireNamespace("tzara")) remotes::install_github("brendanf/tzara")
```
`LSUx` uses covariance models to identify the different regions of LSU.
The default uses a model of 5.8S to identify the 5' end of the LSU secondary structure (which includes 5.8S),
and then a model of the full 32S precursor RNA to identify features within LSU.
It is more computationally efficient to use a truncated model, which ends at the location of the 3' primer used in the study.
Such truncated models can be generated by the `LSUx` function `truncate_alignment()`,
followed by the `inferrnal` function `cmbuild()`.
However, for the case of LR5, the primer used to generate the sample data,
the required CM is included with `LSUx`.
```{r}
cm <- system.file(file.path("extdata", "fungi_32S_LR5.cm"), package = "LSUx")
```
We now loop through all of our input files, and use `LSUx::lsux()` to find the start and end points of each region in each read.
Then, we extract each of the regions with `tzara::extract_region()`,
and put each region in its own directory.
This took my development computer about 1 minute on 8 cpus.
```{r results=FALSE}
region_dir <- file.path(root_dir, "region")
dir.create(region_dir, showWarnings = FALSE)
for (f in list.files(demux_dir, full.names = TRUE)) {
pos <- LSUx::lsux(f, cm_32S = cm, ITS1 = TRUE, global = TRUE, quiet = TRUE)
regions <- unique(pos$region)
for (r in regions) {
dir.create(file.path(region_dir, r), showWarnings = FALSE)
region_file <- file.path(region_dir, r, basename(f))
if (file.exists(region_file)) next()
tzara::extract_region(
f,
positions = pos,
region = r,
outfile = region_file
)
}
}
```
`LSUx` names the variable regions of LSU according to the V numbers of Raué et al. (1988),
which can be used for both prokaryotes and eukaryotes.
Eukaryotic studies usually use the D numbers of Michot et al. (1984).
Eukaryotic ITS2 corresponds to V1, so V2 *sensu* Raué is D1 *sensu* Michot.
Conserved regions between the variable regions are numbered LSU1, LSU2, LSU3, etc.
```{r}
regions <- list.files(region_dir, pattern = "[A_Z0-9_]+", full.names = FALSE)
regions
```
### Quality filtering
Quality filtering can be performed using the standard `dada2` methods.
However, it is helpful to use different parameters for the different regions,
especially maximum and minimum length.
Note that the conserved regions can have much tighter length limits than the
variable regions.
```{r}
filter_dir <- file.path(root_dir, "filter")
dir.create(filter_dir, showWarnings = FALSE)
maxLen <- c(
ITS1 = 500,
`5_8S` = 175,
ITS2 = 500,
LSU1 = 120,
V2 = 200,
LSU2 = 175,
V3 = 1000,
LSU3 = 60,
V4 = 500,
LSU4 = 175
)
minLen <- c(
ITS1 = 50,
`5_8S` = 125,
ITS2 = 50,
LSU1 = 90,
V2 = 125,
LSU2 = 150,
V3 = 50,
LSU3 = 30,
V4 = 50,
LSU4 = 150
)
for (r in regions) {
region_fastq <- list.files(file.path(region_dir, r), full.names = TRUE)
filter_fastq <- file.path(filter_dir, r, basename(region_fastq))
dada2::filterAndTrim(
region_fastq,
filter_fastq,
truncQ = 0,
minLen = minLen[r],
maxLen = maxLen[r],
maxN = 0,
maxEE = 3,
verbose = TRUE
)
}
```
For comparison of results,
we can also perform the analysis on the full-length reads.
```{r}
full_fastq <- list.files(demux_dir, full.names = TRUE)
filter_fastq <- file.path(filter_dir, "full", basename(full_fastq))
dada2::filterAndTrim(
full_fastq,
filter_fastq,
truncQ = 0,
minLen = 1200,
maxLen = 1800,
maxN = 0,
maxEE = 3,
verbose = TRUE
)
```
Notice we've already lost a large fraction of the reads, because they have more
than 3 expected errors.
### Dereplication
Dereplication uses the standard `dada2` method.
It's important to give `qualityType = "FastqQuality"`,
because the auto-detected quality type may be wrong for PacBio data when all of the quality scores are very high.
```{r}
dereps <- list()
for (r in regions) {
region_fastq <- list.files(file.path(filter_dir, r), full.names = TRUE)
dereps[[r]] <- dada2::derepFastq(region_fastq,
qualityType = "FastqQuality",
verbose = TRUE)
}
```
Notice that in all these cases, the number of unique sequences is between 1/3
and 1/2 the total number of sequences.
This means that there are duplicates, which are necessary for `dada2` to form
the seeds for ASVs.
However, notice that the number of unique sequences is MUCH larger than the
number of "true" error-free sequences, which is 3 for most regions,
but 6 for ITS2, because we introduced single-base variants there.
Also dereplicate the full-length sequences.
```{r}
full_fastq <- list.files(file.path(filter_dir, "full"), full.names = TRUE)
dereps[["full"]] <- dada2::derepFastq(
full_fastq,
qualityType = "FastqQuality",
verbose = TRUE
)
```
### Error fitting
Because all of the regions come from the same reads,
the same error profile should apply to all of them.
To save time, we can fit just one.
However, in order to verify that the different regions have similar error
profiles, we can fit one for a variable region (ITS2) and one for a very
conserved region (LSU4).
```{r}
errITS <- dada2::learnErrors(
fls = dereps[["ITS2"]],
errorEstimationFunction = dada2::PacBioErrfun,
multithread = TRUE,
verbose = TRUE,
pool = TRUE
)
errLSU <- dada2::learnErrors(
fls = dereps[["LSU4"]],
errorEstimationFunction = dada2::PacBioErrfun,
multithread = TRUE,
verbose = TRUE,
pool = TRUE
)
```
It is always good practice to look at the error model.
```{r}
dada2::plotErrors(errITS, nominalQ = TRUE)
dada2::plotErrors(errLSU, nominalQ = TRUE)
```
Although there is a slight decreasing trend in the fit error rates,
it appears that the PacBio quality scores are not reliable indicators for this
dataset, presumably because the errors are dominated by PCR amplification errors.
We also fit an error model to the full length reads, for comparison.
```{r}
err_full <- dada2::learnErrors(
fls = dereps[["full"]],
errorEstimationFunction = dada2::PacBioErrfun,
multithread = TRUE,
verbose = TRUE,
pool = TRUE
)
```
```{r}
dada2::plotErrors(err_full, nominalQ = TRUE)
```
This error model has almost no dependence on quality scores,
and also predicts a much higher error rate in general.
### Denoising
`dada()` is run on each of the regions independently.
```{r}
dada <- list()
for (r in regions) {
dada[[r]] <- dada2::dada(
derep = dereps[[r]],
err = errLSU,
pool = TRUE,
multithread = TRUE
)
}
```
...and the full length reads for comparison.
```{r}
dada[["full"]] <- dada2::dada(
derep = dereps[["full"]],
err = err_full,
pool = TRUE,
multithread = TRUE
)
```
We can see that the number of unique sequences is almost as high as the number
of total reads.
This means most reads are singletons.
In long reads with a moderate error rate, this is expected.
### Remapping
The objects returned by `dada()` do not include enough information to map each
raw read to its assigned ASV.
This mapping can be recreated using the `derep` object that was passed to
`dada`.
```{r}
dadamaps <- list()
for (r in regions) {
dadamaps[[r]] <- tzara::dadamap(
dereps[[r]],
dada[[r]],
region = r,
dir = file.path(filter_dir, r)
)
}
recon <- tzara::reconstruct(
dadamaps,
order = c("ITS1", "5_8S", "ITS2", "LSU1", "V2", "LSU2", "V3", "LSU3", "V4",
"LSU4"),
sample_column = "name"
)
```