-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
214 lines (160 loc) · 6.14 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# inferrnal
<!-- badges: start -->
[![R build status](https://github.com/brendanf/inferrnal/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/brendanf/inferrnal/actions)
[![Codecov test coverage](https://codecov.io/gh/brendanf/inferrnal/branch/master/graph/badge.svg)](https://codecov.io/gh/brendanf/inferrnal?branch=master)
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
<!-- badges: end -->
R Interface to Call Programs from Infernal RNA Covariance Model Package
Covariance Models (CM) are stochastic models of RNA sequence and secondary
structure.
[Infernal](http://eddylab.org/infernal/) (INFERence of RNA ALignment)[^2] is a
software package with various command-line tools related to CMs.
`inferrnal` (with two "`r`"s) is a lightweight R interface which calls the
Infernal tools and imports the results to R.
It is developed independently from Infernal, and Infernal must be installed in
order for it to function.
Note that Infernal does not work on Windows.
[^2]: Nawrocki, E.P., Eddy, S.R., 2013. [Infernal 1.1: 100-fold faster RNA homology
searches](https://doi.org/10.1093/bioinformatics/btt509).
Bioinformatics 29, 2933–2935.
## Installation
### Installing Infernal
The required Infernal package can be installed from Bioconda:
```
conda install -c bioconda infernal
```
or in Debian/Ubuntu Linux using apt:
```
sudo apt-get install infernal
```
or using Homebrew in MacOS:
```
brew tap brewsci/bio
brew install infernal
```
For other installation options, including source installation, see the
[Infernal homepage](http://eddylab.org/infernal/).
### Installing `inferrnal`
After Infernal is installed, the released version of `inferrnal` can be
installed from Bioconductor:
``` r
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("inferrnal")
```
Submission to Bioconductor is pending.
Alternatively, the development version of `inferrnal` can be installed from
[GitHub](https://github.com/) with:
``` r
if(!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
remotes::install_github("brendanf/inferrnal")
```
## Examples
So far three of the tools are implemented: `cmsearch`, `cmalign`, and `cmbuild`.
### cmsearch
In order to search, we need a CM.
[Rfam](https://rfam.xfam.org/) has a wide variety.
For this example, we will use the eukaryotic 5.8S rRNA, with the Rfam ID
[`RF00002`](https://rfam.xfam.org/family/RF00002).
The CM is [available from Rfam](https://rfam.xfam.org/family/RF00002/cm), but
it is also included as example data in `inferrnal`.
```{r cm}
library(inferrnal)
cm <- cm_5_8S()
```
We also need some sequences to search.
The sample data is from a soil metabarcoding study focused on fungi[^1].
The targeted region includes 5.8S as well as some of the surrounding rDNA
regions.
[^1]: Furneaux, B., Bahram, M., Rosling, A., Yorou, N.S., Ryberg, M., 2021.
[Long- and short-read metabarcoding technologies reveal similar spatiotemporal
structures in fungal communities](https://doi.org/10.1111/1755-0998.13387).
Molecular Ecology Resources 21, 1833–1849.
```{r reads}
sampfasta <- sample_rRNA_fasta()
```
Use `cmsearch()` to locate the 5.8S RNA in each sequence.
```{r cmsearch}
library(inferrnal)
cmsearch(cm = cm, seq = sampfasta, cpu = 1)
```
Instead of passing a file name, you can also supply a `DNAStringSet` or
`RNAStringSet` object from the `Biostrings` package.
```{r cmsearch-seq}
sampseqs <- Biostrings::readDNAStringSet(sampfasta)
cmsearch(cm = cm, seq = sampseqs, cpu = 1)
```
`cmsearch`, by default, returns a table with information about each hit.
However, it can optionally also output an alignment of the hits to a file
in Stockholm format.
```{r cmsearch-aln}
alnfile <- tempfile("alignment-", fileext = ".stk")
cmsearch(cm = cm, seq = sampseqs, alignment = alnfile)
```
`inferrnal` includes a parser for Stockholm alignments, which also
imports column annotations.
```{r read_msa}
msa <- read_stockholm_msa(alnfile)
```
`read_stockholm_msa` returns an object of class
`StockholmRNAMultipleAlignment`.
It is also possible to load DNA or AA alignments using an optional
`type=` argument to `read_stockholm_msa`.
```{r msa_align}
msa
```
In addition to the alignment itself, the Stockholm output includes the
consensus secondary structure and the reference annotation, as defined in the
CM, as column ("GC") annotations named "SS_cons" and "RF", as well as the
posterior probability that each base is aligned in the correct position as
residue ("GR") annotations named "PP".
For more information about these annotations, including the encoding of secondary
structure and posterior probabilities, see the
[Infernal documentation](http://eddylab.org/infernal/Userguide.pdf).
### cmalign
If you have sequences which have already been trimmed to contain only the RNA
defined by the CM (possibly truncated, but not extended), then you can align
them to the CM using `cmalign`.
This is much faster than `cmsearch`.
This example uses the results of `cmsearch` from the previous section, after
removing gaps.
```{r}
unaln <- sample_rRNA_5_8S()
unaln_seq <- Biostrings::readRNAStringSet(unaln)
unaln_seq
aln <- cmalign(cm, unaln_seq, cpu = 1)
aln
```
### cmbuild
`cmbuild` is used to create new CMs from annotated multiple sequence alignments.
To illustrate the process, we use the seed alignment for the 5.8S rRNA CM from
RFAM.
It is included as a sample file in `inferrnal`.
```{r}
new_cm <- file.path(tempdir(), "5_8S.cm")
cmbuild(new_cm, msafile = stk_5_8S(), force = TRUE, quiet = FALSE)
```
This CM is not calibrated, so it cannot be used for `cmsearch`,
but it can be used in `cmalign`.
```{r}
aln2 <- cmalign(new_cm, unaln_seq, cpu = 1)
```
The resulting alignment is the same as the one created using the CM from RFAM,
because they are based on the same seed alignment, and used the same (default)
options for `cmbuild`.
```{r}
all.equal(aln, aln2)
```