---
title: "NanoporeMet"
output:
md_document:
variant: gfm
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
comment = "#>"
)
```
# NanoporeMet
This repository contains the scripts to analyze (`nanoporemet.py`)
and visualize (`app.R`, `coverage.py`) metagenomic sequencing data
generated by Oxford Nanopore Technologies sequencing devices. Both viral and bacterial
analyses are supported.
## nanoporemet.py
`nanoporemet.py` analyzes metagenomic sequencing reads with *kraken2*.
Whether only viral reads or also bacterial reads are analyzed is determined by
the *kraken2* database that is selected.
`nanoporemet.py` first concatenates all `.fastq.gz` files of each barcode
within `/fastq_pass`, then runs *kraken2* on each barcode individually, and finally
combines the *kraken2* output files of all barcodes into one file, either
`virus.kraken.txt` or `virus_bacteria.kraken.txt` (depending on the selected database).
If `nanoporemet.py` is run after the sequencing run has finished and the
`sequencing_summary_*.txt` file is available, a `sequencing_summary.pdf` file is
created which plots histograms of the mean Q scores and read lengths of all reads
as well as reads passing the quality filter.
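
The per-barcode concatenation step described above can be sketched as follows. This is a minimal illustration, not the code in `nanoporemet.py`: the function name `concatenate_barcodes`, the output directory, and the output file naming are assumptions. It relies on the fact that gzip files can be appended byte for byte and still decompress as one stream.

```python
import shutil
from pathlib import Path


def concatenate_barcodes(fastq_pass, out_dir) -> dict:
    """Merge all .fastq.gz files of each barcode subdirectory into one file.

    Gzip members can be appended byte for byte, so nothing is decompressed.
    Returns a mapping of barcode name to the merged file path.
    """
    fastq_pass, out_dir = Path(fastq_pass), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for barcode_dir in sorted(p for p in fastq_pass.iterdir() if p.is_dir()):
        parts = sorted(barcode_dir.glob("*.fastq.gz"))
        if not parts:
            continue  # no reads for this barcode
        # Hypothetical naming scheme: one merged file per barcode.
        target = out_dir / f"{barcode_dir.name}.fastq.gz"
        with target.open("wb") as out:
            for part in parts:
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        merged[barcode_dir.name] = target
    return merged
```

Calling `concatenate_barcodes("fastq_pass", "merged")` from the sequencing output directory would then yield one merged file per barcode.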
### How to run
1. Enter timavo.
`ssh timavo`
2. Activate the conda environment that provides *kraken2*.
`conda activate kraken`
3. Move into the sequencing output directory, i.e., the one where you find, e.g.,
the `fastq_pass` subdirectory, or the `sequencing_summary_*.txt` file at the end
of the sequencing run.
`cd /data/GridION/GridIONOutput/<experiment>/<sample>/<flowcell>/`
4. Run the python script.
`python <path to script>/nanoporemet.py`
5. The script asks whether you want to analyze bacterial reads in addition
to viral reads.
Reply with either `yes`/`y` or `no`/`n`.
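
For illustration, the yes/no prompt in step 5 could be handled along these lines. This is a sketch under assumptions: the function names, the prompt text, and the case-insensitive matching are hypothetical and may differ from what `nanoporemet.py` actually does.

```python
def parse_yes_no(answer: str):
    """Return True for yes/y, False for no/n, and None for anything else."""
    normalized = answer.strip().lower()
    if normalized in {"yes", "y"}:
        return True
    if normalized in {"no", "n"}:
        return False
    return None


def ask_include_bacteria(ask=input) -> bool:
    """Repeat the question until a valid reply is given.

    `ask` defaults to the built-in input() and can be swapped out for testing.
    """
    while True:
        reply = parse_yes_no(ask("Analyze bacterial reads as well? [yes/no] "))
        if reply is not None:
            return reply
```

Separating the parsing from the prompt keeps the accepted replies easy to check without an interactive session.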
### Input
#### Metagenomic sequencing data
Within the sequencing output directory, the script looks for the `/fastq_pass`
subdirectory and analyzes all `.fastq.gz` files.
#### *kraken2* databases
`nanoporemet.py` uses one of two *kraken2* databases to analyze the reads.
The paths to these databases are defined within the script and can be adjusted there.
The current databases are as follows:
- viral database: `k2_human-viral_20240111`
- viral + bacterial database: `k2_human-viral_20240111`
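
As a rough sketch of how the database choice could translate into a *kraken2* call and an output file name: the helper names and the thread count below are assumptions, and the database path is passed in rather than hard-coded because the real paths live inside `nanoporemet.py`. The flags used (`--db`, `--threads`, `--gzip-compressed`, `--report`, `--output`) are standard *kraken2* options.

```python
def combined_report_name(include_bacteria: bool) -> str:
    """Name of the combined file, as described in this README."""
    return "virus_bacteria.kraken.txt" if include_bacteria else "virus.kraken.txt"


def build_kraken2_command(reads, database, report, threads=4) -> list:
    """Assemble a kraken2 command line for one barcode (nothing is executed)."""
    return [
        "kraken2",
        "--db", str(database),          # path to the selected database
        "--threads", str(threads),      # hypothetical default
        "--gzip-compressed",            # input is a .fastq.gz file
        "--report", str(report),        # per-barcode report file
        "--output", "-",                # "-" suppresses the per-read output
        str(reads),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once per barcode.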
#### Run statistics
For the creation of the histogram plots, the script looks for `sequencing_summary_*.txt`
within the sequencing output directory. If it is not available yet, this step is
simply skipped.
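
The lookup described above, including the skip when no summary file exists yet, might look like this minimal sketch. The function name `find_sequencing_summary` is hypothetical.

```python
from pathlib import Path


def find_sequencing_summary(run_dir):
    """Return the first sequencing_summary_*.txt in run_dir, or None if absent."""
    matches = sorted(Path(run_dir).glob("sequencing_summary_*.txt"))
    return matches[0] if matches else None
```

A caller would only produce the histograms when the return value is not `None`.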
### Output
#### *kraken2* analysis
The *kraken2* report with the analysis of all barcodes is saved in the sequencing
output directory. Depending on the selection of the *kraken2* database, the report
is saved as `virus.kraken.txt` or `virus_bacteria.kraken.txt`.
#### Run statistics
The histogram plots of the mean Q scores and read lengths of all reads as well as
the reads passing the quality filter are all saved in `sequencing_summary.pdf`,
which is also found within the sequencing output directory.
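
The data behind these histograms could be collected roughly as follows. This sketch assumes the column names `mean_qscore_template`, `sequence_length_template`, and `passes_filtering` that ONT sequencing summary files typically use; check the header of your own file, since the script may read different columns. The plotting itself is left out.

```python
import csv


def load_read_stats(summary_path) -> dict:
    """Collect mean Q scores and read lengths for all reads and passing reads."""
    stats = {
        "all": {"qscore": [], "length": []},
        "pass": {"qscore": [], "length": []},
    }
    with open(summary_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            qscore = float(row["mean_qscore_template"])    # assumed column name
            length = int(row["sequence_length_template"])  # assumed column name
            stats["all"]["qscore"].append(qscore)
            stats["all"]["length"].append(length)
            if row["passes_filtering"].strip().upper() == "TRUE":
                stats["pass"]["qscore"].append(qscore)
                stats["pass"]["length"].append(length)
    return stats
```

Each of the four lists can then be handed to a plotting library to draw one histogram.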
## Shiny app
The `app.R` script is a Shiny app that visualizes the *kraken2* report
generated by `nanoporemet.py`. Upload `virus.kraken.txt` or
`virus_bacteria.kraken.txt` to the app, select a barcode, and choose whether to
analyze viral or bacterial reads at either species or genus level.
Endogenous retroviruses and phages as well as *blocklisted* viruses can be hidden
from the output (the blocklist can be updated within `app.R`).
The Shiny app shows the taxonomic distribution of the reads in a barplot as well as
a list of all viral or bacterial species or genera found in the sample
(per barcode).
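
The rank and blocklist filtering can be illustrated outside of Shiny with a short Python sketch. It is not taken from `app.R` (which is written in R). It assumes lines in the standard six-column, tab-separated *kraken2* report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, name); the combined file produced by `nanoporemet.py` may carry additional per-barcode information that this sketch ignores.

```python
def filter_report(lines, rank_code="S", blocklist=()):
    """Keep report rows of one rank (e.g. "S" species, "G" genus).

    Returns (name, clade_reads, percent) tuples sorted by read count,
    skipping any name found in the blocklist (case-insensitive).
    """
    blocked = {name.lower() for name in blocklist}
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a standard report row
        percent, clade_reads, _direct, rank, _taxid, name = fields[:6]
        name = name.strip()  # names are indented by taxonomic depth
        if rank != rank_code or name.lower() in blocked:
            continue
        hits.append((name, int(clade_reads), float(percent)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Passing `rank_code="G"` switches the listing to genus level.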
## coverage.py
`coverage.py` automates coverage plot generation for Oxford Nanopore Technologies
reads. It first concatenates all reads within `/fastq_pass` and then maps them
to the chosen reference sequence (an indexed `.fasta` file) using *minimap2*.
### How to run
1. Enter timavo.
`ssh timavo`
2. Activate the conda environment that provides *minimap2*.
`conda activate minimap2`
3. Move into the sequencing output directory, i.e., the one where you find, e.g.,
the `fastq_pass` subdirectory.
`cd /data/GridION/GridIONOutput/<experiment>/<sample>/<flowcell>/`
4. Run the python script.
`python <path to script>/coverage.py`
5. You will be asked to enter the path to the indexed reference sequence.
### Input
#### Metagenomic sequencing data
Within the sequencing output directory, the script looks for the `/fastq_pass`
subdirectory and analyzes all `.fastq.gz` files.
#### Reference sequence
The path to the reference sequence is provided by the user upon running the script.
Make sure the reference sequence is indexed and stored in
`/analyses/ONT_analyses/bwa/references/<virus/bacteria>/<name>/`.
To index the reference `.fasta` file, move into `/analyses/ONT_analyses/bwa/` and run:
`./bwa index ./references/<virus/bacteria>/<name>/*.fasta`.
### Output
#### Coverage plot
Within the sequencing output directory, you will find a new subdirectory with the
name of the reference sequence. Next to the coverage plot (PDF), it also contains
the `.sam`, `.bam`, and `.coverage` files.
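
One plausible way to produce these three files is sketched below. It is an assumption that `coverage.py` uses *samtools* for sorting and depth calculation, and the function name and file naming are hypothetical; only the use of *minimap2* is stated in this README. The sketch builds the command lines without running them.

```python
from pathlib import Path


def build_coverage_commands(reads, reference, out_dir) -> list:
    """Return the mapping, sorting, and depth commands as argument lists."""
    out_dir = Path(out_dir)
    stem = Path(reference).stem  # output files are named after the reference
    sam = out_dir / f"{stem}.sam"
    bam = out_dir / f"{stem}.bam"
    cov = out_dir / f"{stem}.coverage"
    return [
        # Map ONT reads and write SAM output.
        ["minimap2", "-a", "-x", "map-ont", "-o", str(sam), str(reference), str(reads)],
        # Sort alignments by coordinate into a BAM file.
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # Per-position depth, including zero-coverage positions (-a).
        ["samtools", "depth", "-a", "-o", str(cov), str(bam)],
    ]
```

Each command can be executed in order with `subprocess.run(cmd, check=True)` after the output directory has been created.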