---
title: "NanoporeMet"
output: 
  md_document:
    variant: gfm
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  comment = "#>"
)
```

# NanoporeMet
This repository contains scripts to analyze (`nanoporemet.py`)
and visualize (`app.R`, `coverage.py`) metagenomic sequencing data
generated by Oxford Nanopore Technologies sequencing devices. Both viral and bacterial
analyses are possible.

## nanoporemet.py
`nanoporemet.py` analyzes metagenomic sequencing reads with *kraken2*.
The selected *kraken2* database determines whether only viral reads or both
viral and bacterial reads are analyzed.

`nanoporemet.py` first concatenates all `.fastq.gz` files of each barcode
within `/fastq_pass`, then runs *kraken2* on each barcode individually, and finally
combines the *kraken2* output files of all barcodes into one file, either
`virus.kraken.txt` or `virus_bacteria.kraken.txt` (depending on the selected database).
If `nanoporemet.py` is run after the sequencing run has finished and the
`sequencing_summary_*.txt` file is available, a `sequencing_summary.pdf` file is
created, which shows histograms of the mean Q scores and read lengths, both for all
reads and for the reads passing the quality filter.
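
The first two steps described above can be sketched as follows. This is a minimal illustration, not the code in `nanoporemet.py`: the function names, the one-subdirectory-per-barcode layout of `fastq_pass`, the output file naming, and the thread count are all assumptions.

```python
import shutil
from pathlib import Path


def concatenate_barcodes(fastq_pass, out_dir):
    """Merge all .fastq.gz files of each barcode into one file per barcode.

    Assumes (hypothetically) that fastq_pass holds one subdirectory per barcode.
    Returns a dict mapping barcode name to the merged file path.
    """
    fastq_pass, out_dir = Path(fastq_pass), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for barcode_dir in sorted(p for p in fastq_pass.iterdir() if p.is_dir()):
        parts = sorted(barcode_dir.glob("*.fastq.gz"))
        if not parts:
            continue  # nothing sequenced for this barcode
        target = out_dir / f"{barcode_dir.name}.fastq.gz"
        with target.open("wb") as out:
            for part in parts:
                # gzip members can be appended byte for byte;
                # the result is still a valid gzip file.
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        merged[barcode_dir.name] = target
    return merged


def kraken2_command(db, reads, report, output, threads=4):
    """Build (but do not run) a kraken2 call for one merged barcode file."""
    return [
        "kraken2",
        "--db", str(db),
        "--threads", str(threads),
        "--gzip-compressed",
        "--report", str(report),
        "--output", str(output),
        str(reads),
    ]
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`, once per barcode.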


### How to run
1. Log in to timavo.

  `ssh timavo`

2. Activate the *kraken2* conda environment.

  `conda activate kraken`

3. Move into the sequencing output directory, i.e., the one that contains the
`fastq_pass` subdirectory and, at the end of the sequencing run, the
`sequencing_summary_*.txt` file.

  `cd /data/GridION/GridIONOutput/<experiment>/<sample>/<flowcell>/`

4. Run the Python script.

  `python <path to script>/nanoporemet.py`

5. The script asks whether you want to analyze bacterial reads in addition
to viral reads.

Reply with either `yes`/`y` or `no`/`n`.
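
For illustration, the reply could be interpreted roughly as below. This is a sketch, not the script's actual prompt handling: the function names are hypothetical, and the case-insensitive matching and the error on any other reply are assumptions. The two report names are the ones documented in this README.

```python
def parse_yes_no(reply):
    """Map a yes/y or no/n reply to True/False (case-insensitive by assumption)."""
    answer = reply.strip().lower()
    if answer in {"yes", "y"}:
        return True
    if answer in {"no", "n"}:
        return False
    raise ValueError(f"Please reply with yes/y or no/n, got: {reply!r}")


def report_name(include_bacteria):
    """Name of the combined kraken2 report for the chosen analysis."""
    return "virus_bacteria.kraken.txt" if include_bacteria else "virus.kraken.txt"
```

A call such as `report_name(parse_yes_no(input("Analyze bacterial reads as well? ")))` would tie the two together.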


### Input
#### Metagenomic sequencing data
Within the sequencing output directory, the script looks for the `/fastq_pass`
subdirectory and analyzes all `.fastq.gz` files.

#### *kraken2* databases
`nanoporemet.py` uses one of two *kraken2* databases to analyze the reads.
The paths to these databases are defined in the script and can easily be adjusted.
The current databases are as follows:

- viral database: `k2_human-viral_20240111`

- viral + bacterial database: `k2_human-viral_20240111`

#### Run statistics
For the creation of the histogram plots, the script looks for `sequencing_summary_*.txt`
within the sequencing output directory. If it is not available yet, this step is
simply skipped.
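
The "skip if not available" behavior can be sketched like this. The function name is hypothetical, and picking the first match when several files exist is an assumption.

```python
from pathlib import Path


def find_sequencing_summary(run_dir):
    """Return the sequencing_summary_*.txt file in run_dir, or None if absent."""
    matches = sorted(Path(run_dir).glob("sequencing_summary_*.txt"))
    # None signals the caller to skip the histogram step.
    return matches[0] if matches else None
```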


### Output
#### *kraken2* analysis
The *kraken2* report with the analysis of all barcodes is saved in the sequencing
output directory. Depending on the selected *kraken2* database, the report
is saved as `virus.kraken.txt` or `virus_bacteria.kraken.txt`.

#### Run statistics
The histogram plots of the mean Q scores and read lengths of all reads as well as
the reads passing the quality filter are all saved in `sequencing_summary.pdf`,
which is also found within the sequencing output directory.
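
As a rough sketch of the data behind these histograms, the values for all reads and for passing reads could be collected as shown below. The column names `mean_qscore_template`, `sequence_length_template`, and `passes_filtering` are assumed here and should be checked against the header of your own `sequencing_summary_*.txt`. The function name is hypothetical, and the plotting itself is left out.

```python
import csv


def read_run_statistics(summary_path):
    """Collect mean Q scores and read lengths for all reads and for passing reads.

    Assumes a tab-separated file with the columns mean_qscore_template,
    sequence_length_template and passes_filtering (TRUE/FALSE).
    """
    all_reads = {"qscore": [], "length": []}
    passed = {"qscore": [], "length": []}
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            qscore = float(row["mean_qscore_template"])
            length = int(row["sequence_length_template"])
            all_reads["qscore"].append(qscore)
            all_reads["length"].append(length)
            if row["passes_filtering"].strip().upper() == "TRUE":
                passed["qscore"].append(qscore)
                passed["length"].append(length)
    return all_reads, passed
```

The four resulting lists correspond to the four histograms: Q score and read length, each for all reads and for passing reads.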


## Shiny app
The `app.R` script is a Shiny app which serves to visualize the *kraken2* report 
as generated by `nanoporemet.py`. Simply upload `virus.kraken.txt` or
`virus_bacteria.kraken.txt` to the app, select a barcode, and choose whether you
want to analyze viral or bacterial reads, at either the species or genus level.
Endogenous retroviruses and phages as well as *blocklisted* viruses can be hidden 
from the output (the blocklist can be updated within `app.R`).

The Shiny app shows the taxonomic distribution of the reads in a barplot as well as
a list of all viral or bacterial species or genera found in the sample
(per barcode).
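
The species/genus selection and the blocklist can be illustrated outside the app with a short Python sketch. It is not a translation of `app.R`. It assumes the standard six-column, tab-separated *kraken2* report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, name) for a single barcode. How the combined file separates barcodes is not covered here. The function name and the `blocklist` parameter are hypothetical.

```python
def taxa_at_rank(report_lines, rank_code, blocklist=frozenset()):
    """List (name, clade_reads) for one rank of a kraken2 report.

    rank_code is "S" for species or "G" for genus. Names in blocklist
    are hidden. The result is sorted by read count, highest first.
    """
    taxa = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a standard report line
        clade_reads, rank, name = int(fields[1]), fields[3], fields[5].strip()
        if rank == rank_code and clade_reads > 0 and name not in blocklist:
            taxa.append((name, clade_reads))
    return sorted(taxa, key=lambda item: item[1], reverse=True)
```

The returned pairs are the kind of data a barplot and a per-barcode list of taxa could be built from.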


## coverage.py
The `coverage.py` script automates coverage plot generation for Oxford Nanopore Technologies
reads. First, it concatenates all reads within `/fastq_pass`, then it maps those
reads to a user-supplied reference sequence (indexed `.fasta` file) using *minimap2*.
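
A mapping step of this kind might look roughly like the sketch below. Only the use of *minimap2* comes from this README. The `map-ont` preset, the use of `samtools` for sorting and per-position depth, the file naming, and the function name are assumptions and may differ from what `coverage.py` does.

```python
from pathlib import Path


def mapping_commands(reference, reads, out_dir):
    """Build (but do not run) commands for mapping, sorting and depth."""
    out_dir = Path(out_dir)
    stem = Path(reference).stem  # output files are named after the reference
    sam = out_dir / f"{stem}.sam"
    bam = out_dir / f"{stem}.bam"
    return {
        # Map Nanopore reads to the reference and write SAM output.
        "map": ["minimap2", "-a", "-x", "map-ont", "-o", str(sam),
                str(reference), str(reads)],
        # Sort the alignments into a BAM file.
        "sort": ["samtools", "sort", "-o", str(bam), str(sam)],
        # Per-position depth, printed to stdout (-a includes zero-depth positions).
        "depth": ["samtools", "depth", "-a", str(bam)],
    }
```

Each list could be run with `subprocess.run(cmd, check=True)`. The stdout of the `depth` command would need to be redirected into a file to keep the per-position values.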


### How to run
1. Log in to timavo.

  `ssh timavo`

2. Activate the *minimap2* conda environment.

  `conda activate minimap2`

3. Move into the sequencing output directory, i.e., the one that contains the
`fastq_pass` subdirectory.

  `cd /data/GridION/GridIONOutput/<experiment>/<sample>/<flowcell>/`

4. Run the Python script.

  `python <path to script>/coverage.py`

5. You will be asked to enter the path to the indexed reference sequence.


### Input
#### Metagenomic sequencing data
Within the sequencing output directory, the script looks for the `/fastq_pass`
subdirectory and analyzes all `.fastq.gz` files.

#### Reference sequence
The path to the reference sequence is provided by the user upon running the script.
Make sure the reference sequence is indexed and stored in

`/analyses/ONT_analyses/bwa/references/<virus/bacteria>/<name>/`.

To index the reference `.fasta` file, move into `/analyses/ONT_analyses/bwa/` and run:

`./bwa index ./references/<virus/bacteria>/<name>/*.fasta`.
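
Before running `coverage.py`, you can check whether a reference has already been indexed. The sketch below assumes that "indexed" means the five files `bwa index` writes next to the `.fasta` file (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The function name is hypothetical.

```python
from pathlib import Path

# Files that `bwa index ref.fasta` creates next to the reference.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta):
    """Return the names of bwa index files that do not exist yet."""
    fasta = Path(fasta)
    return [
        fasta.name + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(str(fasta) + suffix).exists()
    ]
```

An empty list means all five index files are present; otherwise, run the `bwa index` command above.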


### Output
#### Coverage plot
Within the sequencing output directory, you will find a new subdirectory with the
name of the reference sequence. Next to the coverage plot (PDF), it also contains
the `.sam`, `.bam`, and `.coverage` files.