Skip to content

Commit

Permalink
initial set-up
Browse files Browse the repository at this point in the history
  • Loading branch information
Richard Stöckl committed Aug 29, 2024
1 parent 43c8176 commit e096927
Show file tree
Hide file tree
Showing 10 changed files with 503 additions and 22 deletions.
3 changes: 3 additions & 0 deletions .snakemake-workflow-catalog.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
usage:
software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
conda: true # whether pipeline works with --use-conda
56 changes: 53 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,55 @@
# QCforSeqCode
Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative.
# Snakemake workflow: `QCforSeqCode`

The requirements are outlined in https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb
Author: richard.stoeckl@ur.de

[![Snakemake](https://img.shields.io/badge/snakemake-≥8.10.0-brightgreen.svg)](https://snakemake.github.io)

## About
[Snakemake](https://snakemake.github.io) Pipeline to check the requirements for a prokaryotic assembly to be included in the [SeqCode](https://registry.seqco.de/) initiative.

The requirements are outlined in [APPENDIX I](https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb) of the SeqCode.

## Usage

**[Check out the usage instructions in the snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=richardstoeckl/QCforSeqCode)**

But here is a rough overview:
1. Install [conda](https://docs.conda.io/en/latest/miniconda.html) (mamba or miniconda is fine).
2. Install snakemake with:
```bash
conda install -c conda-forge -c bioconda snakemake
```
3. Download checkm2 database (via 'wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz')
4. Download GTDB-Tk database (via 'wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz')
3. [Download the latest release from this repo](https://github.com/richardstoeckl/basecallNanopore/releases/latest) and cd into it
4. Edit the `config/config.yaml` to provide the paths to your results/logs directories, and the paths to the databases you downloaded, as well as any parameters you might want to change.
5. Edit the `config/sampleData.csv` file with the specific details for each assembly you want to check. Depending on what you enter here, the pipeline will automatically adjust what will be done.
5. Open a terminal in the main dir and start a dry-run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step takes may take some time) and it will show you if you set up the paths correctly:

```bash
snakemake --sdm conda -n --cores
```
6. Run the pipeline with
```bash
snakemake --sdm conda --cores
```
---

## TODO and planned features
- add 16S rRNA gene truncation check
- add automatic switches for Kingdom specific modes of some tools
- automate checkm2 and gtdb-tk database downloads
- add checks if the config file and the sample file are correctly filled


## Notes on the Test data:
- `data/GCF_000007305.1_ASM730v1_genomic.fna` - This is the reference genome of Pyrococcus furiosus, which does fit the criteria of SeqCode. It was acquired from the [RefSeq database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007305.1/).
- `data/GCA_015662175.1_ASM1566217v1_genomic.fna` - This is the assembly of Thermococcus paralvinellae, which does not fit the criteria of SeqCode. It was acquired from [GenBank database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_015662175.1/)
- `data/SRR8767914_subsampled.fastq.gz` is a [DNA-Seq of Pyrococcus furiosus DSM 3638](https://www.ncbi.nlm.nih.gov/sra/SRR8767914) dataset, that was subsampled for quicker testing via `zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz`.

```
Copyright Richard Stöckl 2024.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE or copy at
https://www.boost.org/LICENSE_1_0.txt)
```
13 changes: 13 additions & 0 deletions config/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
main:
sampleData: "sampleData.csv" # "config/sampleData_tests.csv is the sample file that can be used for testing the pipeline setup"
logPath: "logs/"
interimPath: "interim/"
resultPath: "results/"

tools:
checkm2:
dbpath: /path/to/CheckM2_database/ # path to the db, downloaded with 'wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz'
gtdbtk:
dbpath: /path/to/GTDB-Tk_release220/ # path to the db, downloaded with 'wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz'
infernal:
dbpath: infernal/ # path where the Rfam 16S rRNA db will be created
3 changes: 3 additions & 0 deletions config/sampleData.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sampleID,pathToAssemblyFasta,pathToSequencingReadsFastq,comment
Pfu,"data/GCF_000007305.1_ASM730v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the reference genome of Pyrococcus furiosus, which does fit the criteria of SeqCode"
Tpa,"data/GCA_015662175.1_ASM1566217v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the assembly of Thermococcus paralvinellae, which does not fit the criteria of SeqCode"
195 changes: 177 additions & 18 deletions workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -34,10 +34,7 @@ RESULTPATH = os.path.normpath(config["main"]["resultPath"])

rule all:
input:
expand(
os.path.join(INTERIMPATH, "flags", "{sampleID}", "collectAll.flag"),
sampleID=SAMPLES,
),
os.path.join(RESULTPATH,"final_report.html"),

# Get the number of contigs (=num_seqs) , N50, and length of the largest contig (=max_len)
checkpoint seqkit_stats:
Expand All @@ -53,7 +50,7 @@ checkpoint seqkit_stats:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
shell:
"""
seqkit stats --tabular {input.assembly} > {output}
seqkit stats --all --tabular {input.assembly} > {output}
"""

# helper function to get a temporary directory with all of the assemblies in one dir, as some tools require this
Expand All @@ -65,7 +62,7 @@ rule collectAssemblies:
cols="pathToAssemblyFasta",
),
output:
assemblyCollected=os.path.join(RESULTPATH,"collected","{sampleID}.fasta"),
assemblyCollected=os.path.join(INTERIMPATH,"collected","{sampleID}.fasta"),
threads: 1
conda:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
Expand All @@ -79,9 +76,9 @@ rule checkm2:
input:
assembly=expand(rules.collectAssemblies.output, sampleID=SAMPLES)
output:
os.path.join(RESULTPATH,"checkm2","{sampleID}","quality_report.tsv")
os.path.join(RESULTPATH,"checkm2","quality_report.tsv")
log:
os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_checkm2.log"),
os.path.join(LOGPATH, "common", "logs", "checkm2.log"),
threads:
workflow.cores * 1
conda:
Expand All @@ -100,13 +97,13 @@ rule gtdbtk:
input:
assembliesCollected=expand(rules.collectAssemblies.output, sampleID=SAMPLES)
output:
directory(os.path.join(RESULTPATH,"gtdbtk","{sampleID}")),
directory(os.path.join(INTERIMPATH,"gtdbtk")),
log:
os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_gtdbtk.log"),
os.path.join(LOGPATH, "common", "logs", "gtdbtk.log"),
params:
inputDir=os.path.join(RESULTPATH,"collected"),
inputDir=lambda w, input:os.path.split(os.path.splitext(input[0])[0])[0],
dbpath=os.path.join(config["tools"]["gtdbtk"]["dbpath"]),
retries: 2
retries: 2 # this is necessary because the conda env var needs to be set
conda:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
shell:
Expand All @@ -115,6 +112,17 @@ rule gtdbtk:
gtdbtk classify_wf --cpus {threads} --skip_ani_screen --force --genome_dir {params.inputDir} --extension 'fasta' --out_dir {output} >{log} 2>&1
"""

# helper function to get all of the gtdbtk classification files, as they might be split between archaea and bacteria
rule collectGtdbtk:
input:
gtdbtk=rules.gtdbtk.output
output:
os.path.join(RESULTPATH,"gtdbtk","gtdbtk_taxonomy.tsv"),
shell:
"""
cat {input.gtdbtk}/*.summary.tsv > {output}
"""

# to compare the whole genome taxonomic classification with the 16S rRNA gene classification,
# we first need to extract the 16S rRNA gene sequences from the assemblies

Expand All @@ -127,7 +135,7 @@ rule prepare16SrRNAdb:
i1f=os.path.join(config["tools"]["infernal"]["dbpath"], "16SrRNA.cm.i1f"),
i1p=os.path.join(config["tools"]["infernal"]["dbpath"], "16SrRNA.cm.i1p"),
log:
os.path.join(config["main"]["logPath"], "prepare16SrRNAdb.log"),
os.path.join(LOGPATH, "prepare16SrRNAdb.log"),
conda:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
shadow: "copy-minimal"
Expand All @@ -146,6 +154,7 @@ def get_sum_length(sampleID):
sum_len = df["sum_len"].values[0]
return sum_len

# 16S rRNA gene prediction as implemented in bakta: https://github.com/oschwengers/bakta/blob/a7ac1c8641cf8a11888b0295c30f0b2e0b8f34fa/bakta/features/r_rna.py#L29
rule infernal:
input:
assembly=lookup(
Expand All @@ -156,11 +165,11 @@ rule infernal:
prepare16SrRNAdb=rules.prepare16SrRNAdb.output.cm,
seqkit_stats= lambda wildcards: checkpoints.seqkit_stats.get(sampleID=wildcards.sampleID).output,
output:
rRNAFasta=os.path.join(RESULTPATH,"infernal","{sampleID}_16S_rRNA.fasta"),
tblout=os.path.join(RESULTPATH,"infernal","{sampleID}.tblout"),
bed=os.path.join(RESULTPATH,"infernal","{sampleID}.bed"),
rRNAFasta=os.path.join(INTERIMPATH,"infernal","{sampleID}_16S_rRNA.fasta"),
tblout=os.path.join(INTERIMPATH,"infernal","{sampleID}.tblout"),
bed=os.path.join(INTERIMPATH,"infernal","{sampleID}.bed"),
log:
os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_infernal.log"),
os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_infernal.log"),
threads:
workflow.cores * 0.5
params:
Expand All @@ -178,5 +187,155 @@ rule infernal:
"""
cmscan --noali --cut_tc -g --fmt 2 --nohmmonly --rfam --cpu {threads} --tblout {output.tblout} {input.prepare16SrRNAdb} {input.assembly} >{log} 2>&1
awk '!/^#/ && $20 == "=" {{print}}' {output.tblout} | awk 'BEGIN{{OFS="\t"}} {{if ($12 == "-") {{print $4, $11, $10, $2, $17, $12}} else {{print $4, $10, $11, $2, $17, $12}}}}' | sort -k5,5nr | head -n 1 > {output.bed}
seqkit subseq --bed {output.bed} {input.assembly} > {output.rRNAFasta}
seqkit subseq --bed {output.bed} {input.assembly} | seqkit replace -p .+ -r "{wildcards.sampleID}" > {output.rRNAFasta}
"""

# helper function to concatenate all 16S rRNA gene sequences into one fasta file
rule concatFasta:
input:
rRNAFasta=expand(rules.infernal.output.rRNAFasta, sampleID=SAMPLES)
output:
os.path.join(INTERIMPATH,"infernal","16S_rRNA_combined.fasta"),
conda:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
shell:
"""
seqkit seq {input.rRNAFasta} > {output}
"""

# get the training data for decipher
rule downloadDecipherDB:
output:
db=os.path.join(INTERIMPATH,"DECIPHER", "SILVA_SSU_r138_2019.RData"),
log:
os.path.join(LOGPATH, "common", "logs", "downloadDecipherDB.log"),
shell:
"""
wget -O {output.db} http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData >{log} 2>&1
"""

# 16S rRNA gene taxonomic classification to compare to the whole genome classification
rule decipher:
input:
rRNAFasta=rules.concatFasta.output,
db=rules.downloadDecipherDB.output.db
output:
os.path.join(RESULTPATH,"decipher","16S_rRNA_taxonomy.tsv"),
log:
os.path.join(LOGPATH, "common", "logs", "decipher.log"),
threads:
workflow.cores * 0.25
conda:
os.path.join(workflow.basedir, "envs","r-tools.yaml"),
params:
script=os.path.join(workflow.basedir, "scripts", "decipher.R"),
shell:
"""
Rscript {params.script} {input.db} {input.rRNAFasta} {threads} {output} >{log} 2>&1
"""

# tRNA prediction
rule trnascan:
input:
assembly=lookup(
query="sampleID == '{sampleID}'",
within=sampleDF,
cols="pathToAssemblyFasta",
),
output:
os.path.join(RESULTPATH,"trnascan","{sampleID}.txt"),
log:
os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_trnascan.log"),
threads:
workflow.cores * 0.25
conda:
os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
shell:
"""
tRNAscan-SE --forceow -G -o {output} --thread {threads} --log {log} {input.assembly}
"""

# for the coverage calculation, we need to map the reads to the assembly. But first we need to decide the mapping mode of minimap2.
# So we check if the reads are sort of short with seqkit stats.
checkpoint seqkit_stats_reads:
input:
reads=lookup(
query="sampleID == '{sampleID}'",
within=sampleDF,
cols="pathToSequencingReadsFastq",
),
output:
os.path.join(INTERIMPATH,"seqkit-stats","{sampleID}_reads.tsv")
conda:
os.path.join(workflow.basedir, "envs","withReads.yaml")
shell:
"""
seqkit stats --tabular {input.reads} > {output}
"""

# helper function to get the average read length
def get_avg_read_length(sampleID):
with checkpoints.seqkit_stats_reads.get(sampleID=sampleID).output[0].open() as file:
df = pd.read_csv(file, sep="\t")
avg_len = df["avg_len"].values[0]
return avg_len

# mapping the reads to the assembly and calculating the coverage
rule minimap:
input:
assembly=lookup(
query="sampleID == '{sampleID}'",
within=sampleDF,
cols="pathToAssemblyFasta",
),
reads=lookup(
query="sampleID == '{sampleID}'",
within=sampleDF,
cols="pathToSequencingReadsFastq",
),
readStats=lambda wildcards: checkpoints.seqkit_stats_reads.get(sampleID=wildcards.sampleID).output
output:
bam=os.path.join(INTERIMPATH, "minimap", "{sampleID}.bam"),
coverage=os.path.join(INTERIMPATH, "minimap", "{sampleID}_coverage.txt"),
threads: workflow.cores * 0.5
log:
os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_minimap.log"),
conda:
os.path.join(workflow.basedir, "envs","withReads.yaml")
params:
mappingMode= lambda w: "map-ont" if get_avg_read_length(w.sampleID) > 1000 else "sr"
shell:
"""
minimap2 -t {threads} -ax {params.mappingMode} {input.assembly} {input.reads} 2>{log} | \
samtools view -@ {threads}/2 -b 2>>{log} | \
samtools sort -@ {threads}/2 > {output.bam} 2>>{log} && \
samtools index {output.bam} 2>>{log} && \
samtools coverage -d 0 {output.bam} > {output.coverage} 2>>{log}
"""

rule createReport:
input:
checkm2=rules.checkm2.output,
seqkit_stats=expand(rules.seqkit_stats.output, sampleID=SAMPLES),
gtdbtk=rules.collectGtdbtk.output,
infernal=expand(rules.infernal.output, sampleID=SAMPLES),
minimap=expand(rules.minimap.output.bam, sampleID=SAMPLES),
samtools=expand(rules.minimap.output.coverage, sampleID=SAMPLES),
trnascan=expand(rules.trnascan.output, sampleID=SAMPLES),
decipher=rules.decipher.output,
output:
report=os.path.join(RESULTPATH,"final_report.html"),
log:
os.path.join(LOGPATH, "common", "logs", "createReport.log"),
params:
script=os.path.join(workflow.basedir, "scripts", "createReport.R"),
checkm2_dir=lambda w, input:os.path.split(os.path.splitext(input.checkm2[0])[0])[0],
seqkit_stats_dir=lambda w, input:os.path.split(os.path.splitext(input.seqkit_stats[0])[0])[0],
samtools_coverage_dir=lambda w, input:os.path.split(os.path.splitext(input.samtools[0])[0])[0],
tRNAscan_dir=lambda w, input:os.path.split(os.path.splitext(input.trnascan[0])[0])[0],
conda:
os.path.join(workflow.basedir, "envs","r-tools.yaml")
shell:
"""
Rscript {params.script} {input.checkm2} {params.seqkit_stats_dir} {params.samtools_coverage_dir} {input.gtdbtk} {input.decipher} {params.tRNAscan_dir} {output} >{log} 2>&1
"""
3 changes: 2 additions & 1 deletion workflow/envs/assemblyOnly.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,5 @@ dependencies:
- bioconda::checkm2==1.0.1
- bioconda::seqkit==2.8.2
- bioconda::gtdbtk==2.4.0
- bioconda::infernal==1.1.5
- bioconda::infernal==1.1.5
- bioconda::trnascan-se==2.0.12
11 changes: 11 additions & 0 deletions workflow/envs/r-tools.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: r-tools
channels:
- conda-forge
- bioconda
dependencies:
- conda-forge::r-tidyverse=2.0.0
- conda-forge::r-base=4.3.3
- conda-forge::r-fs==1.6.4
- conda-forge::r-tinytable==0.4.0
- conda-forge::r-markdown==1.13
- bioconda::bioconductor-decipher==2.30.0
8 changes: 8 additions & 0 deletions workflow/envs/withReads.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
name: main
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::minimap2==2.28
- bioconda::seqkit==2.8.2
- bioconda::samtools==1.20
Loading

0 comments on commit e096927

Please sign in to comment.