initial set-up

richardstoeckl · Aug 29, 2024 · e096927 · e096927
1 parent 43c8176
commit e096927
Show file tree

Hide file tree

Showing 10 changed files with 503 additions and 22 deletions.
diff --git a/.snakemake-workflow-catalog.yml b/.snakemake-workflow-catalog.yml
@@ -0,0 +1,3 @@
+usage:
+  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
+    conda: true # whether pipeline works with --use-conda
diff --git a/README.md b/README.md
@@ -1,5 +1,55 @@
-# QCforSeqCode
-Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative.
+# Snakemake workflow: `QCforSeqCode`
 
-The requirements are outlined in https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb
+Author: richard.stoeckl@ur.de
 
+[![Snakemake](https://img.shields.io/badge/snakemake-≥8.10.0-brightgreen.svg)](https://snakemake.github.io)
+
+## About
+[Snakemake](https://snakemake.github.io) Pipeline to check the requirements for a prokaryotic assembly to be included in the [SeqCode](https://registry.seqco.de/) initiative.
+
+The requirements are outlined in [APPENDIX I](https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb) of the SeqCode.
+
+## Usage
+
+**[Check out the usage instructions in the snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=richardstoeckl/QCforSeqCode)**
+
+But here is a rough overview:
+1. Install [conda](https://docs.conda.io/en/latest/miniconda.html) (mamba or miniconda is fine).
+2. Install snakemake with:
+```bash
+conda install -c conda-forge -c bioconda snakemake
+```
+3. Download checkm2 database (via 'wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz')
+4. Download GTDB-Tk database (via 'wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz')
+3. [Download the latest release from this repo](https://github.com/richardstoeckl/basecallNanopore/releases/latest) and cd into it
+4. Edit the `config/config.yaml` to provide the paths to your results/logs directories, and the paths to the databases you downloaded, as well as any parameters you might want to change.
+5. Edit the `config/sampleData.csv` file with the specific details for each assembly you want to check. Depending on what you enter here, the pipeline will automatically adjust what will be done.
+5. Open a terminal in the main dir and start a dry-run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step takes may take some time) and it will show you if you set up the paths correctly:
+
+```bash
+snakemake --sdm conda -n --cores
+```
+6. Run the pipeline with
+```bash
+snakemake --sdm conda --cores
+```
+---
+
+## TODO and planned features
+- add 16S rRNA gene truncation check
+- add automatic switches for Kingdom specific modes of some tools
+- automate checkm2 and gtdb-tk database downloads
+- add checks if the config file and the sample file are correctly filled
+
+
+## Notes on the Test data:
+- `data/GCF_000007305.1_ASM730v1_genomic.fna` - This is the reference genome of Pyrococcus furiosus, which does fit the criteria of SeqCode. It was acquired from the [RefSeq database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007305.1/).
+- `data/GCA_015662175.1_ASM1566217v1_genomic.fna` - This is the assembly of Thermococcus paralvinellae, which does not fit the criteria of SeqCode. It was acquired from [GenBank database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_015662175.1/)
+- `data/SRR8767914_subsampled.fastq.gz` is a [DNA-Seq of Pyrococcus furiosus DSM 3638](https://www.ncbi.nlm.nih.gov/sra/SRR8767914) dataset, that was subsampled for quicker testing via `zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz`.
+
+```
+Copyright Richard Stöckl 2024.
+Distributed under the Boost Software License, Version 1.0.
+(See accompanying file LICENSE or copy at 
+https://www.boost.org/LICENSE_1_0.txt)
+```
diff --git a/config/config.yaml b/config/config.yaml
@@ -0,0 +1,13 @@
+main:
+    sampleData: "sampleData.csv" # "config/sampleData_tests.csv is the sample file that can be used for testing the pipeline setup"
+    logPath: "logs/"
+    interimPath: "interim/"
+    resultPath: "results/"
+
+tools:
+    checkm2:
+        dbpath: /path/to/CheckM2_database/ # path to the db, downloaded with 'wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz'
+    gtdbtk:
+        dbpath: /path/to/GTDB-Tk_release220/  # path to the db, downloaded with 'wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz'
+    infernal:
+        dbpath: infernal/ # path where the Rfam 16S rRNA db will be created
diff --git a/config/sampleData.csv b/config/sampleData.csv
@@ -0,0 +1,3 @@
+sampleID,pathToAssemblyFasta,pathToSequencingReadsFastq,comment
+Pfu,"data/GCF_000007305.1_ASM730v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the reference genome of Pyrococcus furiosus, which does fit the criteria of SeqCode"
+Tpa,"data/GCA_015662175.1_ASM1566217v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the assembly of Thermococcus paralvinellae, which does not fit the criteria of SeqCode"
diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -34,10 +34,7 @@ RESULTPATH = os.path.normpath(config["main"]["resultPath"])
 
 rule all:
     input:
-        expand(
-            os.path.join(INTERIMPATH, "flags", "{sampleID}", "collectAll.flag"),
-            sampleID=SAMPLES,
-        ),
+        os.path.join(RESULTPATH,"final_report.html"),
 
 # Get the number of contigs (=num_seqs) , N50, and length of the largest contig (=max_len)
 checkpoint seqkit_stats:
@@ -53,7 +50,7 @@ checkpoint seqkit_stats:
         os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
     shell:
         """
-        seqkit stats --tabular {input.assembly} > {output}
+        seqkit stats --all --tabular {input.assembly} > {output}
         """
 
 # helper function to get a temporary directory with all of the assemblies in one dir, as some tools require this
@@ -65,7 +62,7 @@ rule collectAssemblies:
             cols="pathToAssemblyFasta",
         ),
     output:
-        assemblyCollected=os.path.join(RESULTPATH,"collected","{sampleID}.fasta"),
+        assemblyCollected=os.path.join(INTERIMPATH,"collected","{sampleID}.fasta"),
     threads: 1
     conda:
         os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
@@ -79,9 +76,9 @@ rule checkm2:
     input:
         assembly=expand(rules.collectAssemblies.output, sampleID=SAMPLES)
     output:
-        os.path.join(RESULTPATH,"checkm2","{sampleID}","quality_report.tsv")
+        os.path.join(RESULTPATH,"checkm2","quality_report.tsv")
     log:
-        os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_checkm2.log"),
+        os.path.join(LOGPATH, "common", "logs", "checkm2.log"),
     threads:
         workflow.cores * 1
     conda:
@@ -100,13 +97,13 @@ rule gtdbtk:
     input:
         assembliesCollected=expand(rules.collectAssemblies.output, sampleID=SAMPLES)
     output:
-        directory(os.path.join(RESULTPATH,"gtdbtk","{sampleID}")),
+        directory(os.path.join(INTERIMPATH,"gtdbtk")),
     log:
-        os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_gtdbtk.log"),
+        os.path.join(LOGPATH, "common", "logs", "gtdbtk.log"),
     params:
-        inputDir=os.path.join(RESULTPATH,"collected"),
+        inputDir=lambda w, input:os.path.split(os.path.splitext(input[0])[0])[0],
         dbpath=os.path.join(config["tools"]["gtdbtk"]["dbpath"]),
-    retries: 2
+    retries: 2 # this is necessary because the conda env var needs to be set
     conda:
         os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
     shell:
@@ -115,6 +112,17 @@ rule gtdbtk:
         gtdbtk classify_wf --cpus {threads} --skip_ani_screen --force --genome_dir {params.inputDir} --extension 'fasta' --out_dir {output} >{log} 2>&1
         """
 
+# helper function to get all of the gtdbtk classification files, as they might be split between archaea and bacteria
+rule collectGtdbtk:
+    input:
+        gtdbtk=rules.gtdbtk.output
+    output:
+        os.path.join(RESULTPATH,"gtdbtk","gtdbtk_taxonomy.tsv"),
+    shell:
+        """
+        cat {input.gtdbtk}/*.summary.tsv > {output}
+        """
+
 # to compare the whole genome taxonomic classification with the 16S rRNA gene classification,
 # we first need to extract the 16S rRNA gene sequences from the assemblies
 
@@ -127,7 +135,7 @@ rule prepare16SrRNAdb:
         i1f=os.path.join(config["tools"]["infernal"]["dbpath"], "16SrRNA.cm.i1f"),
         i1p=os.path.join(config["tools"]["infernal"]["dbpath"], "16SrRNA.cm.i1p"),
     log:
-        os.path.join(config["main"]["logPath"], "prepare16SrRNAdb.log"),
+        os.path.join(LOGPATH, "prepare16SrRNAdb.log"),
     conda:
         os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
     shadow: "copy-minimal"
@@ -146,6 +154,7 @@ def get_sum_length(sampleID):
         sum_len = df["sum_len"].values[0]
     return sum_len
 
+# 16S rRNA gene prediction as implemented in bakta: https://github.com/oschwengers/bakta/blob/a7ac1c8641cf8a11888b0295c30f0b2e0b8f34fa/bakta/features/r_rna.py#L29
 rule infernal:
     input:
         assembly=lookup(
@@ -156,11 +165,11 @@ rule infernal:
         prepare16SrRNAdb=rules.prepare16SrRNAdb.output.cm,
         seqkit_stats= lambda wildcards: checkpoints.seqkit_stats.get(sampleID=wildcards.sampleID).output,
     output:
-        rRNAFasta=os.path.join(RESULTPATH,"infernal","{sampleID}_16S_rRNA.fasta"),
-        tblout=os.path.join(RESULTPATH,"infernal","{sampleID}.tblout"),
-        bed=os.path.join(RESULTPATH,"infernal","{sampleID}.bed"),
+        rRNAFasta=os.path.join(INTERIMPATH,"infernal","{sampleID}_16S_rRNA.fasta"),
+        tblout=os.path.join(INTERIMPATH,"infernal","{sampleID}.tblout"),
+        bed=os.path.join(INTERIMPATH,"infernal","{sampleID}.bed"),
     log:
-        os.path.join(config["main"]["logPath"], "{sampleID}", "logs", "{sampleID}_infernal.log"),
+        os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_infernal.log"),
     threads:
         workflow.cores * 0.5
     params:
@@ -178,5 +187,155 @@ rule infernal:
         """
         cmscan --noali --cut_tc -g --fmt 2 --nohmmonly --rfam --cpu {threads} --tblout {output.tblout} {input.prepare16SrRNAdb} {input.assembly} >{log} 2>&1
         awk '!/^#/ && $20 == "=" {{print}}' {output.tblout}  | awk 'BEGIN{{OFS="\t"}} {{if ($12 == "-") {{print $4, $11, $10, $2, $17, $12}} else {{print $4, $10, $11, $2, $17, $12}}}}' | sort -k5,5nr | head -n 1 > {output.bed}
-        seqkit subseq --bed {output.bed} {input.assembly} > {output.rRNAFasta}
+        seqkit subseq --bed {output.bed} {input.assembly} | seqkit replace -p .+ -r "{wildcards.sampleID}" > {output.rRNAFasta}
+        """
+
+# helper function to concatenate all 16S rRNA gene sequences into one fasta file
+rule concatFasta:
+    input:
+        rRNAFasta=expand(rules.infernal.output.rRNAFasta, sampleID=SAMPLES)
+    output:
+        os.path.join(INTERIMPATH,"infernal","16S_rRNA_combined.fasta"),
+    conda:
+        os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
+    shell:
+        """
+        seqkit seq {input.rRNAFasta} > {output}
+        """
+
+# get the training data for decipher
+rule downloadDecipherDB:
+    output:
+        db=os.path.join(INTERIMPATH,"DECIPHER", "SILVA_SSU_r138_2019.RData"),
+    log:
+        os.path.join(LOGPATH, "common", "logs", "downloadDecipherDB.log"),
+    shell:
+        """
+        wget -O {output.db} http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData >{log} 2>&1
+        """
+
+# 16S rRNA gene taxonomic classification to compare to the whole genome classification
+rule decipher:
+    input:
+        rRNAFasta=rules.concatFasta.output,
+        db=rules.downloadDecipherDB.output.db
+    output:
+        os.path.join(RESULTPATH,"decipher","16S_rRNA_taxonomy.tsv"),
+    log:
+        os.path.join(LOGPATH, "common", "logs", "decipher.log"),
+    threads:
+        workflow.cores * 0.25
+    conda:
+        os.path.join(workflow.basedir, "envs","r-tools.yaml"),
+    params:
+        script=os.path.join(workflow.basedir, "scripts", "decipher.R"),
+    shell:
+        """
+        Rscript {params.script} {input.db} {input.rRNAFasta} {threads} {output} >{log} 2>&1
+        """
+
+# tRNA prediction
+rule trnascan:
+    input:
+        assembly=lookup(
+            query="sampleID == '{sampleID}'",
+            within=sampleDF,
+            cols="pathToAssemblyFasta",
+        ),
+    output:
+        os.path.join(RESULTPATH,"trnascan","{sampleID}.txt"),
+    log:
+        os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_trnascan.log"),
+    threads:
+        workflow.cores * 0.25
+    conda:
+        os.path.join(workflow.basedir, "envs","assemblyOnly.yaml")
+    shell:
+        """
+        tRNAscan-SE --forceow -G -o {output} --thread {threads} --log {log} {input.assembly}
+        """
+
+# for the coverage calculation, we need to map the reads to the assembly. But first we need to decide the mapping mode of minimap2.
+# So we check if the reads are sort of short with seqkit stats.
+checkpoint seqkit_stats_reads:
+    input:
+        reads=lookup(
+            query="sampleID == '{sampleID}'",
+            within=sampleDF,
+            cols="pathToSequencingReadsFastq",
+        ),
+    output:
+        os.path.join(INTERIMPATH,"seqkit-stats","{sampleID}_reads.tsv")
+    conda:
+        os.path.join(workflow.basedir, "envs","withReads.yaml")
+    shell:
+        """
+        seqkit stats --tabular {input.reads} > {output}
+        """
+
+# helper function to get the average read length
+def get_avg_read_length(sampleID):
+    with checkpoints.seqkit_stats_reads.get(sampleID=sampleID).output[0].open() as file:
+        df = pd.read_csv(file, sep="\t")
+        avg_len = df["avg_len"].values[0]
+    return avg_len
+
+# mapping the reads to the assembly and calculating the coverage
+rule minimap:
+    input:
+        assembly=lookup(
+            query="sampleID == '{sampleID}'",
+            within=sampleDF,
+            cols="pathToAssemblyFasta",
+        ),
+        reads=lookup(
+            query="sampleID == '{sampleID}'",
+            within=sampleDF,
+            cols="pathToSequencingReadsFastq",
+        ),
+        readStats=lambda wildcards: checkpoints.seqkit_stats_reads.get(sampleID=wildcards.sampleID).output
+    output:
+        bam=os.path.join(INTERIMPATH, "minimap", "{sampleID}.bam"),
+        coverage=os.path.join(INTERIMPATH, "minimap", "{sampleID}_coverage.txt"),
+    threads: workflow.cores * 0.5
+    log:
+        os.path.join(LOGPATH, "{sampleID}", "logs", "{sampleID}_minimap.log"),
+    conda:
+        os.path.join(workflow.basedir, "envs","withReads.yaml")
+    params:
+        mappingMode= lambda w: "map-ont" if get_avg_read_length(w.sampleID) > 1000 else "sr"
+    shell:
+        """
+        minimap2 -t {threads} -ax {params.mappingMode} {input.assembly} {input.reads} 2>{log} | \
+        samtools view -@ {threads}/2 -b 2>>{log} | \
+        samtools sort -@ {threads}/2 > {output.bam} 2>>{log} && \
+        samtools index {output.bam} 2>>{log} && \
+        samtools coverage -d 0 {output.bam} > {output.coverage} 2>>{log}
+        """
+
+rule createReport:
+    input:
+        checkm2=rules.checkm2.output,
+        seqkit_stats=expand(rules.seqkit_stats.output, sampleID=SAMPLES),
+        gtdbtk=rules.collectGtdbtk.output,
+        infernal=expand(rules.infernal.output, sampleID=SAMPLES),
+        minimap=expand(rules.minimap.output.bam, sampleID=SAMPLES),
+        samtools=expand(rules.minimap.output.coverage, sampleID=SAMPLES),
+        trnascan=expand(rules.trnascan.output, sampleID=SAMPLES),
+        decipher=rules.decipher.output,
+    output:
+        report=os.path.join(RESULTPATH,"final_report.html"),
+    log:
+        os.path.join(LOGPATH, "common", "logs", "createReport.log"),
+    params:
+        script=os.path.join(workflow.basedir, "scripts", "createReport.R"),
+        checkm2_dir=lambda w, input:os.path.split(os.path.splitext(input.checkm2[0])[0])[0],
+        seqkit_stats_dir=lambda w, input:os.path.split(os.path.splitext(input.seqkit_stats[0])[0])[0],
+        samtools_coverage_dir=lambda w, input:os.path.split(os.path.splitext(input.samtools[0])[0])[0],
+        tRNAscan_dir=lambda w, input:os.path.split(os.path.splitext(input.trnascan[0])[0])[0],
+    conda:
+        os.path.join(workflow.basedir, "envs","r-tools.yaml")
+    shell:
         """
+        Rscript {params.script} {input.checkm2} {params.seqkit_stats_dir} {params.samtools_coverage_dir} {input.gtdbtk} {input.decipher} {params.tRNAscan_dir} {output} >{log} 2>&1
+        """
diff --git a/workflow/envs/assemblyOnly.yaml b/workflow/envs/assemblyOnly.yaml
@@ -6,4 +6,5 @@ dependencies:
   - bioconda::checkm2==1.0.1
   - bioconda::seqkit==2.8.2
   - bioconda::gtdbtk==2.4.0
-  - bioconda::infernal==1.1.5 
+  - bioconda::infernal==1.1.5
+  - bioconda::trnascan-se==2.0.12
diff --git a/workflow/envs/r-tools.yaml b/workflow/envs/r-tools.yaml
@@ -0,0 +1,11 @@
+name: r-tools
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - conda-forge::r-tidyverse=2.0.0
+  - conda-forge::r-base=4.3.3
+  - conda-forge::r-fs==1.6.4
+  - conda-forge::r-tinytable==0.4.0
+  - conda-forge::r-markdown==1.13
+  - bioconda::bioconductor-decipher==2.30.0
diff --git a/workflow/envs/withReads.yaml b/workflow/envs/withReads.yaml
@@ -0,0 +1,8 @@
+name: main
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - bioconda::minimap2==2.28
+  - bioconda::seqkit==2.8.2
+  - bioconda::samtools==1.20