Richard Stöckl committed Aug 29, 2024
1 parent 43c8176, commit e096927
Showing 10 changed files with 503 additions and 22 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
usage: | ||
  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda) | ||
    conda: true # whether pipeline works with --use-conda |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,55 @@ | ||
# QCforSeqCode | ||
Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative. | ||
# Snakemake workflow: `QCforSeqCode` | ||
|
||
The requirements are outlined in https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb | ||
Author: richard.stoeckl@ur.de | ||
|
||
[Snakemake](https://snakemake.github.io) | ||
|
||
## About | ||
[Snakemake](https://snakemake.github.io) Pipeline to check the requirements for a prokaryotic assembly to be included in the [SeqCode](https://registry.seqco.de/) initiative. | ||
|
||
The requirements are outlined in [APPENDIX I](https://registry.seqco.de/page/seqcode#data-quality-necessary-for-completion-of-seqcode-registryb) of the SeqCode. | ||
|
||
## Usage | ||
|
||
**[Check out the usage instructions in the snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=richardstoeckl/QCforSeqCode)** | ||
|
||
But here is a rough overview: | ||
1. Install [conda](https://docs.conda.io/en/latest/miniconda.html) (mamba or miniconda is fine). | ||
2. Install snakemake with: | ||
```bash | ||
conda install -c conda-forge -c bioconda snakemake | ||
``` | ||
3. Download the CheckM2 database (via `wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz`). | ||
4. Download the GTDB-Tk database (via `wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz`). | ||
5. [Download the latest release from this repo](https://github.com/richardstoeckl/basecallNanopore/releases/latest) and `cd` into it. | ||
6. Edit `config/config.yaml` to provide the paths to your results and logs directories, the paths to the databases you downloaded, and any parameters you want to change. | ||
7. Edit `config/sampleData.csv` with the details for each assembly you want to check. The pipeline adjusts which steps it runs based on what you enter here. | ||
8. Open a terminal in the main directory and start a dry run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step may take some time) and it will show you whether you set up the paths correctly: | ||
|
||
```bash | ||
snakemake --sdm conda -n --cores | ||
``` | ||
9. Run the pipeline with | ||
```bash | ||
snakemake --sdm conda --cores | ||
``` | ||
--- | ||
|
||
## TODO and planned features | ||
- add 16S rRNA gene truncation check | ||
- add automatic switches for kingdom-specific modes of some tools | ||
- automate CheckM2 and GTDB-Tk database downloads | ||
- add checks that the config file and the sample file are filled in correctly | ||
|
||
|
||
## Notes on the test data | ||
- `data/GCF_000007305.1_ASM730v1_genomic.fna` - This is the reference genome of Pyrococcus furiosus, which meets the SeqCode criteria. It was acquired from the [RefSeq database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007305.1/). | ||
- `data/GCA_015662175.1_ASM1566217v1_genomic.fna` - This is the assembly of Thermococcus paralvinellae, which does not meet the SeqCode criteria. It was acquired from the [GenBank database](https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_015662175.1/). | ||
- `data/SRR8767914_subsampled.fastq.gz` - This is a [DNA-Seq of Pyrococcus furiosus DSM 3638](https://www.ncbi.nlm.nih.gov/sra/SRR8767914) dataset that was subsampled for quicker testing via `zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz`. | ||
|
||
``` | ||
Copyright Richard Stöckl 2024. | ||
Distributed under the Boost Software License, Version 1.0. | ||
(See accompanying file LICENSE or copy at | ||
https://www.boost.org/LICENSE_1_0.txt) | ||
``` |
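The test-data note in the README describes subsampling reads at a proportion of 0.1 with a fixed random seed. The idea can be sketched in Python as follows. This is only an illustration of proportion-based subsampling, not seqkit's implementation, and it will not select the same reads that seqkit does; the function names `fastq_records` and `subsample` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group FASTQ lines into 4-line records (header, sequence, '+', qualities)."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield record
            record = []


def subsample(records: Iterable[List[str]], proportion: float, seed: int) -> Iterator[List[str]]:
    """Keep each record independently with probability `proportion`."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    for record in records:
        if rng.random() < proportion:
            yield record


if __name__ == "__main__":
    # Hypothetical in-memory reads, used only to demonstrate the call.
    demo_lines: List[str] = []
    for i in range(1000):
        demo_lines += [f"@read{i}", "ACGT", "+", "IIII"]
    kept = list(subsample(fastq_records(demo_lines), proportion=0.1, seed=42))
    print(f"kept {len(kept)} of 1000 reads")
```

Because every read is kept or dropped independently, the output size is only approximately 10% of the input, and rerunning with the same seed on the same input yields the same subset.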
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
main: | ||
  sampleData: "sampleData.csv" # "config/sampleData_tests.csv" is the sample file that can be used for testing the pipeline setup | ||
  logPath: "logs/" | ||
  interimPath: "interim/" | ||
  resultPath: "results/" | ||
|
||
tools: | ||
  checkm2: | ||
    dbpath: /path/to/CheckM2_database/ # path to the db, downloaded with 'wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz' | ||
  gtdbtk: | ||
    dbpath: /path/to/GTDB-Tk_release220/ # path to the db, downloaded with 'wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz' | ||
  infernal: | ||
    dbpath: infernal/ # path where the Rfam 16S rRNA db will be created |
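The README lists "checks if the config file is correctly filled" as a planned feature. A minimal sketch of what such a check might look like is given here, assuming the config has already been loaded into a Python dictionary. The key names come from the config file above; the function `find_config_problems` and the rule that flags `/path/to/` placeholders are assumptions for illustration, not part of the pipeline.

```python
from typing import Any, Dict, List

# Section and key names as they appear in the config file shown above.
REQUIRED_KEYS = {
    "main": ["sampleData", "logPath", "interimPath", "resultPath"],
    "tools": ["checkm2", "gtdbtk", "infernal"],
}


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: List[str] = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing key: {section}.{key}")

    tools = config.get("tools")
    if isinstance(tools, dict):
        for tool in REQUIRED_KEYS["tools"]:
            if tool not in tools:
                continue  # already reported by the loop above
            entry = tools[tool]
            dbpath = entry.get("dbpath") if isinstance(entry, dict) else None
            if not dbpath:
                problems.append(f"missing key: tools.{tool}.dbpath")
            elif str(dbpath).startswith("/path/to/"):
                # Assumption: a leading "/path/to/" means the template was not edited.
                problems.append(f"placeholder path left in tools.{tool}.dbpath")
    return problems


if __name__ == "__main__":
    template = {
        "main": {
            "sampleData": "sampleData.csv",
            "logPath": "logs/",
            "interimPath": "interim/",
            "resultPath": "results/",
        },
        "tools": {
            "checkm2": {"dbpath": "/path/to/CheckM2_database/"},
            "gtdbtk": {"dbpath": "/path/to/GTDB-Tk_release220/"},
            "infernal": {"dbpath": "infernal/"},
        },
    }
    for problem in find_config_problems(template):
        print(problem)
```

Run against the unedited template, the sketch flags the two database paths that still carry the placeholder prefix, which is the situation step 6 of the README asks users to fix.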
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
sampleID,pathToAssemblyFasta,pathToSequencingReadsFastq,comment | ||
Pfu,"data/GCF_000007305.1_ASM730v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the reference genome of Pyrococcus furiosus, which does fit the criteria of SeqCode" | ||
Tpa,"data/GCA_015662175.1_ASM1566217v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","This is the assembly of Thermococcus paralvinellae, which does not fit the criteria of SeqCode" |
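The sample sheet above has four columns, and the README lists a check of the sample file as a planned feature. The following Python sketch shows one way such a check could work using only the standard library. The column names are taken from the header row above; the function `find_sample_sheet_problems` is hypothetical, and treating the assembly path as required while leaving the reads path optional is an assumption, not documented pipeline behavior.

```python
import csv
import io
from typing import List

# Header row as it appears in the sample sheet shown above.
EXPECTED_COLUMNS = ["sampleID", "pathToAssemblyFasta", "pathToSequencingReadsFastq", "comment"]


def find_sample_sheet_problems(csv_text: str) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems: List[str] = []
    seen_ids = set()
    for row in reader:
        line_number = reader.line_num  # line of the row that was just read
        sample_id = (row["sampleID"] or "").strip()
        if not sample_id:
            problems.append(f"line {line_number}: empty sampleID")
        elif sample_id in seen_ids:
            problems.append(f"line {line_number}: duplicate sampleID {sample_id}")
        seen_ids.add(sample_id)
        # Assumption: every sample needs an assembly; reads are treated as optional.
        if not (row["pathToAssemblyFasta"] or "").strip():
            problems.append(f"line {line_number}: empty pathToAssemblyFasta")
    return problems


if __name__ == "__main__":
    sheet = (
        "sampleID,pathToAssemblyFasta,pathToSequencingReadsFastq,comment\n"
        'Pfu,"data/GCF_000007305.1_ASM730v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","reference"\n'
        'Tpa,"data/GCA_015662175.1_ASM1566217v1_genomic.fna","data/SRR8767914_subsampled.fastq.gz","assembly"\n'
    )
    print(find_sample_sheet_problems(sheet))
```

The sketch does not check that the listed files exist on disk; a real check would likely add that once paths are resolved relative to the working directory.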
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
name: r-tools | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
dependencies: | ||
- conda-forge::r-tidyverse=2.0.0 | ||
- conda-forge::r-base=4.3.3 | ||
- conda-forge::r-fs==1.6.4 | ||
- conda-forge::r-tinytable==0.4.0 | ||
- conda-forge::r-markdown==1.13 | ||
- bioconda::bioconductor-decipher==2.30.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
name: main | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
dependencies: | ||
- bioconda::minimap2==2.28 | ||
- bioconda::seqkit==2.8.2 | ||
- bioconda::samtools==1.20 |