# 07 Advanced Configurations
You can add an optional parameter, `prokka-db`, which refers to the location of a `.csv` file containing a list of your custom reference genomes for Prokka annotation:

```yaml
projects:
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv
```
The file `prokka-db.csv` should contain a list of high-quality annotated genomes that you would like to use to prioritize Prokka annotations.

An example `prokka-db.csv` for the Actinomycete group:

| Accession | Strain Description |
|---|---|
| GCA_000203835.1 | Streptomyces coelicolor A3(2) |
| GCA_000196835.1 | Amycolatopsis mediterranei U32 |
The workflow will prioritize user-provided taxonomic placement if you add the optional parameter `gtdb-tax`, which refers to a GTDB-Tk summary file; only the "user_genome" and "classification" columns are required.

An example `gtdbtk.bac120.summary.tsv`:

| user_genome | classification |
|---|---|
| P8-2B-3.1 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces albidoflavus |
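As with `prokka-db`, this parameter can be set per project in the global config. The following is a minimal sketch, assuming `gtdb-tax` is placed alongside the other project entries; the file path is illustrative:

```yaml
projects:
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv
    gtdb-tax: config/gtdbtk.bac120.summary.tsv  # illustrative path; assumed per-project placement like prokka-db
```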
If these are not provided, the workflow will use the `closest_placement_reference` column in the sample file (see above). Note that the value must be a valid genome accession in the latest GTDB release (currently R202); otherwise, it will raise an error.

If this information is not provided either, the workflow will guess the taxonomic placement as follows:

- If the `source` is `ncbi`, it will try to find the accession via the GTDB API.
- If no information is found, it will use the `genus` value and look up the parent taxonomy via the GTDB API, which results in _genus_ sp. preceded by the matching parent taxonomy.
- If neither option finds any taxonomic information, the workflow will return empty taxonomic values.
You can have multiple projects running by starting a new entry of project information with "`-`":

```yaml
projects:
  # Project 1
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv
  # Project 2
  - name: example_2
    samples: config/samples_2.csv
```

Note that each project must have a unique `name` and `samples` value.
By default, the resources folder containing software and database dependencies is stored in the `resources/` directory. If you already have these resources somewhere else on your local machine, you can tell the workflow about their locations:

```yaml
resources_path:
  antismash_db: $HOME/your_local_directory/antismash_db
  eggnog_db: $HOME/your_local_directory/eggnog_db
  BiG-SCAPE: $HOME/your_local_directory/BiG-SCAPE
```
Here you can find the rule keywords that you can run within BGCFlow.

| Keywords | Description | Links |
|---|---|---|
| seqfu | Returns contig statistics of the genomes | SeqFu |
| mlst | Returns genome classification within multi-locus sequence types (STs) | mlst |
| refseq_masher | Identify the closest 10 NCBI RefSeq genomes | RefSeq Masher |
| mash | Calculate genomic distance using MASH | MASH |
| fastani | Calculate nucleotide distance using fastANI | fastANI |
| checkm | Assess genome quality | CheckM |
| gtdbtk | Identify taxonomy of genomes using GTDB-toolkit | GTDBTk |
| prokka-gbk | Returns annotated .gbk files | Prokka |
| diamond | Create DIAMOND database for alignment | DIAMOND |
| antismash-summary | Summary of BGC statistics | antiSMASH |
| antismash-zip | Returns zipped antiSMASH result | antiSMASH |
| query_bigslice | Query BGCs with BiG-FAM db* | BiG-SLICE |
| bigscape | Build Gene Cluster Families with BiG-SCAPE | BiG-SCAPE |
| bigslice | Build Gene Cluster Families with BiG-SLICE | BiG-SLICE |
| automlst_wrapper | Build a phylogenomic tree with the autoMLST wrapper | autoMLST-wrapper, autoMLST |
| roary | Build pangenome | Roary |
| eggnog | Functional annotation with EggNOG-mapper | EggNOG-mapper |
| deeptfactor | Prediction of transcription factors with DeepTFactor | DeepTFactor |
| roary++ | Apply multiple tools together with the Roary pangenome (diamond, automlst_wrapper, eggnog, deeptfactor) | Roary |
| cblaster-genome | Generate cblaster databases for genomes in the project | cblaster |
| cblaster-bgcs | Generate cblaster databases for BGCs in the project | cblaster |
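As a minimal sketch of how these keywords are typically selected, assuming rule keywords are toggled with `TRUE`/`FALSE` under a `rules:` block in the global `config.yaml` (check the config template shipped with your BGCFlow version for the exact structure):

```yaml
rules:                    # assumed top-level block in the global config.yaml
  seqfu: TRUE             # run this rule for the project
  checkm: TRUE
  antismash-summary: TRUE
  bigscape: FALSE         # skip this rule
```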
When using different machines, you can, for example, adapt the number of threads required for each rule using a Snakemake profile. An example is given in `config/examples/_profile_example/config.yaml`:

```yaml
set-threads:
  - antismash=4
  - arts=4
  - bigscape=32
  - bigslice=16
```

You can run a Snakemake job with the above profile with:

```bash
snakemake --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameter "-n" for the actual run
```

Or also with a defined config file:

```bash
snakemake --configfile config/examples/_config_example.yaml --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameter "-n" for the actual run
```
From BGCFlow version `0.7.0`, users can switch between antiSMASH `6.1.1` and `7.0.0` by changing the `version` parameter in the global `config.yaml` file:

```yaml
rule_parameters:
  antismash:
    version: 7 # valid versions: 6, 7
```
From BGCFlow version `0.7.0`, users can switch between GTDB releases by changing the parameters in the global `config.yaml` file:

```yaml
rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"
```

Check for valid release versions at https://data.gtdb.ecogenomic.org/releases/.

The `release` and `release_version` values refer to these examples:

- `release` `214.1`, `release_version` `r214` --> https://data.gtdb.ecogenomic.org/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
- `release` `207`, `release_version` `r207_v2` --> https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
From BGCFlow version `0.7.0`, users can turn on the GTDB-Tk `ani_screen` step by setting the parameter in the global `config.yaml` file:

```yaml
rule_parameters:
  gtdbtk:
    ani_screen: TRUE
```
Set `use_gtdb_api` to `FALSE` to enter offline mode:

```yaml
rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"
  gtdbtk:
    ani_screen: FALSE
  antismash:
    version: "7" # valid versions: 6, 7
use_gtdb_api: FALSE # use GTDB API to get taxonomy information
```
This can be combined with giving the project a custom taxonomic assignment from https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv.

There is a script to reformat this table for BGCFlow. Example usage:

```bash
python workflow/scripts/grab_gtdb_tax_table.py --url "https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv" --outfile config/Lactobacillus_delbrueckii/bac120_taxonomy_r214.tsv
```
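The reformatted table can then be wired into the project configuration. The following is a minimal sketch, assuming the script's output file is used as the project's `gtdb-tax` input (the project name and paths here are illustrative):

```yaml
projects:
  - name: Lactobacillus_delbrueckii                                       # illustrative project name
    samples: config/Lactobacillus_delbrueckii/samples.csv                 # illustrative path
    gtdb-tax: config/Lactobacillus_delbrueckii/bac120_taxonomy_r214.tsv   # output of the script above; assumed usage
```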