07 Advanced Configurations

Further configuration

Custom Prokka database

You can add an optional parameter, prokka-db, that points to a .csv file listing your custom reference genomes for Prokka annotation:

projects:
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv

The file prokka-db.csv should list high-quality annotated genomes whose annotations you want Prokka to prioritize.

prokka-db.csv example for Actinomycete group:

Accession	Strain Description
GCA_000203835.1	Streptomyces coelicolor A3(2)
GCA_000196835.1	Amycolatopsis mediterranei U32
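A malformed accession is easier to catch before the workflow starts. The sketch below is not part of BGCFlow: `check_prokka_db` is a hypothetical helper, and it assumes the file is comma-separated with an `Accession` header column (the table above only shows the rendered columns). Pass a different `delimiter` if your file is tab-separated.

```python
import csv
import io
import re

# NCBI assembly accessions look like GCA_000203835.1 or GCF_000203835.1.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def check_prokka_db(text, delimiter=","):
    """Return the accessions that do not look like NCBI assembly accessions.

    Hypothetical helper (not part of BGCFlow); assumes a header row that
    contains an 'Accession' column.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    if reader.fieldnames is None or "Accession" not in reader.fieldnames:
        raise ValueError("prokka-db file needs an 'Accession' column")
    return [
        row["Accession"]
        for row in reader
        if not ACCESSION_PATTERN.match((row["Accession"] or "").strip())
    ]


example = (
    "Accession,Strain Description\n"
    "GCA_000203835.1,Streptomyces coelicolor A3(2)\n"
    "GCA_000196835.1,Amycolatopsis mediterranei U32\n"
)
print(check_prokka_db(example))  # → []
```

An empty list means every row passed the format check; it does not confirm that the accession exists at NCBI.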

Taxonomic Placement

The workflow prioritizes user-provided taxonomic placement. To supply it, add the optional parameter gtdb-tax, which points to a file laid out like a GTDB-tk summary file; only the "user_genome" and "classification" columns are required.

gtdbtk.bac120.summary.tsv example:

user_genome	classification
P8-2B-3.1	d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces albidoflavus
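The classification column holds a semicolon-separated GTDB lineage with a rank prefix on each field. As a minimal sketch (the function name `parse_classification` is made up for illustration and is not a BGCFlow function), such a string can be split into named ranks like this:

```python
# GTDB lineage fields carry a one-letter rank prefix followed by "__".
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_classification(classification):
    """Split a GTDB lineage string into a {rank: name} dictionary."""
    ranks = {}
    for field in classification.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if not sep or prefix not in RANK_PREFIXES:
            raise ValueError(f"unexpected lineage field: {field!r}")
        ranks[RANK_PREFIXES[prefix]] = name
    return ranks


lineage = (
    "d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales;"
    "f__Streptomycetaceae;g__Streptomyces;s__Streptomyces albidoflavus"
)
taxonomy = parse_classification(lineage)
print(taxonomy["genus"])    # → Streptomyces
print(taxonomy["species"])  # → Streptomyces albidoflavus
```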

If gtdb-tax is not provided, the workflow uses the closest_placement_reference column in the sample file (see above). The value must be a valid genome accession in the latest GTDB release (currently R202); otherwise, the workflow raises an error.

If neither is provided, the workflow guesses the taxonomic placement as follows:

If the source is ncbi, it tries to find the accession via the GTDB API.
If that returns no information, it uses the genus table to find the parent taxonomy via the GTDB API, which results in _genus_ sp. preceded by the matching parent taxonomy.
If neither option finds any taxonomic information, it returns empty taxonomic values.
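The fallback order above can be sketched as a small function. This is an illustration under stated assumptions, not the workflow's actual code: `lookup_accession` and `lookup_genus` are hypothetical callables standing in for GTDB API queries, the `source`, `genome_id`, and `genus` keys are assumed field names, and the `s__` formatting of the result is a guess at how "_genus_ sp." is appended.

```python
def guess_taxonomy(sample, lookup_accession, lookup_genus):
    """Follow the documented fallback order and return a lineage string.

    Both lookup arguments are hypothetical stand-ins for GTDB API calls;
    each takes a string and returns a lineage string or None.
    """
    # Step 1: NCBI genomes are looked up directly by accession.
    if sample.get("source") == "ncbi":
        lineage = lookup_accession(sample["genome_id"])
        if lineage:
            return lineage
    # Step 2: fall back to the genus and its parent taxonomy.
    genus = sample.get("genus")
    if genus:
        parent = lookup_genus(genus)
        if parent:
            return f"{parent};s__{genus} sp."  # assumed output format
    # Step 3: nothing found, so return an empty value.
    return ""


# Usage with stubbed lookups (no network access involved).
sample = {"source": "custom", "genome_id": "P8-2B-3.1", "genus": "Streptomyces"}
result = guess_taxonomy(
    sample,
    lookup_accession=lambda accession: None,
    lookup_genus=lambda genus: "f__Streptomycetaceae;g__Streptomyces",
)
print(result)  # → f__Streptomycetaceae;g__Streptomyces;s__Streptomyces sp.
```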

Running multiple projects

You can have multiple projects running by starting a new line of project information with "-":

projects:
# Project 1
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv
# Project 2
  - name: example_2
    samples: config/samples_2.csv

Note that each project must have a unique name and samples value.
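A quick pre-flight check for this uniqueness rule might look like the sketch below. It assumes the `projects` list has already been loaded from the config file into plain dictionaries; `find_duplicates` is a hypothetical helper, not a BGCFlow function.

```python
from collections import Counter


def find_duplicates(projects, key):
    """Return the values of `key` that appear in more than one project."""
    counts = Counter(project[key] for project in projects)
    return sorted(value for value, n in counts.items() if n > 1)


projects = [
    {"name": "example", "samples": "config/samples.csv",
     "prokka-db": "config/prokka-db.csv"},
    {"name": "example_2", "samples": "config/samples_2.csv"},
]

for key in ("name", "samples"):
    duplicates = find_duplicates(projects, key)
    if duplicates:
        raise ValueError(f"duplicate project {key}: {duplicates}")
print("project names and sample files are unique")
```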

Setting custom resources/databases folder

By default, the resources folder containing software and database dependencies is stored in the resources/ directory.

If you already have these resources elsewhere on your local machine, you can tell the workflow where they are:

resources_path:
  antismash_db: $HOME/your_local_directory/antismash_db
  eggnog_db: $HOME/your_local_directory/eggnog_db
  BiG-SCAPE: $HOME/your_local_directory/BiG-SCAPE
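Before a long run, it can help to confirm that each configured path resolves to an existing directory. The following is a minimal sketch with a hypothetical helper, `resolve_resources`; how BGCFlow itself expands `$HOME` is not stated here, so the expansion shown is only what the Python standard library does.

```python
import os


def resolve_resources(resources_path):
    """Expand environment variables and '~', then check each directory.

    Returns {name: (expanded_path, exists)}.
    """
    resolved = {}
    for name, raw_path in resources_path.items():
        path = os.path.expanduser(os.path.expandvars(raw_path))
        resolved[name] = (path, os.path.isdir(path))
    return resolved


resources_path = {
    "antismash_db": "$HOME/your_local_directory/antismash_db",
    "eggnog_db": "$HOME/your_local_directory/eggnog_db",
    "BiG-SCAPE": "$HOME/your_local_directory/BiG-SCAPE",
}
for name, (path, exists) in resolve_resources(resources_path).items():
    status = "found" if exists else "MISSING"
    print(f"{name}: {path} [{status}]")
```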

List of Configurable Features

The following rule keywords can be run within BGCFlow.

Keywords	Description	Links
seqfu	Returns contig statistics of the genomes	SeqFu
mlst	Returns genome classification within multi-locus sequence types (STs)	mlst
refseq_masher	Identify the closest 10 NCBI Refseq genomes	RefSeq Masher
mash	Calculate genomic distance using MASH	MASH
fastani	Calculate nucleotide distance using fastANI	fastANI
checkm	Assess genome quality	CheckM
gtdbtk	Identify taxonomy of genomes using GTDB-toolkit	GTDBTk
prokka-gbk	Returns annotated `.gbk` files	Prokka
diamond	Create diamond database for alignment	DIAMOND
antismash-summary	Summary of BGCs statistics	antiSMASH
antismash-zip	Returns zipped antiSMASH result	antiSMASH
query_bigslice	Query BGCs with BiG-FAM db*	BiG-SLICE
bigscape	Build Gene Cluster Families with BiG-SCAPE	BiG-SCAPE
bigslice	Build Gene Cluster Families with BiG-SLICE	BiG-SLICE
automlst_wrapper	Build a phylogenomic tree with autoMLST wrapper	autoMLST-wrapper, autoMLST
roary	Build Pangenome	Roary
eggnog	Functional annotation with EggNOG-mapper	EggNOG-mapper
deeptfactor	Prediction of transcription factors with DeepTFactor	DeepTFactor
roary++	Apply multiple tools together with Roary pangenome (diamond, automlst_wrapper, eggnog, deeptfactor)	Roary
cblaster-genome	Generate cblaster databases for genomes in project	cblaster
cblaster-bgcs	Generate cblaster databases for BGCs in project	cblaster

Using snakemake profiles for further configurations

When using different machines, you can, for example, adapt the number of threads required for each rule using a Snakemake profile. An example is given in config/examples/_profile_example/config.yaml:

set-threads:
  - antismash=4
  - arts=4
  - bigscape=32
  - bigslice=16

You can run a Snakemake job with the above profile using:

snakemake --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run

Or also with a defined config file:

snakemake --configfile config/examples/_config_example.yaml --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run

Configuring Rule Parameters

Switching between antiSMASH versions

From BGCFlow version 0.7.0, users can switch between antiSMASH 6.1.1 and 7.0.0 by changing the version parameter in the global config.yaml file:

rule_parameters:
  antismash:
    version: 7 # valid versions: 6, 7

Using different GTDB releases

From BGCFlow version 0.7.0, users can switch between GTDB releases by changing the parameters in the global config.yaml file:

rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"

Check for valid release versions at https://data.gtdb.ecogenomic.org/releases/. The release and release_version values correspond to the URL components in these examples:

release 214.1, release_version r214 --> https://data.gtdb.ecogenomic.org/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
release 207, release_version r207_v2 --> https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
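The pattern in the two example URLs can be written out as a small function. This is inferred from those examples only and may differ from how the install rule actually builds the URL; `gtdbtk_data_url` is a hypothetical name. Note the assumption that `release` carries the minor version exactly as it appears in the URL path ("207.0" for the second example).

```python
def gtdbtk_data_url(release, release_version):
    """Build a GTDB-Tk data URL following the pattern of the examples above.

    Assumes `release` includes the minor version as used in the path,
    e.g. "214.1" or "207.0".
    """
    major = release.split(".")[0]
    return (
        "https://data.gtdb.ecogenomic.org/releases/"
        f"release{major}/{release}/auxillary_files/"
        f"gtdbtk_{release_version}_data.tar.gz"
    )


print(gtdbtk_data_url("214.1", "r214"))
print(gtdbtk_data_url("207.0", "r207_v2"))
```

Printing the URL and checking it against the release listing before editing config.yaml avoids a failed download partway through installation.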

Using `ani_screen` in GTDB-tk

From BGCFlow version 0.7.0, users can turn on the GTDB-tk ani_screen by setting the parameter in the global config.yaml file:

rule_parameters:
  gtdbtk:
    ani_screen: TRUE

Using GTDB_API Offline mode

Set use_gtdb_api to False to enter offline mode:

rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"
  gtdbtk:
    ani_screen: FALSE
  antismash:
    version: "7" # valid versions: 6, 7
  use_gtdb_api: FALSE # use GTDB API to get taxonomy information

Offline mode can be combined with a custom taxonomic assignment for the project, taken from: https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv

There is a script to reformat this table for BGCFlow. Example usage:

python workflow/scripts/grab_gtdb_tax_table.py --url "https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv" --outfile config/Lactobacillus_delbrueckii/bac120_taxonomy_r214.tsv
