
07 Advanced Configurations

Matin Nuhamunada edited this page Oct 16, 2023 · 3 revisions

Further configuration

Custom Prokka database

You can add an optional parameter: prokka-db, which refers to the location of a .csv file containing a list of your custom reference genomes for prokka annotation:

projects:
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv

The file prokka-db.csv should contain a list of high-quality annotated genomes that you would like to use to prioritize prokka annotations.

prokka-db.csv example for Actinomycete group:

| Accession | Strain | Description |
|---|---|---|
| GCA_000203835.1 | Streptomyces coelicolor A3(2) | |
| GCA_000196835.1 | Amycolatopsis mediterranei U32 | |
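
As a quick sanity check before running the workflow, the accessions in such a file can be compared against the usual NCBI assembly accession pattern (GCA_ or GCF_, nine digits, a version suffix). The sketch below assumes the file is comma-separated with the header row Accession,Strain,Description; the function name invalid_accessions is hypothetical and not part of BGCFlow.

```python
import csv
import io
import re

# NCBI assembly accessions look like GCA_000203835.1 or GCF_000203835.1.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def invalid_accessions(csv_text, column="Accession"):
    """Return accession values that do not match the NCBI assembly pattern."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[column]
        for row in reader
        if not ACCESSION_RE.match(row[column].strip())
    ]


# Assumed on-disk layout of prokka-db.csv (header names taken from the table above).
example = """Accession,Strain,Description
GCA_000203835.1,Streptomyces coelicolor A3(2),
GCA_000196835.1,Amycolatopsis mediterranei U32,
"""

print(invalid_accessions(example))  # an empty list means every accession looks valid
```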

Taxonomic Placement

To make the workflow prioritize a user-provided taxonomic placement, add the optional parameter gtdb-tax, which points to a file in the same format as a GTDB-tk summary file. Only the "user_genome" and "classification" columns are required.

gtdbtk.bac120.summary.tsv example:

| user_genome | classification |
|---|---|
| P8-2B-3.1 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales;f__Streptomycetaceae;g__Streptomyces;s__Streptomyces albidoflavus |
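
If you already have a full GTDB-tk summary table, one way to derive a minimal gtdb-tax file is to keep only the two required columns. This is a sketch using the Python standard library; the function name minimal_gtdb_tax and the extra column other_column are hypothetical, and the input is assumed to be tab-separated.

```python
import csv
import io

REQUIRED = ["user_genome", "classification"]


def minimal_gtdb_tax(summary_tsv_text):
    """Keep only the two required columns of a GTDB-tk style summary table."""
    reader = csv.DictReader(io.StringIO(summary_tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=REQUIRED,
        delimiter="\t",
        extrasaction="ignore",  # silently drop every other column
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


summary = (
    "user_genome\tclassification\tother_column\n"
    "P8-2B-3.1\td__Bacteria;g__Streptomyces;s__Streptomyces albidoflavus\tx\n"
)
print(minimal_gtdb_tax(summary))
```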

If this file is not provided, the workflow will use the closest_placement_reference column in the sample file (see above). Note that the value must be a valid genome accession in the latest GTDB release (currently R202); otherwise, the workflow will raise an error.

If the closest placement reference is not provided either, the workflow will guess the taxonomic placement as follows:

  1. If the source is ncbi, it will try to find the accession via the GTDB API.
  2. If that returns no information, it will use the genus table to find the parent taxonomy via the GTDB API, which results in _genus_ sp. preceded by the matching parent taxonomy.
  3. If neither option finds any taxonomic information, it will return empty taxonomic values.
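
The priority order described in this section can be sketched as a single function. This is an illustration only, not BGCFlow's implementation: resolve_taxonomy, the genome_id key, and the two lookup callables (stand-ins for GTDB API queries that return a classification string or None) are all assumptions, as is the exact format of the genus-level result.

```python
def resolve_taxonomy(sample, gtdb_tax=None, lookup_accession=None, lookup_genus=None):
    """Return a classification string, following the order described above."""
    gtdb_tax = gtdb_tax or {}
    # Hypothetical stand-ins for GTDB API queries; by default they find nothing.
    lookup_accession = lookup_accession or (lambda accession: None)
    lookup_genus = lookup_genus or (lambda genus: None)

    # A user-provided gtdb-tax table takes priority.
    if sample["genome_id"] in gtdb_tax:
        return gtdb_tax[sample["genome_id"]]

    # Next, closest_placement_reference must be a valid GTDB accession.
    reference = sample.get("closest_placement_reference")
    if reference:
        classification = lookup_accession(reference)
        if classification is None:
            raise ValueError(f"{reference} is not a valid GTDB genome accession")
        return classification

    # Guess 1: for NCBI genomes, look up the accession itself.
    if sample.get("source") == "ncbi":
        classification = lookup_accession(sample["genome_id"])
        if classification:
            return classification

    # Guess 2: parent taxonomy of the genus, followed by "<genus> sp.".
    genus = sample.get("genus")
    if genus:
        parent = lookup_genus(genus)
        if parent:
            return f"{parent};s__{genus} sp."

    # Guess 3: nothing found, return an empty value.
    return ""
```

Passing the lookups in as arguments keeps the sketch runnable offline and makes the order of the fallbacks easy to test with stub functions.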

Running multiple projects

You can have multiple projects running by starting a new line of project information with "-":

projects:
# Project 1
  - name: example
    samples: config/samples.csv
    prokka-db: config/prokka-db.csv
# Project 2
  - name: example_2
    samples: config/samples_2.csv

Note that each project must have a unique name and samples value.
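
A small pre-flight check for this uniqueness rule might look like the following. It is a sketch that works on the already-parsed configuration (for example, the dictionary a YAML loader would return); the helper name duplicated_values is hypothetical.

```python
from collections import Counter


def duplicated_values(projects, key):
    """Return values of `key` that appear in more than one project entry."""
    counts = Counter(p[key] for p in projects if key in p)
    return sorted(value for value, n in counts.items() if n > 1)


# Parsed equivalent of the two-project example above.
config = {
    "projects": [
        {
            "name": "example",
            "samples": "config/samples.csv",
            "prokka-db": "config/prokka-db.csv",
        },
        {"name": "example_2", "samples": "config/samples_2.csv"},
    ]
}

for key in ("name", "samples"):
    duplicates = duplicated_values(config["projects"], key)
    if duplicates:
        raise ValueError(f"duplicated project {key}: {duplicates}")
```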

Setting custom resources/databases folder

By default, the resources folder containing software and database dependencies is stored in the resources/ directory.

If you already have the resources folder somewhere else on your local machine, you can tell the workflow where to find each resource:

resources_path:
  antismash_db: $HOME/your_local_directory/antismash_db
  eggnog_db: $HOME/your_local_directory/eggnog_db
  BiG-SCAPE: $HOME/your_local_directory/BiG-SCAPE
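
Before starting a run, it can help to confirm that each configured location actually exists. The sketch below expands environment variables such as $HOME and reports the entries whose directory is missing. The function name missing_resources is hypothetical, and whether BGCFlow itself expands these variables in the same way is not confirmed here.

```python
import os


def missing_resources(resources_path):
    """Expand environment variables and return entries whose directory is absent."""
    missing = {}
    for name, raw_path in resources_path.items():
        path = os.path.expanduser(os.path.expandvars(raw_path))
        if not os.path.isdir(path):
            missing[name] = path
    return missing


# Parsed equivalent of the resources_path example above.
resources_path = {
    "antismash_db": "$HOME/your_local_directory/antismash_db",
    "eggnog_db": "$HOME/your_local_directory/eggnog_db",
    "BiG-SCAPE": "$HOME/your_local_directory/BiG-SCAPE",
}

for name, path in missing_resources(resources_path).items():
    print(f"{name}: directory not found at {path}")
```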

List of Configurable Features

The table below lists the rule keywords that you can run within BGCFlow.

| Keywords | Description | Links |
|---|---|---|
| seqfu | Returns contig statistics of the genomes | SeqFu |
| mlst | Returns genome classification within multi-locus sequence types (STs) | mlst |
| refseq_masher | Identify the closest 10 NCBI Refseq genomes | RefSeq Masher |
| mash | Calculate genomic distance using MASH | MASH |
| fastani | Calculate nucleotide distance using fastANI | fastANI |
| checkm | Assess genome quality | CheckM |
| gtdbtk | Identify taxonomy of genomes using GTDB-toolkit | GTDBTk |
| prokka-gbk | Returns annotated .gbk files | Prokka |
| diamond | Create diamond database for alignment | DIAMOND |
| antismash-summary | Summary of BGCs statistics | antiSMASH |
| antismash-zip | Returns zipped antiSMASH result | antiSMASH |
| query_bigslice | Query BGCs with BiG-FAM db* | BiG-SLICE |
| bigscape | Build Gene Cluster Families with BiG-SCAPE | BiG-SCAPE |
| bigslice | Build Gene Cluster Families with BiG-SLICE | BiG-SLICE |
| automlst_wrapper | Build a phylogenomic tree with autoMLST wrapper | autoMLST-wrapper, autoMLST |
| roary | Build Pangenome | Roary |
| eggnog | Functional annotation with EggNOG-mapper | EggNOG-mapper |
| deeptfactor | Prediction of transcription factors with DeepTFactor | DeepTFactor |
| roary++ | Apply multiple tools together with Roary pangenome (diamond, automlst_wrapper, eggnog, deeptfactor) | Roary |
| cblaster-genome | Generate cblaster databases for genomes in project | cblaster |
| cblaster-bgcs | Generate cblaster databases for BGCs in project | cblaster |

Using snakemake profiles for further configurations

When using different machines, you can, for example, adapt the number of threads required for each rule using a Snakemake profile. An example is given in config/examples/_profile_example/config.yaml:

set-threads:
  - antismash=4
  - arts=4
  - bigscape=32
  - bigslice=16

You can run a Snakemake job with the above profile using:

snakemake --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run

Or also with a defined config file:

snakemake --configfile config/examples/_config_example.yaml --profile config/examples/_profile_example/ --use-conda -c $N -n # remove the dry-run parameters "-n" for the actual run

Configuring Rule Parameters

Switching between antiSMASH versions

From BGCFlow version 0.7.0, users can switch between antiSMASH 6.1.1 and 7.0.0 by changing the version parameter in the global config.yaml file:

rule_parameters:
  antismash:
    version: 7 # valid versions: 6, 7

Using different GTDB releases

From BGCFlow version 0.7.0, users can switch between GTDB releases by changing the release parameters in the global config.yaml file:

rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"

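Check for valid release versions at https://data.gtdb.ecogenomic.org/releases/. The release and release_version values correspond to the names used there: for example, "214.1" is the release folder name and "r214" is the suffix used in file names such as bac120_taxonomy_r214.tsv.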
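
To illustrate how the two values relate, the sketch below rebuilds the taxonomy table URL that appears elsewhere on this page from release and release_version. The URL pattern is inferred from that single example and is an assumption; it may not hold for every GTDB release. The function name gtdb_taxonomy_url is hypothetical.

```python
def gtdb_taxonomy_url(release, release_version, domain="bac120"):
    """Build a GTDB taxonomy table URL from the two config values.

    The pattern is inferred from one known URL and is not guaranteed
    to be valid for every release.
    """
    major = release.split(".")[0]  # "214.1" -> "214"
    return (
        "https://data.gtdb.ecogenomic.org/releases/"
        f"release{major}/{release}/{domain}_taxonomy_{release_version}.tsv"
    )


print(gtdb_taxonomy_url("214.1", "r214"))
```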

Using ani_screen in GTDB-tk

From BGCFlow version 0.7.0, users can turn on GTDB-tk ani_screen by setting the parameter in the global config.yaml file:

rule_parameters:
  gtdbtk:
    ani_screen: TRUE

Using GTDB_API Offline mode

Set use_gtdb_api to FALSE to enter offline mode:

rule_parameters:
  install_gtdbtk:
    release: "214.1"
    release_version: "r214"
  gtdbtk:
    ani_screen: FALSE
  antismash:
    version: "7" # valid versions: 6, 7
  use_gtdb_api: FALSE # use GTDB API to get taxonomy information

This can be combined with giving the project a custom taxonomic assignment from: https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv

There is a script to reformat this table for BGCFlow. Example usage:

python workflow/scripts/grab_gtdb_tax_table.py --url "https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_taxonomy_r214.tsv" --outfile config/Lactobacillus_delbrueckii/bac120_taxonomy_r214.tsv