
AGNOSTOS Workflow User Guide

wiki under construction!

Table of Contents

General

Requirements

Installation

Usage

General

AGNOSTOS is the computational interpretation of our conceptual framework. It relies on extensive quality control of the inferred gene clusters and remote homology methods to dive into the twilight zone of sequence similarity.

The conceptual framework

We have created four main categories to partition the coding sequence space:

Known with Pfam annotations: This category contains genes annotated with one or more Pfam entries (domains, families, repeats or motifs; hereafter referred to as Pfam domains), excluding the domains of unknown function (DUFs).

Known without Pfam annotations: This category contains genes that have a known function but lack a Pfam annotation.

Genomic Unknown: This category contains genes of unknown function (DUFs are included here) that are found in sequenced or draft genomes.

Environmental Unknown: This category contains genes of unknown function not detected in sequenced or draft genomes, but only in environmental metagenomes or metagenome-assembled genomes.

These categories are meant to be dynamic, and clusters can move from one to another as their level of characterisation changes. Combining the idea of bringing structure into the UNKNOWN functional space with a sequence-clustering approach, we created a workflow that combines state-of-the-art clustering strategies (Steinegger and Söding 2017) with a strong emphasis on cluster quality (in terms of high intra-cluster homogeneity) and a deep categorisation of the clusters of unknowns.

The computational workflow

We built it from five main consecutive steps, which start from gene prediction and end in an in-depth ORF cluster classification and the aggregation of clusters into cluster communities. The clustering method is based on sequence similarity, and for the functional annotation we used a protein-domain-based approach, with the idea of using domain architectures as a constraint for our clusters. We validated both the intra-cluster sequence composition and the functional annotation. During the categorisation we also considered the detection of remote homologies, applying HMM-profile vs HMM-profile searches. This was done at the cluster level as well, to aggregate clusters showing distant homologies into cluster communities.

Check our website for a detailed description of the methods for each workflow step (https://dark.metagenomics.eu/workflow).

Requirements

AGNOSTOS is a complex workflow that relies on many external dependencies. Furthermore, to achieve high sensitivity, some of the steps are computationally expensive, especially for large datasets. For this reason, some steps of AGNOSTOS use MPI-compiled programs to be able to process large amounts of data.

In general, AGNOSTOS was implemented on a UNIX operating system and tested on Linux. The workflow was developed using the de.NBI Cloud (https://www.denbi.de/cloud). We used a cluster setup with 10 nodes of 28 cores and 252 GB of memory each. The cluster was built using BiBiGrid (https://github.com/BiBiServ/bibigrid) and uses SLURM for job scheduling and cluster management.

The majority of the programs needed by the workflow will be installed via conda (check the environment here).
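
If you prefer to build those conda environments up front rather than at the first run, Snakemake can create them without executing any rules. A minimal sketch, assuming a Snakemake version that supports the flag (older releases call it --create-envs-only):

# from the workflow directory, pre-build all conda environments
# without running any workflow rules
snakemake --use-conda --conda-create-envs-only -j 4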

Additional packages not found in conda can be installed using the installation_script.sh.

The installation of these packages requires CMake >= 3.3 and OpenMPI.
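
A quick sanity check before running the installation script (mpirun ships with OpenMPI):

# should report CMake >= 3.3
cmake --version

# should report the Open MPI version
mpirun --version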

We hope to soon provide Singularity images to make the whole process more accessible.

Additionally, AGNOSTOS relies on several public databases, which are used for the GC classification. The list of required databases can be found in the script download_DBs.sh.

Installation

AGNOSTOS can be installed for Linux.

To set up AGNOSTOS on your computer, follow these steps:

  1. Clone the repository: git clone https://github.com/functional-dark-side/agnostos-wf and cd agnostos-wf/

  2. Install packages not in Conda. Please check the installation script installation_script.sh (sh installation_script.sh) and, in case you are missing any of the listed programs, install them using the commands from the script.

  3. Note on MMseqs2: the program can be installed via conda; however, the latest workflow version was tested using commit "9cc89aa594131293b8bc2e7a121e2ed412f0b931", and newer releases could affect the workflow performance and results (see the build sketch after this list).

  4. Check that you have the required external DBs listed in the config.yaml file (under "Databases"). In case some of them are missing, you can find the download instructions in the script download_DBs.sh. If you want to download all the needed databases at once, run sh download_DBs.sh (please note that this requires about 350 GB of space and may take a while...). Alternatively, the DBs can be downloaded separately in the various steps and removed after use by specifying db_mode: "memory" in the config.yaml file.
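
If you want to match the MMseqs2 commit mentioned in step 3, a minimal sketch for compiling it from source is shown below (assuming a standard build toolchain with git, CMake and a C++ compiler; the install location is illustrative):

# fetch MMseqs2 and pin the tested commit
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
git checkout 9cc89aa594131293b8bc2e7a121e2ed412f0b931

# standard out-of-source CMake build
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install

# make the mmseqs binary visible to the workflow
export PATH="$(pwd)/bin:$PATH"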

Usage

The workflow is based on Snakemake for the easy processing of large datasets in a reproducible manner. It provides three different strategies to analyse the data.

AGNOSTOS analytical environment (https://doi.org/10.1101/2021.06.07.447314)

The DB-creation module creates a gene cluster database, and validates and partitions the gene clusters (GCs) into the main functional categories.

The DB-update module allows the integration of new sequence data (either at the contig or predicted gene level) into existing GC databases.

The profile-search function lets you quickly screen your sequences against the GC PSSM profiles in the database.

Run the workflow modules

First, check the configuration files (.yaml) in the config/ folder. To change the program and output paths to your designated folders you can use the following commands:

# cd into the workflow directory
cd workflow/

sed -i 's|/vol/cloud/agnostos-wf/workflow|/your/workflow/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos-wf/workflow|/your/workflow/path|g' config/config_communities.yaml

# your data directory
sed -i 's|/vol/cloud/agnostos_test/db_update_data|/your/data/path|g' config/config.yaml

# your results directory
sed -i 's|/vol/cloud/agnostos_test/db_update|/your/results/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos_test/db_update|/your/results/path|g' config/config_communities.yaml

# the directory of the existing GC database
sed -i 's|/vol/cloud/agnostosDB|/your/GC_DB/path|g' config/config.yaml

# the directory to the external databases
sed -i 's|/vol/cloud/agnostos-wf/databases|/your/external_database/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos-wf/databases|/your/external_database/path|g' config/config_communities.yaml

# OPTIONAL: the directory to the binaries needed by the workflow,
# by default in the workflow folder under the directory bin/
sed -i 's|/vol/cloud/agnostos-wf/bin/|/your/binaries/path|g' config/config.yaml
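
After running the sed commands, it can be worth confirming that no default /vol/cloud paths remain (a simple check run from the workflow directory; empty output means all placeholders were replaced):

# list any leftover default paths in the configuration files
grep -n "/vol/cloud" config/config.yaml config/config_communities.yaml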

Additionally, you will have to specify whether your data consists of contigs (data_stage: "contigs"), self-predicted gene sequences (data_stage: "genes"), or gene predictions retrieved with anvi'o (data_stage: "anvio_genes"), and provide the name of the input files in the config.yaml file in the following entries:

# Gene or contig file
data: "/your/data/path/your_genes.fasta"

# specify at which stage your data is: "contigs", "genes" or "anvio_genes"
data_stage: "genes" #"contigs" or "genes" or "anvio_genes"

# If you already have the gene predictions, please provide path to gene completeness information
## In case your data comes from an anvi'o contigDB, please specify here the anvi'o gene_calls.tsv file,
## retrieved via "anvi-export-gene-calls -c CONTIGS.db -o anvio_gene_calls.tsv"
data_partial: "/vol/cloud/agnostos_test/db_update_data/new_genes_partial_info.tsv"

NB: In case you have separate files for each contig or gene prediction, please concatenate the files into a single multi-fasta file.
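
For example, assuming your predictions are split across several fasta files (the file names here are hypothetical):

# merge per-sample gene predictions into one multi-fasta file
cat /your/data/path/sample_*_genes.fasta > /your/data/path/your_genes.fasta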

DB-CREATION

The DB-creation module starts from a set of genomic/metagenomic contigs in fasta format and retrieves a database of categorised gene clusters and cluster communities.

DB-creation module

To run the module:

cd workflow/
snakemake --use-conda -j 100 --config module="creation" --cluster-config config/cluster.yaml --cluster "sbatch --export=ALL -t {cluster.time} -c {threads} --ntasks-per-node {cluster.ntasks_per_node} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus_per_task} --job-name {rulename}.{jobid} --partition {cluster.partition}" -R --until creation_workflow_report
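
Before submitting jobs to the cluster, you can preview the rules Snakemake would execute with a dry run (the -n flag uses the same configuration but submits nothing):

# dry run: print the planned jobs without executing them
snakemake -n --use-conda --config module="creation" --until creation_workflow_report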

To test this module, you can download a small dataset of 10K contigs from two TARA Oceans samples:

mkdir -p agnostos_test
cd agnostos_test
wget https://figshare.com/ndownloader/files/31247614 -O db_creation_data.tar.gz
tar -xzvf db_creation_data.tar.gz

DB-UPDATE

The DB-update module integrates your genomic/metagenomic contigs or genes into an existing GC database such as the agnostosDB dataset, which is stored on Figshare (https://doi.org/10.6084/m9.figshare.12459056) and publicly available for download. In case you cannot download the whole dataset, given the large size of many of the files, the workflow will download the necessary files for each step and then remove them. A description of the agnostosDB files can be found in [AgnostosDB_README.md](AgnostosDB_README.md).

The main benefit here, besides scalability, is that no data is left behind and it is immediately placed in a greater context thanks to the contextual data associated with the existing GCs.

DB-update module

  • To run the DB-update module of the workflow, you just need to enter the folder, modify the config.yaml and config_communities.yml files to specify your input data and output paths (see the usage file), and then run the same command shown above, this time setting the configuration parameter 'module' to "update":
cd workflow/
snakemake -s Snakefile --use-conda -j 100 --config module="update" --cluster-config config/cluster.yaml --cluster "sbatch --export=ALL -t {cluster.time} -c {threads} --ntasks-per-node {cluster.ntasks_per_node} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus_per_task} --job-name {rulename}.{jobid} --partition {cluster.partition}" -R --until update_workflow_report

To test this module, you can download a small dataset of 5K contigs from a TARA Oceans sample:

mkdir -p agnostos_test
cd agnostos_test

wget https://ndownloader.figshare.com/files/25473335 -O db_update_data.tar.gz
tar -xzvf db_update_data.tar.gz

If you ran the DB-creation module on the TARA test dataset, you can use that GC database to test the DB-update module. To do this, you first need to copy or move the MMseqs2 cluster database into the final DB-creation results folder:

# In the agnostos_test/ folder

mv db_creation/mmseqs_clustering db_creation/clusterDB_results/

This is a general rule: if you want to run the DB-update module on the results of the DB-creation module, first copy or move the cluster database ("mmseqs_clustering/") into the final DB-creation results folder.

Alternatively, you can run the update module integrating the new data into one of the AGNOSTOS databases, which can be downloaded from Figshare:

  • seedDB

    • 1,829 marine and human metagenomic assemblies and 28,941 bacterial and archaeal genomes, for a total of 5,287,759 validated GCs and 335,439,673 genes.
  • seedDB + TARA giant viruses

    • Integration of 3,243 environmental and cultivar viral genomes affiliated to the phylum Nucleocytoviricota (realm Varidnaviria) and close relatives into the seedDB. The resulting DB contains 5,383,876 GCs and 336,513,365 genes.
  • seedDB + TARA giant viruses + TARA eukaryotes [fixing bug...]

    • Integration of 713 environmental genomes of unicellular plankton eukaryotes. The resulting DB contains 6,572,081 validated GCs and 341,655,294 genes.

PROFILE-SEARCH

The profile-search module does not require an HPC environment and can be run on a local computer by following the steps below, installing MMseqs2 first in case you don't have it yet:

# download the AGNOSTOS seed database gene cluster profiles
wget https://figshare.com/ndownloader/files/30998305 -O mmseqs-profiles.tar.gz
tar -xzvf mmseqs-profiles.tar.gz

# download the AGNOSTOS seed database gene cluster categories
wget https://ndownloader.figshare.com/files/23067140 -O cluster_ids_categ.tsv.gz
gunzip cluster_ids_categ.tsv.gz

# Run the sequence-profile search
Profile_search/profile_search.sh --query your-genes.fasta --clu_hmm mmseqs-profiles/clu_hmm_db --clu_cat cluster_ids_categ.tsv --mmseqs /path/to/mmseqs --mpi FALSE --threads 8

NOTE: This function can also be run on macOS, but you will probably need to install gnu-getopt, which supports long options (--). For this you can use the command conda install -c bioconda gnu-getopt or, with Homebrew, brew install gnu-getopt.