3. Build a model

Helena Cooper edited this page Jul 9, 2024 · 24 revisions

The Bactabolize draft_model function

bactabolize draft_model \
  --assembly_fp test_data/data/working_model/K.pneumoniae_KPPR1.gbk \
  --ref_genes_fp KpSC_pan-metabolic_model_v2_nucl.fna \
  --ref_proteins_fp KpSC_pan-metabolic_model_v2_prots.faa \
  --ref_model_fp KpSC_pan-metabolic_model_v2.json \
  --biomass_reaction_id BIOMASS_Core_Feb2022 \
  --output_fp K_pneumoniae_KPPR1

Metabolic reconstructions ('models') are generated with the bactabolize draft_model command, which requires an input genome assembly and a reference model. Bactabolize uses BLAST+ to identify orthologs of the reference model coding sequences (CDSs) among the input genome CDSs, based on bi-directional best hits. Once a draft model is constructed, Bactabolize automatically tests whether it can simulate growth on your choice of medium under your choice of atmosphere. We recommend testing conditions in which all of your strains are expected to grow; for example, for Klebsiella pneumoniae we use M9 minimal medium under an aerobic atmosphere, which are the default options. If the model does not simulate growth under these conditions, Bactabolize automatically attempts to identify the essential missing reactions via the COBRApy gap-filling function. These reactions can then be added to the model with bactabolize patch_model.
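The bi-directional best hit idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not Bactabolize's own code: the function name, the dictionary layout and the use of bitscore to rank hits are all assumptions made for the example.

```python
def bidirectional_best_hits(forward_hits, reverse_hits):
    """Return (reference_gene, genome_gene) pairs that are each other's best hit.

    forward_hits: dict mapping a reference gene ID to a list of
                  (genome_gene_id, bitscore) tuples (reference vs genome search).
    reverse_hits: dict mapping a genome gene ID to a list of
                  (reference_gene_id, bitscore) tuples (genome vs reference search).
    Both layouts are hypothetical and chosen for illustration only.
    """
    def best(hits):
        # Highest bitscore wins; on ties, the first hit listed is kept.
        return max(hits, key=lambda hit: hit[1])[0] if hits else None

    pairs = []
    for ref_gene, hits in forward_hits.items():
        genome_gene = best(hits)
        if genome_gene is None:
            continue
        # Keep the pair only if the reverse search points back to the same gene.
        if best(reverse_hits.get(genome_gene, [])) == ref_gene:
            pairs.append((ref_gene, genome_gene))
    return pairs


forward = {"refA": [("g1", 500.0), ("g2", 300.0)], "refB": [("g1", 200.0)]}
reverse = {"g1": [("refA", 500.0), ("refB", 200.0)], "g2": [("refA", 300.0)]}
orthologs = bidirectional_best_hits(forward, reverse)
```

In this toy input refB's best hit is g1, but g1's best hit is refA, so only the refA/g1 pair is reciprocal.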

draft_model flow chart

Input files

Bactabolize requires two inputs:

1. Genome assembly(ies) representing the strain(s) for which models should be built

This can be either an unannotated fasta file or an annotated genbank file.

Bactabolize will honour the existing coding sequence regions marked in the genbank file and/or will use Prodigal to perform an additional search for coding sequences. You can use completed or draft genome assemblies, but it is important to consider the assembly quality, particularly for draft genomes. Poor-quality assemblies, i.e. those that are highly fragmented or contain many homopolymer sequence errors, will result in low-quality models with many missing genes and reactions. For the Klebsiella pneumoniae Species Complex we recommend minimum draft genome assembly quality thresholds, applied hierarchically:

If the assembly has ≤200 assembly graph dead ends (calculated from the .gfa or fastg file), build the model. If the assembly graph is not available, build the model if the N50 is ≥65000 bp. If neither is available, build the model if the assembly has ≤130 contigs. The final two metrics can be species-specific, so you may need to adjust them. See the Bactabolize paper for a description of our quality control framework and an explanation of these thresholds.
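One way to read the hierarchy above is as a fall-through check that uses the first metric available. The sketch below encodes that reading; the function name and argument names are hypothetical, and the thresholds are the K. pneumoniae Species Complex values quoted above, so they may need changing for other species.

```python
def passes_assembly_qc(dead_ends=None, n50=None, contigs=None):
    """Apply the hierarchical QC thresholds to whichever metric is available.

    Metrics that were not calculated are passed as None. The first available
    metric in the order dead ends, N50, contig count decides the outcome.
    """
    if dead_ends is not None:
        return dead_ends <= 200      # assembly graph dead ends
    if n50 is not None:
        return n50 >= 65000          # N50 in bp
    if contigs is not None:
        return contigs <= 130        # number of contigs
    raise ValueError("at least one assembly metric is required")
```

For example, passes_assembly_qc(n50=120000, contigs=95) is decided by the N50 alone because no dead-end count was supplied.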

2. Reference model (plus its associated sequence data)

The reference model should be provided in BiGG-compatible JSON format (you can use this helper script to convert models from SEED formats into Bactabolize-compatible files). The corresponding gene and protein sequences can be provided either as two separate multi-fasta files (useful for references derived from multiple strains, which we call 'pan-reference' models), or within a single genbank annotation file (useful for single-strain references). The gene IDs in the sequence data must exactly match those in the reference model file.
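Because the gene IDs must match exactly, it can be worth checking them before building models. The following is a minimal sketch, not part of Bactabolize: the function names are hypothetical, and it assumes the model JSON has a top-level "genes" list whose entries carry an "id" field (as in COBRApy-style JSON) and that each fasta ID is the first whitespace-delimited token of the header line.

```python
import json


def fasta_ids(fasta_path):
    """Collect sequence IDs (first token after '>') from a fasta file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def missing_gene_ids(model_json_path, fasta_path):
    """Return model gene IDs that have no matching sequence in the fasta file."""
    with open(model_json_path) as handle:
        model = json.load(handle)
    model_ids = {gene["id"] for gene in model.get("genes", [])}
    return sorted(model_ids - fasta_ids(fasta_path))
```

An empty list from missing_gene_ids means every gene in the model has a sequence with the same ID; the check can be run once for the nucleotide file and once for the protein file.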

To accurately capture the metabolism of your intended strains and build high-quality models, it is important to choose an appropriate reference, ideally one that has been curated and validated using phenotype data. We recommend using a reference representing a strain that is closely related to your strain(s) of interest or, better yet, a 'pan-reference' that captures as much diversity as possible for your species of interest. Pan-references can be particularly useful for highly diverse species, i.e. those with large pan-genomes where gene content (including metabolism-associated genes) can differ substantially between strains, e.g. E. coli, K. pneumoniae and Pseudomonas aeruginosa.

The following pan-reference metabolic models are available:

| Species | Compatible with Bactabolize | Database | Reference |
| --- | --- | --- | --- |
| K. pneumoniae Species Complex | Yes | Model & Associated Sequences | Cooper 2024, MGen |
| Salmonella | No | Model | Seif 2018, Nat Commun |
| Bacillus subtilis | Yes | Model & Gene Annotations | Blázquez 2023, Int. J. Mol. Sci. |
| Lactobacillaceae | No | Model | Ardalani 2024, mSystems |

Note that any JSON model can be used as a reference and is compatible with Bactabolize, so long as the nucleotide and protein sequences corresponding to the GPRs in the model are available.

Outputs

Bactabolize outputs a draft metabolic model in BiGG-compatible JSON format and, optionally, in SBML (Level 3 Version 1) format, which is useful as input to third-party tools such as Fluxer. It can also output the updated coding sequence annotations (optional) and a memote report of model quality (optional). See the table below for file naming conventions.

| Filename | Description |
| --- | --- |
| assembly_id.gbk | Prodigal annotation of input assembly |
| assembly_id_model.json | Metabolic model in .json format |
| assembly_id_model.xml | Metabolic model in .xml format (SBML Level 3 Version 1) |
| assembly_id_model.html | Memote model report |
| assembly_id_gene_dictionary.csv | Matched gene names between the genbank and model gene annotations |
| assembly_id_unannotated_sequences.fasta | Fasta file of unannotated sequences identified by Bactabolize |

Command options

Required

--assembly_fp - Input assembly for which a metabolic model will be generated.

--output_fp - Output filename

--ref_model_fp - Reference model in .json format.

--ref_genes_fp AND --ref_proteins_fp - Reference gene sequences and matched translated amino acid sequences (fasta).

OR

--ref_genbank_fp - Reference genome (genbank).

Optional

--biomass_reaction_id - reaction ID of reference model's Biomass function. DEFAULT: BIOMASS_

--media_type - Growth media for initial model growth simulation test. One of: cdm_mendoza, bg11, lb, lb_carveme, m9, nutrient, pmm5_mendoza, pmm7_mendoza, tsa, tsa_sheep_blood. DEFAULT: m9

--atmosphere_type - Atmosphere for model testing. One of: aerobic, anaerobic. DEFAULT: aerobic

--min_coverage - Set minimum query coverage (%) for ortholog identification via bi-directional best hit BLAST+. DEFAULT: 25

--min_pident - Set minimum identity (%) for ortholog identification via bi-directional best hit BLAST+. DEFAULT: 80

--min_ppos - Set minimum protein similarity (positives, %) for ortholog identification via bi-directional best hit BLAST+. DEFAULT: OFF. Can be used instead of --min_pident to tolerate residues that differ but are functionally similar.

--memote_report_fp - output file path for memote model quality report. Note that this will significantly increase compute time e.g. +5 minutes PER assembly on a standard 1.60GHz laptop.

--no_reannotation - Prevents the re-annotation of input genome assemblies provided in genbank format. DEFAULT: Off, i.e. Prodigal is used to search for additional coding regions.
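To make the --min_coverage, --min_pident and --min_ppos options more concrete, here is a rough sketch of how such thresholds could be applied to a single BLAST+ hit. It is an assumption-laden illustration rather than Bactabolize's implementation: the function name is hypothetical, the hit is a plain dictionary keyed by the BLAST+ tabular fields qcovs, pident and ppos, and the choice to let a supplied min_ppos replace the identity check follows the "instead of" wording above.

```python
def passes_thresholds(hit, min_coverage=25.0, min_pident=80.0, min_ppos=None):
    """Decide whether one BLAST+ hit meets the ortholog thresholds.

    hit: dict with 'qcovs' (query coverage, %), 'pident' (identity, %)
         and 'ppos' (positive-scoring matches, %).
    """
    if hit["qcovs"] < min_coverage:
        return False
    if min_ppos is not None:
        # Similarity threshold supplied: use it in place of identity.
        return hit["ppos"] >= min_ppos
    return hit["pident"] >= min_pident


example_hit = {"qcovs": 90.0, "pident": 78.0, "ppos": 88.0}
```

With the defaults, example_hit fails on identity (78 < 80), whereas calling passes_thresholds(example_hit, min_ppos=80.0) accepts it on similarity.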

Examples

# Create draft model for genbank input assembly using genbank reference on M9 media
# under aerobic conditions, at 25% query coverage and 80% protein similarity
bactabolize draft_model \
    --assembly_fp data/working_model/K.pneumoniae_KPPR1.gbk \
    --ref_genbank_fp reference.gbk \
    --ref_model_fp reference_model.json \
    --biomass_reaction_id biomass_equation \
    --media_type m9 \
    --atmosphere_type aerobic \
    --output_fp input_assembly_qc_25_sim_80 \
    --min_coverage 25 \
    --min_ppos 80

# Create draft model for fasta input assembly using multi-fasta reference, anaerobically on PMM7 media
# at 25% query coverage and 75% protein identity. Produce memote report
bactabolize draft_model \
    --assembly_fp input_assembly.fasta \
    --ref_genes_fp KpSC_pan-metabolic_model_v1_nucl.fna \
    --ref_proteins_fp KpSC_pan-metabolic_model_v1_prots.faa \
    --ref_model_fp KpSC_pan-metabolic_model_v1.json \
    --biomass_reaction_id BIOMASS_Core_Oct2019 \
    --media_type pmm7_mendoza \
    --atmosphere_type anaerobic \
    --output_fp input_assembly_qc_25_id_75 \
    --min_coverage 25 \
    --min_pident 75 \
    --memote_report_fp input_report
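When building models for many assemblies, the same command can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Bactabolize: it only builds the argument list from the options documented above (using the multi-fasta reference form) and names each output after the assembly file, which is an arbitrary choice for this sketch.

```python
from pathlib import Path


def build_draft_model_command(assembly_fp, output_dir, ref_model_fp, ref_genes_fp,
                              ref_proteins_fp, biomass_reaction_id,
                              media_type="m9", atmosphere_type="aerobic"):
    """Build the argument list for one bactabolize draft_model run."""
    assembly = Path(assembly_fp)
    # Output prefix: <output_dir>/<assembly file name without its extension>
    output_fp = Path(output_dir) / assembly.stem
    return [
        "bactabolize", "draft_model",
        "--assembly_fp", str(assembly),
        "--ref_genes_fp", str(ref_genes_fp),
        "--ref_proteins_fp", str(ref_proteins_fp),
        "--ref_model_fp", str(ref_model_fp),
        "--biomass_reaction_id", biomass_reaction_id,
        "--media_type", media_type,
        "--atmosphere_type", atmosphere_type,
        "--output_fp", str(output_fp),
    ]
```

Each returned list can be passed to subprocess.run(command, check=True) inside a loop over assembly files, so that a failed run stops the batch rather than passing silently.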

Growth media

Bactabolize comes with a set of predefined media in the data/fba_specs/ directory (see table below), but users can also define custom media.

Predefined media

| Media type | Reference | File with ingredients |
| --- | --- | --- |
| cdm_mendoza | Mendoza et al., (2019) | cdm_mendoza_spec.json |
| bg11 | This study, ThermoFisher | bg11_spec.json |
| lb | This study. Peptone: BD, 2015, Loginova et al., (1974), ThermoFisher, (2019). Yeast extract: Tomé, (2021), Plata et al., (2013), Liu et al., (2018), Blagović et al., (2001), Blagović et al., (2005), Avramia et al., (2021) | lb_spec.json |
| lb_carveme | Machado et al., (2018) | lb_carveme_spec.json |
| m9 | Norsigian et al., (2010) | m9_spec.json |
| nutrient | This study. Peptone: BD, 2015, Loginova et al., (1974), ThermoFisher, (2019). Beef extract: BD, 2015, ThermoFisher, (2019) | nutrient_spec.json |
| pmm5_mendoza | Mendoza et al., (2019) | pmm5_mendoza_spec.json |
| pmm7_mendoza | Mendoza et al., (2019) | pmm7_mendoza_spec.json |
| tsa | This study. Tryptic soy: BD, 2015, ThermoFisher, (2019), Hagely et al., (2013), Choct et al., (2010). Verbascose was added as an exchange reaction as ‘EX_vrbsc_e’. | tsa_spec.json |
| tsa_sheep_blood | This study. Tryptic soy: BD, 2015, ThermoFisher, (2019), Hagely et al., (2013), Choct et al., (2010). Verbascose was added as an exchange reaction as ‘EX_vrbsc_e’. | tsa_sheep_blood_spec.json |

Custom media

You can define a custom medium by creating a JSON file containing the medium name, the set of exchange reactions corresponding to its components, and the flux values to apply to these reactions (these should be negative values, which specify that the reactions are open).

Exchange reactions are a subset of the model reactions that specify that a particular substrate can be transferred from the extracellular space into the cell. They can be identified by the EX_ prefix.

You can search for exchange reactions corresponding to your metabolites of interest at the BiGG website. Make sure to use the reaction IDs with the _e suffix, which specifies extracellular exchange.

The example below shows the definition for m9 minimal media.

{
  "name": "M9",
  "exchanges": {
    "EX_ca2_e":      -20,
    "EX_cbl1_e":     -0.01,
    "EX_cl_e":       -20,
    "EX_cobalt2_e":  -20,
    "EX_cu2_e":      -20,
    "EX_fe2_e":      -20,
    "EX_fe3_e":      -20,
    "EX_glc__D_e":   -20,
    "EX_h2o_e":      -20,
    "EX_h_e":        -20,
    "EX_k_e":        -20,
    "EX_mg2_e":      -20,
    "EX_mn2_e":      -20,
    "EX_mobd_e":     -20,
    "EX_na1_e":      -20,
    "EX_nh4_e":      -20,
    "EX_ni2_e":      -20,
    "EX_pi_e":       -20,
    "EX_so4_e":      -20,
    "EX_tungs_e":    -20,
    "EX_zn2_e":      -20
  }
}

Once the file is ready, save it with the file name suffix _media.json, e.g. custom1_media.json, and copy it into the bactabolize/data/media_definitions/ directory so that Bactabolize can find it. If you installed Bactabolize via Conda, the directory path will look something like this:

miniconda3/envs/bactabolize/lib/{your-python-version}/site-packages/bactabolize/data/media_definitions/

To use the media with the draft_model function use the --media_type option e.g. --media_type custom1.
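A custom media file in the format described above can also be written from Python, which makes it easy to check the entries first. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the checks (EX_ prefix, _e suffix, negative flux) simply mirror the guidance in this section rather than any validation Bactabolize itself performs.

```python
import json


def write_media_definition(name, exchanges, output_path):
    """Validate exchange entries and write a media definition JSON file.

    exchanges: dict mapping exchange reaction IDs to (negative) flux values.
    """
    for reaction_id, flux in exchanges.items():
        if not (reaction_id.startswith("EX_") and reaction_id.endswith("_e")):
            raise ValueError(f"{reaction_id} is not an extracellular exchange ID")
        if flux >= 0:
            raise ValueError(f"{reaction_id} needs a negative flux, got {flux}")
    with open(output_path, "w") as handle:
        json.dump({"name": name, "exchanges": exchanges}, handle, indent=2)


# Illustrative values only, taken from the M9 example above.
custom_exchanges = {"EX_glc__D_e": -20, "EX_nh4_e": -20, "EX_pi_e": -20}
```

Calling write_media_definition("custom1", custom_exchanges, "custom1_media.json") produces a file with the required suffix, ready to be copied into the media definitions directory. A real medium would need the full set of components required for growth.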