Commit

update doc
bqminh committed May 17, 2024
1 parent 52d6003 commit eb07bdd
Showing 4 changed files with 98 additions and 50 deletions.
45 changes: 44 additions & 1 deletion doc/Command-Reference.md
@@ -2,7 +2,7 @@
layout: userdoc
title: "Command Reference"
author: 95438353+HectorBanos, Diep Thi Hoang, Dominik Schrempf, Heiko Schmidt, Jana Trifinopoulos, Minh Bui, Thomaskf, Trongnhan Uit
date: 2024-04-17
date: 2024-05-16
docid: 19
icon: book
doctype: manual
@@ -28,6 +28,8 @@ sections:
url: site-specific-frequency-model-options
- name: Tree search parameters
url: tree-search-parameters
- name: Tree search for pathogen data
url: tree-search-for-pathogen-data
- name: Ultrafast bootstrap parameters
url: ultrafast-bootstrap-parameters
- name: Nonparametric bootstrap
@@ -431,6 +433,46 @@ The new IQ-TREE search algorithm ([Nguyen et al., 2015]) has several parameters

    iqtree -s data.phy -m TEST -g constraint.tree

Tree search for pathogen data
-----------------------------
<div class="hline"></div>

For pathogen data such as SARS-CoV-2 virus alignments, version 2.3.4.cmaple implements
the MAPLE algorithm ([De Maio et al., 2023]), which performs tree search very quickly by
exploiting the low divergence of the sequences (i.e., sequences in the alignment
are very similar to each other).

| Option | Usage and meaning |
|----------|------------------------------------------------------------------------------|
| `--pathogen` | Apply the CMAPLE tree search algorithm if sequence divergence is low; otherwise, apply the IQ-TREE algorithm. |
| `--pathogen-force` | Apply the CMAPLE tree search algorithm regardless of sequence divergence. |
| `-alrt` | Specify number of replicates (>=1000) to perform SH-like approximate likelihood ratio test (SH-aLRT) ([Guindon et al., 2010]). |
| `-T` | Specify the number of CPU cores to use only for the SH-aLRT test. If `-T AUTO` is specified, IQ-TREE will use all available cores. NOTE: this option has no effect on tree search, which is still single-threaded. |

### Example usages:

* Infer a maximum-likelihood tree for an alignment, automatically switching to the CMAPLE algorithm
if sequence divergence is low:

        iqtree2 -s data.phy --pathogen --prefix pathogen

It will print two output files:

* `pathogen.treefile`: The best approximate maximum-likelihood tree in NEWICK format.
* `pathogen.log`: The log file.


To run other analyses on this tree without repeating the tree search,
add `-te pathogen.treefile` to the command line of a subsequent IQ-TREE run to fix this tree topology,
and remove the `--pathogen` option to invoke the default IQ-TREE machinery.

* Infer a tree as above and additionally assign branch supports using the SH-aLRT test
with 1000 replicates on 4 CPU cores:

        iqtree2 -s data.phy --pathogen --alrt 1000 -T 4 --prefix pathogen

The tree `pathogen.treefile` will contain branch supports for all internal branches.
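
The two-stage workflow described in this section (a fast `--pathogen` search, then a regular IQ-TREE run on the fixed topology) can be sketched as a small dry-run shell helper. It only prints the two commands and does not execute them. The function name `pathogen_workflow` and the `_fixed` prefix suffix are illustrative assumptions, not part of IQ-TREE; only the options `-s`, `--pathogen`, `-te` and `--prefix` come from the text above.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the two IQ-TREE commands.
# pathogen_workflow and the "_fixed" suffix are hypothetical names.
pathogen_workflow() {
    aln="$1"      # alignment file, e.g. data.phy
    prefix="$2"   # output prefix of the first run, e.g. pathogen

    # Stage 1: fast tree search, switching to CMAPLE if divergence is low.
    echo "iqtree2 -s $aln --pathogen --prefix $prefix"

    # Stage 2: reuse the tree via -te and drop --pathogen so that the
    # default IQ-TREE machinery runs on the fixed topology.
    echo "iqtree2 -s $aln -te $prefix.treefile --prefix ${prefix}_fixed"
}

pathogen_workflow data.phy pathogen
```

Using a different prefix for the second run is a design choice of this sketch: it keeps the second run from overwriting the output files of the first.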

Ultrafast bootstrap parameters
------------------------------
<div class="hline"></div>
@@ -729,6 +771,7 @@ The first few lines of the output file example.phy.sitelh (printed by `-wslr` op
[Adachi and Hasegawa, 1996b]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.476.8552
[Anisimova and Gascuel 2006]: https://doi.org/10.1080/10635150600755453
[Anisimova et al., 2011]: https://doi.org/10.1093/sysbio/syr041
[De Maio et al., 2023]: https://doi.org/10.1038/s41588-023-01368-0
[Felsenstein, 1985]: https://doi.org/10.2307/2408678
[Flouri et al., 2015]: https://doi.org/10.1093/sysbio/syu084
[Gadagkar et al., 2005]: https://doi.org/10.1002/jez.b.21026
96 changes: 49 additions & 47 deletions doc/Estimating-amino-acid-substitution-models.md
@@ -1,8 +1,8 @@
---
layout: userdoc
title: "Estimating amino acid substitution models"
author: Cuongbb, Minh Bui, Thomaskf
date: 2024-04-18
author: 95438353+HectorBanos, Cuongbb, Minh Bui, Thomaskf
date: 2024-05-14
docid: 8
icon: info-circle
doctype: tutorial
@@ -13,26 +13,26 @@ sections:
- name: Estimating a model from a single concatenated alignment
url: estimating-a-model-from-a-single-concatenated-alignment
- name: Estimating a model from a folder of alignments
url: estimating-a-model-from-a-folder-of-alignments
- name: Estimating a non-reversible model
url: estimating-a-non-reversible-model
- name: Estimating linked exchangeabilities
url: estimating-linked-exchangeabilities
---


Estimating amino acid substitution models
==========================

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. Most, if not all, analyses use [empirical amino-acid models](Substitution-Models#protein-models), which were obtained from protein databases; but there has been no useful tool to estimate them for modern datasets at hand. Therefore, we introduced QMaker ([Minh et al., 2021]) as a fast and convenient tool as part of IQ-TREE version 2 to infer a replacement matrix Q for any set of protein alignments.

If you use QMaker or new models (Q.pfam, Q.plant, Q.mammal, Q.bird, Q.insect, Q.yeast), please cite:

> Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, and Robert Lanfear (2021), QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution. _Systematic Biology_ 70: 1046–1060. <https://doi.org/10.1093/sysbio/syab010>
Estimating a model from a single concatenated alignment
-------------------------------------------------------

We first demonstrate the estimation of a reversible model for a clade-specific dataset. Please download and extract the [sample training data](data/plant_10loci.zip). This example data was subsampled from a plant dataset ([Ran et al., 2018]). There are two files in the downloaded folder:

* `alignment.nex` contains the alignment in NEXUS format.
@@ -93,9 +93,9 @@ The amino-acid order in this file is:
A R N D C Q E G H I L K M F P S T W Y V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val


Estimating a model from a folder of alignments
----------------------------------------------

We will now estimate a reversible model from a folder of alignments. Please first download the file [plant_10alignments.zip](data/plant_10alignments.zip). There is a sub-folder named `train_plant` in the downloaded folder. We use the `-S` option instead of the `-s` and `-p` options so that each alignment has its own tree. The `-S` option is typically used with a folder of alignments. The three commands are:

@@ -107,18 +107,18 @@ We will now estimate a reversible model from a folder of alignments. Please firs

    # step 3: extract the resulting reversible matrix
    grep -A 21 "can be used as input for IQ-TREE" train_plant.GTR20.iqtree | tail -n20 > Q.plant
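
The `grep -A 21 ... | tail -n20` idiom in step 3 can be tried out without running IQ-TREE. The sketch below applies it to a mock report file. The helper name `extract_matrix` is hypothetical, and the mock layout (marker line, one blank line, then the matrix block) is an assumption made for illustration; a real `.iqtree` report should be checked before relying on these offsets.

```shell
#!/bin/sh
# Sketch of the "step 3" extraction on a mock file (no IQ-TREE run needed).
# extract_matrix is a hypothetical helper; the mock report layout (marker
# line, one blank line, then the matrix block) is an assumption.
extract_matrix() {
    report="$1"   # report file to search
    nlines="$2"   # number of lines in the matrix block
    # -A prints the marker line plus the next nlines+1 lines (blank + block);
    # tail then keeps only the block itself.
    grep -A "$((nlines + 1))" "can be used as input for IQ-TREE" "$report" |
        tail -n "$nlines"
}

# Build a mock report: header text, marker, blank line, 20 numbered rows.
mock="$(mktemp)"
out="$(mktemp)"
{
    echo "some earlier report text"
    echo "This matrix can be used as input for IQ-TREE:"
    echo ""
    i=1
    while [ "$i" -le 20 ]; do
        echo "row $i"
        i=$((i + 1))
    done
    echo "later report text"
} > "$mock"

extract_matrix "$mock" 20 > "$out"
wc -l < "$out"
```

With a block length of 21 instead of 20, the same helper reproduces the `grep -A 22 ... | tail -n21` form used for the non-reversible matrix.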



Estimating a non-reversible model
---------------------------------

QMaker assumes time-reversible models, an assumption designed for computational convenience but not for biological reality. A variant of QMaker, called nQMaker ([Dang et al., 2022]), can estimate _non-reversible_ models and _rooted_ trees from any set of protein alignments.

If you use nQMaker or any new non-reversible models (NQ.pfam, NQ.plant, NQ.mammal, NQ.bird, NQ.insect, NQ.yeast), please cite:

> Cuong Cao Dang, Bui Quang Minh, Hanon McShea, Joanna Masel, Jennifer Eleanor James, Le Sy Vinh, and Robert Lanfear (2022), nQMaker: estimating time non-reversible amino acid substitution models. _Systematic Biology_ 71: 1110–1123. <https://doi.org/10.1093/sysbio/syac007>


To estimate a non-reversible model for a concatenated alignment, you can use the `--model-joint NONREV+FO` option instead of `--model-joint GTR20+FO`:

    # step 1: infer a single edge-linked tree with reversible models as initial models
@@ -155,9 +155,9 @@ The resulting `NQ.plant` matrix may now look like:

0.076646 0.049413 0.038372 0.049451 0.010780 0.037824 0.063761 0.052468 0.015186 0.065770 0.104298 0.072672 0.019435 0.049968 0.035179 0.078294 0.045874 0.013490 0.033377 0.087742

> HINT: To assess the statistical support of the root position with bootstrapping (`-B 1000` option), users can use [this tutorial](Rootstrap).

To estimate a non-reversible model from a folder of alignments:

    # step 1: infer a separate tree for each alignment with reversible models as initial models
    iqtree2 -seed 1 -T AUTO -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant
@@ -169,43 +169,45 @@ To estimate a non-reversible model from a folder of alignments:
    grep -A 22 "can be used as input for IQ-TREE" train_plant.NONREV.iqtree | tail -n21 > NQ.plant


Estimating linked exchangeabilities
-----------------------------------

Starting with version 2.3.1, IQ-TREE allows users to estimate linked exchangeabilities under [profile mixture models](Substitution-Models#protein-mixture-models).

To start with, we show an example:

    iqtree2 -s <alignment> -m GTR20+C60+G4 --link-exchange-rates -te <guide_tree> -me 0.99

Here, IQ-TREE applies a (freely estimated) 20x20 rate matrix `GTR20` with the
[profile mixture model](Substitution-Models#protein-mixture-models) `C60` (other models such as C10 can also be used) and Gamma rate heterogeneity across sites. The option `--link-exchange-rates` tells
IQ-TREE to link the GTR20 rates across all 60 mixture classes; without this option,
IQ-TREE will estimate 60 separate GTR20 matrices.
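
To get a feel for what linking saves, the arithmetic can be done in a few lines of shell. This is a rough count of exchangeability entries only, assuming 20 amino-acid states and the 60 classes of C60; it ignores normalisation constraints, so the number of free parameters actually optimised may differ slightly.

```shell
#!/bin/sh
# Back-of-the-envelope count of exchangeability entries (an illustration,
# not output from IQ-TREE).
states=20
classes=60                                   # C60 has 60 mixture classes
per_matrix=$((states * (states - 1) / 2))    # symmetric 20x20 matrix: 190
unlinked=$((per_matrix * classes))           # one GTR20 matrix per class
echo "linked: $per_matrix  unlinked: $unlinked"
```

Under these assumptions the linked model has 190 exchangeabilities, against 11400 if every class had its own matrix.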

The other options are not mandatory but meant to speed up this process:

* The `-te` option provides a _guide tree_, which is fixed throughout the estimation. This guide tree can be obtained beforehand from, for example, LG+C60+G or the simpler LG+G. Without this option, IQ-TREE will invoke a full tree search intertwined with model estimation, which may become very time-consuming for large datasets.

* `-me 0.99` sets the log-likelihood difference threshold for determining convergence: a higher value makes the optimisation faster. Simulations have shown that changing this parameter has no significant effect on exchangeability estimation.


This command will produce an output file with suffix `.GTRPMIX.nex`. This file contains the optimized exchangeabilities in NEXUS format, which can be applied in later analyses (without re-estimating them) to reconstruct a tree, for example:

    iqtree2 -s <alignment> -mdef <.GTRPMIX.nex file> -m GTRPMIX+C60+G4

The optimizer in IQ-TREE by default initializes all exchangeability rates to be equal, which is the least biased choice but may make the subsequent optimization quite slow. If users have a good guess of the rate values, the option `--gtr20-model` can be used. For example, `--gtr20-model LG` will initialize the exchangeabilities to those
of the LG model before optimization. Choosing good starting values can make estimation considerably faster. Apart from LG, users can specify any matrix, including those defined by the `-mdef` option with a [NEXUS model file](Complex-Models#nexus-model-file). Another use of this option is to _test the robustness_ of the optimizer with different starting points.
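
A robustness check of this kind could be set up by launching the same estimation from several starting matrices and comparing the results. The dry-run sketch below only prints one command per starting point. The helper name `print_start_runs`, the choice of LG, WAG and JTT as starting matrices, the file names, and the `start_` prefixes are all illustrative assumptions; the options themselves are the ones described in this section.

```shell
#!/bin/sh
# Dry-run sketch: print one command per starting matrix for --gtr20-model.
# The list of starting matrices and the output prefixes are illustrative.
print_start_runs() {
    aln="$1"    # alignment file
    tree="$2"   # fixed guide tree
    for start in LG WAG JTT; do
        echo "iqtree2 -s $aln -m GTR20+C60+G4 --link-exchange-rates" \
             "--gtr20-model $start -te $tree -me 0.99 --prefix start_$start"
    done
}

print_start_runs alignment.fa guide.tree
```

If the runs converge to similar exchangeabilities and log-likelihoods, that suggests the optimizer is not sensitive to the starting point for the dataset at hand.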

Note that the user can estimate exchangeabilities jointly with the weights of the profiles, branch lengths, and rates. This can be very time-consuming. If the goal is to optimize exchangeabilities, one can fix the other parameters to reasonable estimates (e.g., fixing branch lengths and rates has been shown to perform adequately for the estimation of exchangeabilities).

Because these routines can be computationally expensive, two exchangeability matrices estimated from large concatenated phylogenomic supermatrices under the C60 profile mixture model are provided for use in phylogenetic analyses. One, called Eukaryotic Linked Mixture (ELM), is designed for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes; the other, Eukaryotic and Archaeal Linked mixture (EAL), is designed for reconstructing relationships between eukaryotes and Archaea. See [Protein models](Substitution-Models#protein-models).

If you use this routine in a publication please cite:

> __H. Banos et al.__ (2024) GTRpmix: A linked general-time reversible model for profile mixture models. _BioRxiv_. <https://doi.org/10.1101/2024.03.29.587376>

[Dang et al., 2022]: https://doi.org/10.1093/sysbio/syac007
[Minh et al., 2021]: https://doi.org/10.1093/sysbio/syab010
[Naser-Khdour et al., 2021]: https://doi.org/10.1093/sysbio/syab067
[El-Gebali et al., 2018]: https://doi.org/10.1093/nar/gky995
7 changes: 5 additions & 2 deletions doc/Substitution-Models.md
@@ -1,8 +1,8 @@
---
layout: userdoc
title: "Substitution Models"
author: Cuongbb, Heiko Schmidt, Jana Trifinopoulos, Minh Bui, Trongnhan Uit
date: 2022-05-31
author: 95438353+HectorBanos, Cuongbb, Heiko Schmidt, Jana Trifinopoulos, Minh Bui, Trongnhan Uit
date: 2024-05-14
docid: 10
icon: book
doctype: manual
@@ -165,6 +165,8 @@ IQ-TREE supports all common empirical amino-acid exchange rate matrices (alphabe
| cpREV | chloroplast |chloroplast matrix ([Adachi et al., 2000]). |
| Dayhoff | nuclear | General matrix ([Dayhoff et al., 1978]). |
| DCMut | nuclear | Revised `Dayhoff` matrix ([Kosiol and Goldman, 2005]). |
| EAL | nuclear | General matrix. To be used with profile mixture models (e.g. EAL+C60) for reconstructing relationships between eukaryotes and Archaea ([Banos et al., 2024]). |
| ELM | nuclear | General matrix. To be used with profile mixture models (e.g. ELM+C60) for phylogenetic analysis of proteins encoded by nuclear genomes of eukaryotes ([Banos et al., 2024]). |
| FLAVI | viral | Flavivirus ([Le and Vinh, 2020]). |
| FLU | viral | Influenza virus ([Dang et al., 2010]). |
| GTR20 | general | General time reversible models with 190 rate parameters. *WARNING: Be careful when using this parameter-rich model as parameter estimates might not be stable, especially when not having enough phylogenetic information (e.g. not long enough alignments).* |
@@ -411,6 +413,7 @@ Users can fix the parameters of the model. For example, `+I{0.2}` will fix the p
[Abascal et al., 2007]: https://doi.org/10.1093/molbev/msl136
[Adachi and Hasegawa, 1996]: https://doi.org/10.1007/BF02498640
[Adachi et al., 2000]: https://doi.org/10.1007/s002399910038
[Banos et al., 2024]: https://doi.org/10.1101/2024.03.29.587376
[Bielawski and Gold, 2002]: https://doi.org/10.1093/genetics/161.4.1589
[Dang et al., 2010]: https://doi.org/10.1186/1471-2148-10-99
[Dang et al., 2022]: https://doi.org/10.1093/sysbio/syac007
Binary file modified doc/iqtree-doc.pdf
Binary file not shown.
