Phylosandbox

Phylosynth

We are marching on Phylosynth project! Some primary data, code and results will share here. Our goal is reconstructing a larger-scale plant Tree of Life for all seed plants (Spermatophyta), using methods described in Smith and Brown (2018) and ideas described in Eiserhardt et al. (2018; see below), and integrating the phylogenetic backbone from the Plant and Fungal Trees of Life Project (PAFTOL) and robust taxonomy database from World Checklist of Selected Plant Families (WCSP). We endeavor to push the boundary of the knowledge of Tree of Life, keeping this tree portable and dynamically updated, providing knowledge of the plant tree of life to science community and the public education.

Some key feactures:

Flexible

  Easy for other pipelines to integrate

Dynamically updated

   Establish a schedule for running this pipeline at regular intervals, producing up-to-date trees. For this, we need to decide an initial frequency for generating trees. This frequency can later be adjusted based on download statistics and user feedback.

Portable for different audiences

   Establish one or more outlet(s) for PhyloSynth trees. This needs to take into consideration where different audiences would be looking for trees, and ensure (for scientific audiences) that there is a citable paper.

High quality

   Build module that maps NCBI taxonomy to a widely accepted botanical taxonomy. This should in the first place be the WCSP/”names backbone” at Kew, but we need to consider the fact that other lists are in circulation.  
  
  Build a module that filters NCBI data automatically according to certain rules. This could be a simple decision tree based on metadata, or a more complex machine learning approach. 
  
  Build a module that evaluates resulting trees automatically using a set of statistics. This could include, among other things, monophyly statistics for higher ranks from the taxonomy used (genera and families in the case of WCSP).  

  Establish a procedure for manual quality control by taxon experts. This would need to include a procedure for storing decisions/annotations and avoiding duplication of effort.  
  
  Establish a procedure for user feedback. This would need to include a procedure for storing decisions/annotations and avoiding duplication of effort.

The general workingflow is outlined below.

All the scripts can be found here.

Task CheckList

Taxonomic databases
- WCSP
  #meta data "meta.xml"
  # Column0=taxonID
  # Column1=modified
  # Column2=verbatimTaxonRank
  # Column3=scientificName
  # Column4=family
  # Column5=genus
  # Column6=specificEpithet
  # Column7=infraspecificEpithet
  # Column8=scientificNameAuthorship
  # Column9=nomenclaturalStatus
  # Column10=rightsHolder
  # Column11=namePublishedInYear
  # Column12=nomenclaturalCode
  # Column13=taxonRemarks
  # Column14=bibliographicCitation
  # Column15=language
  # Column16=class
  # Column17=references
  # Column18=license
  # Column19=rights
  # Column20=namePublishedIn
  # Column21=taxonRank
  # Column22Plantae=kingdom
  # Column23=phylum
  # Column24=parentNameUsageID
  # Column25=acceptedNameUsageID#
  # Column26=originalNameUsageID
  # Column27=taxonomicStatus#
  # Column28=source
  - We only keep these columns from WCSP for downstream mapping:
    taxonID, verbatimTaxonRank, scientificName, genus, specificEpithet, infraspecificEpithet, scientificNameAuthorship, family, acceptedNameUsageID, taxonomicStatus
    WCSP_database.R
- NCBI taxonomy
  - Made new NCBI databased using phlawd_db_maker:
    phlawd_db_maker pln /data_vol/miao/plnDB20191101/plnDB20191101.db
  - Modified the PyPHLAWD get_ncbi_tsv.py script as get_ncbi_tsv_miao.py, by adding in more detailed filters in SQL syntax to get a semi-clean Spermatophyta58024 taxonomy and with authority information:
    
    c.execute("select ncbi_id,parent_ncbi_id,name,node_rank from taxonomy WHERE (name_class = 'scientific name' OR name_class = 'authority') \ AND (node_rank = 'no rank' OR node_rank = 'order' OR node_rank = 'family' OR node_rank = 'genus' OR node_rank = 'species' OR node_rank = 'subspecies' \ OR node_rank = 'varietas' OR node_rank = 'forma')")
  - Clean and reformat the NCBI taxonomy
    
    Spermatophyta_plnDB_cleanerV1.1.sh
    Details see the comments inside the script.
    e.g., Remove species with cf. and aff. tags
    Remove Genus_sp._Collection#
    Keep it if the frequency of the genus equals to 1, for place holder
    
    remove_duplicate.py
    Remove duplicate lines
    
    Spermatophyta_sp_authority_format.py
    This script will make a serials of decisions on each line of taxonomic information, and then reformat it based on Authority, species, subspecies, varietas, forma etc.
    Challenge:
    454232,Cucumis_x_Cucurbita,genus
    2005747,Parasponia_x_Trema,genus
    
    Attached decision tree diagram later.
- Mapping tactics
  - NCBI_WCSP_Taxonomy_Merge.R
    - Mapping based on species names
    - If the same name different ids, then decision depends on Authority
    - Syn names pointting the Accepted names
  - Other python script?
[] Data mining

[] Molecular data
- GenBank data mining using PyPHLAWD
- Choose gene with most coverage (the good clusters)
- Sequence length
- Alternative --- supersmartR
  o. Comparison of supersmartR and PyPHLAWD by Dom Bennet, Miao Sun, and Wolf Eiserhardt
Fossil data
- Magallón et al. (2015)
- TimeTree
- Other secondary calibration points
[] Environmental/Traits data
- WorldClim?
- Tropicality?

[] Data cleaning and evaluation
- Sequence data cleaning
- Taxonomic names cleaning
- Topology constraint
- Monophyletic constraint
[] Phylogeny and dating
- Adding constraint tree from the Plant and Fungal Trees of Life Project (PAFTOL) []
- Updated the phylogeny with Pyphlawd data and constraint tree
  - RAxML-NG #more efficient when conducting constrained analyses
  - IQTree #has several branch length options when concatenating
- Building subset trees at family level/or order level (depends on group size)
- All names need to be validated by WCSP
- Dating use treePL
- Calibration points mentioned above (Fossil data) and here
- Multiple secondary calibration points provided by Congruification
[] Tree Grafting/Subsetting
- OpenTree PY Toys
[] Psedu-posterior
- TACT (Taxonomic Addition for Complete Trees):
  A new stochastic polytomy resolution method that uses birth-death-sampling estimators across an ultrametric phylogeny to estimate branching times for unsampled taxa, with taxonomic information to compatibly place new taxa onto a backbone phylogeny.
  
  TACT is also used in Rabosky et al. (2018)
[] How to integrate uncertainty

Other uncertainty test:
- Quartet Sampling from Pease et al. (2018)

Other stuff

Required more thinking
- Species tree approach (ASTRAL) - reconstruct cluster trees first, then species tree
  - Can this species tree has a branch length? How? What's meaning/represnting?
  - Support value? and uncertainty?
- Back-filling unsequenced species
- Disseminating trees:
  - GIT-based repository for trees (PhyloSynth repo)
  - Hosted on Zenodo with doi for each tree
  - Push trees to Github, perform continuous integration tests (Travis) on trees and metadata, trying to catch potential issues (incongruence with taxonomy, changes compared to last version, abnormally long branches etc.)
  - Tree viewer? (e.g., Dendroscope3)
- Service portal
  - Name checking service
  - Coommon namce and nickname (public not necessary know/use the scientific names)
    example: Phylotastic
Subtle:

Lists and tables better use ".csv" format put into "data" folder
References and other Documents put into "reference" folder
Likewise scripts go to "script" foder
Draw a diagram for workingflow
More detailed description for methods used

Last update:
Fri Jan 14 14:33:34 2020

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
data		data
image		image
results		results
script		script
LICENSE		LICENSE
Phylosynth.Rproj		Phylosynth.Rproj
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Phylosandbox

Phylosynth

Some key feactures:

Flexible

Dynamically updated

Portable for different audiences

High quality

Task CheckList

Other stuff

About

Releases

Packages

Languages

License

phylosynth/Phylosandbox

Folders and files

Latest commit

History

Repository files navigation

Phylosandbox

Phylosynth

Some key feactures:

Flexible

Dynamically updated

Portable for different audiences

High quality

Task CheckList

Other stuff

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages