Pipeline to place taxa in a sequence alignment and phylogeny using NGS reads

McTavishLab/extensiphy

Extensiphy

Extensiphy is a pipeline that assembles homologous loci by aligning reads to a reference from a multiple sequence alignment, calling a consensus sequence, and adding that sequence to the existing alignment. Homologous loci may be kept concatenated or split back into individual alignments prior to phylogenetic estimation.

Figure: Extensiphy workflow

Extensiphy takes an alignment and sets of sequencing reads from query taxa (a). Reads are aligned to a reference sequence and a consensus sequence is called (b). The new sequence is added to the alignment and the updated alignment is used to estimate a phylogeny (c).

Setup and Use

Building and testing your own Extensiphy Docker image

Extensiphy Controls and Flags

Output Files

Phylogenetic Estimation

Additional Software

Dependencies

Reporting Problems

Setup and Use

Docker

The simplest and most hassle-free way to run Extensiphy is using Docker. The Building and testing your own Extensiphy Docker image section reviews the Docker installation instructions. Docker is not recommended on new Macs with Apple silicon chips; use the Anaconda install instructions instead.

Anaconda

You can also install the dependencies of Extensiphy using Anaconda. The Anaconda Installation section of this repository will walk through this process in more detail.

Advanced

If you're comfortable installing programs by hand, the Advanced Installation Methods section is for you. These methods have mostly been tested only on Linux (Ubuntu) operating systems.

Extensiphy Tutorial

We recommend you run through the Extensiphy tutorial for a more in-depth walkthrough of Extensiphy's features. The tutorial will walk through how to run Extensiphy using different data types and options. You can copy code snippets into your terminal window.

Additional Tutorials

To help explain some of the jargon (technical words and terms) that goes along with bioinformatics programs, we've also included some other tutorials.

  • The command line tutorial will help you get a grasp on how to find files in your computer using the shell/terminal/command line (you'll be a hacker in no time!).
  • The suffix tutorial will help clarify the read suffix arguments.

Building and testing your own Extensiphy Docker image

First we'll build the Docker image and a container to test your Extensiphy installation. Then we'll connect your data to a new container so you can begin updating your own alignments!

  1. Make sure you have Docker installed according to your operating system.

  2. To pull the Docker build of Extensiphy, run this command.

docker pull mctavishlab/extensiphy
  3. We'll build your Extensiphy Docker container using this command.
  • -i makes the container interactive.
  • -t allocates a terminal (TTY) for the container. The image to use as a template, mctavishlab/extensiphy, is named after the flags.
  • --name specifies the container name.
docker run --name ep_container -i -t mctavishlab/extensiphy bash

Your command line prompt should change to indicate that you are now working inside your Extensiphy container.

You can exit the docker container by typing exit.

To restart it and return to interactive analyses, run:

docker container restart ep_container
docker exec -it ep_container bash

Quick test run

If you have followed one of the install approaches above, you are now ready to try a test run!
We'll use the combo.fas alignment file as our starting alignment. combo.fas can be found in:

/extensiphy/testdata/combo.fas

Now, either from the docker container, your anaconda env, or from the directory where you installed Extensiphy, run:

./extensiphy.sh -a ./testdata/combo.fas -d ./testdata -1 _R1.fq -2 _R2.fq -u PHYLO -o EP_output

This is a simple run on three paired-end read samples, which are found in the directory extensiphy/testdata.

  • The -a flag provides the path to the existing alignment to update.
  • The -d flag provides the path to your directory of fastq files.
  • The -1 and -2 flags specify the filename endings for each of the read files. (defaults are _R1.fq and _R2.fq, more info at https://github.com/McTavishLab/extensiphy/blob/main/tutorial/suffix_tutorial.md)
  • The -u flag specifies what analysis to run. Here we are building a phylogeny. (default is ALIGN, building an alignment only.)
  • The -o flag specifies the output directory. (default is EP_output)
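Because the -1 and -2 suffixes decide which files in the read directory are treated as pairs, it can help to check the pairing before starting a run. The following is a minimal sketch, not part of Extensiphy: list_read_pairs is a hypothetical helper that assumes each sample is named [sample][suffix], as in the test data.

```shell
# Hypothetical helper (not part of Extensiphy): list samples in a read
# directory that have both mates present for the given suffixes.
# Usage: list_read_pairs DIR R1_SUFFIX R2_SUFFIX
list_read_pairs() {
    dir=$1; s1=$2; s2=$3
    for r1 in "$dir"/*"$s1"; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        sample=$(basename "$r1" "$s1")      # strip the R1 suffix to get the sample name
        if [ -e "$dir/$sample$s2" ]; then
            echo "$sample"                  # both mates found
        else
            echo "missing mate for $sample" >&2
        fi
    done
}
```

For the test data this would be called as list_read_pairs ./testdata _R1.fq _R2.fq. Samples with a missing mate are reported on standard error so the list of complete pairs stays clean.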

Once Extensiphy has finished running on the test data, you should see lines saying:

Alignment file is: /project/extensiphy/EP_output/RESULTS/extended.aln

Tree file is: /project/extensiphy/EP_output/RESULTS/RAxML_bestTree.consensusFULL

  • If you did not get this message, check the output log ep_dev_log.txt to learn more about the issue before proceeding.

We just added 3 new taxa to a starting multiple sequence alignment and obtained a tree that includes these new taxa.
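One way to confirm that the new taxa were added is to count the sequences in the starting and extended alignments. This is a rough sketch that assumes both files are in FASTA format, with one ">" header line per taxon; count_taxa and taxa_added are hypothetical helper names.

```shell
# Hypothetical helpers (not part of Extensiphy).
# Count sequences in a FASTA file by counting header lines.
count_taxa() {
    grep -c '^>' "$1" || true    # grep exits non-zero when the count is 0
}

# Print how many more taxa the second alignment has than the first.
taxa_added() {
    before=$(count_taxa "$1")
    after=$(count_taxa "$2")
    echo $((after - before))
}
```

After the test run, taxa_added ./testdata/combo.fas EP_output/RESULTS/extended.aln should report the number of read sets that were added.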

  • If you are using docker - exit the container by typing
exit
  • You can copy the extended tree to your local directory using:
docker cp ep_container:/project/extensiphy/EP_output/RESULTS/RAxML_bestTree.consensusFULL .
  • For a deeper walk through, take a look through the tutorial.

  • To get right down to business and update your own alignment, continue to the next section.

Using Extensiphy on your own data

We'll use brackets [] to indicate variables you should replace with your own files or paths. Replace the [stuff inside the brackets] with the appropriate paths and folder names you've used so far.

If you have installed Extensiphy locally, you can just pass in the paths to your data, and run the analysis.

./extensiphy.sh -a [path to your_input_alignment] -d [path to your_directory_of_reads] -1 [r1_suffix] -2 [r2_suffix] -u [either PHYLO or ALIGN, depending on if you want a phylogeny or just an alignment] -o [your_output_dir]
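Before launching a long run on your own data, a quick check of the inputs can catch mistyped paths early. The sketch below is an illustration only: check_ep_inputs is a hypothetical helper, and it tests just the three things stated in this README (the alignment is a file, the reads are in a directory, and -u is ALIGN or PHYLO).

```shell
# Hypothetical pre-flight check (not part of Extensiphy).
# Usage: check_ep_inputs ALIGNMENT READ_DIR MODE
# Returns 0 if all checks pass, 1 otherwise.
check_ep_inputs() {
    aln=$1; reads=$2; mode=$3
    status=0
    [ -f "$aln" ]   || { echo "alignment not found: $aln" >&2; status=1; }
    [ -d "$reads" ] || { echo "read directory not found: $reads" >&2; status=1; }
    case $mode in
        ALIGN|PHYLO) ;;                      # the two values accepted by -u
        *) echo "-u must be ALIGN or PHYLO, got: $mode" >&2; status=1 ;;
    esac
    return "$status"
}
```

A typical use is to chain it in front of the real command, for example check_ep_inputs my_alignment.fas my_reads PHYLO && ./extensiphy.sh ..., so that Extensiphy only starts when the inputs look sane.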

If you are using docker, it is simplest to link your data directory to a new container.

Put the input alignment and raw reads you want to align in a directory. e.g. [my_data_dir]

We'll build a new Extensiphy Docker container and connect the directory containing your data to the container.

docker run --name ep_container_link -i -t -v [/path/to/my_data_dir]:/project/linked_data mctavishlab/extensiphy bash

This shares the 'my_data_dir' folder between your operating system and the docker container. (In this example it is named "my_data_dir" locally and "linked_data" in your docker container, but you can name them the same thing in both places if you prefer.)

Now you can run extensiphy.sh as earlier but we'll specify the directory where your data is located.

./extensiphy.sh -a /project/linked_data/[alignment_file] -d /project/linked_data -1 [suffix_1] -2 [suffix_2] -o linked_data/[output_dir_name]

By putting the outputs into the linked directory, you can access them directly through your operating system without having to copy them.

Extensiphy Controls and Flags

Required flags

- (-a) alignment in fasta format,
- (-d) directory of paired end fastq read files for all query taxa,
- (-u) produce only an updated alignment or perform full phylogenetic estimation (ALIGN or PHYLO) (DEFAULT: ALIGN),

Optional flags

- (-t) tree in Newick format produced from the input alignment that you wish to update with new sequences or specify NONE to perform new inference (DEFAULT: NONE),
- (-1, -2) suffix (ex: R1.fastq or R2.fastq) for both sets of paired end files. Required if suffix is different than default (DEFAULTS: R1.fq and R2.fq),
- (-m) alignment type (SINGLE_LOCUS_FILES, PARSNP_XMFA or CONCAT_MSA) (DEFAULT: CONCAT_MSA),
- (-o) directory name to hold results (DEFAULT: creates EP_output),
- (-r) select a reference sequence from the alignment file for read mapping, or leave as default and the first sequence in the alignment will be chosen (DEFAULT: RANDOM),
- (-p) number of taxa to process in parallel,
- (-c) number of threads per taxon being processed,
- (-e) set read type as single-end (SE) or paired-end (PE) (DEFAULT: PE),
- (-g) output format (CONCAT_MSA or SINGLE_LOCUS_FILES) (DEFAULT: CONCAT_MSA),
- (-s) specify the suffix (.fa, .fasta, etc) (DEFAULT: .fasta),
- (-b) bootstrapping tree ON or OFF (DEFAULT: OFF)
- (-i) set whether to clean up intermediate output files to save disk space (KEEP or CLEAN) (DEFAULT: KEEP)

If using single locus MSA files as input:
- (-f) csv file name to keep track of individual loci when concatenated (DEFAULT: loci_positions.csv),
- (-n) set the minimum locus size cutoff used as input or output (Options: int number) (DEFAULT: 700)
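The -p and -c flags interact: if each taxon processed in parallel uses its own -c threads, a run uses roughly p × c threads in total. That reading is an assumption based on the flag descriptions above, not a documented guarantee. Under it, a hypothetical helper such as taxa_in_parallel can pick -p so the product stays within the available cores.

```shell
# Hypothetical helper (not part of Extensiphy): choose a value for -p so that
# p * c does not exceed the number of available cores.
# Usage: taxa_in_parallel TOTAL_CORES THREADS_PER_TAXON
taxa_in_parallel() {
    cores=$1; per_taxon=$2
    p=$((cores / per_taxon))     # integer division rounds down
    [ "$p" -ge 1 ] || p=1        # always process at least one taxon
    echo "$p"
}
```

For example, with 8 cores and -c 2 the helper prints 4, which could be passed as -p 4 -c 2. On Linux the core count can be read with nproc.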

Output Files

  • Concatenated alignment file: found in your output folder
[OUTDIR]/RESULTS/extended.aln
  • Phylogeny in newick file format (if you selected to output a phylogeny): found in your output folder
[OUTDIR]/RESULTS/RAxML_bestTree.consensusFULL
  • Taxon specific intermediate files (if you kept intermediate files): found in your output folder
[OUTDIR]/[TAXON_NAME]

The .sam, .bam and .vcf files for each taxon can be found here for any additional analyses.
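The result paths listed above can be checked from a script once a run finishes. This is a small sketch using only the two file names given in this section; check_ep_outputs is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Extensiphy): report which of the documented
# result files exist and are non-empty under an output directory.
# Usage: check_ep_outputs OUTDIR
check_ep_outputs() {
    outdir=$1
    for f in RESULTS/extended.aln RESULTS/RAxML_bestTree.consensusFULL; do
        if [ -s "$outdir/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
        fi
    done
}
```

A missing tree file is expected when the run used -u ALIGN, since only the alignment is produced in that mode.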

Phylogenetic Estimation

Extensiphy is targeted towards producing an updated sequence alignment and allowing users to use the alignment with any phylogenetic estimation method they choose. We provide a phylogenetic estimation as a convenience but you are in no way locked into using this estimation method. You can simply take the alignment output by Extensiphy and use that alignment as input for your favorite estimation method. A few notes:

  • Currently, phylogenetic estimation with Extensiphy is performed by RAxML using the GTR model. This setting cannot be changed at this time.

  • To avoid estimating a phylogeny using the packaged RAxML program and settings, use the -u ALIGN option when running Extensiphy.

  • To use an alternative method of phylogenetic estimation, when an Extensiphy run is complete, the [OUTDIR]/RESULTS/extended.aln file should be used as input for your chosen estimation method.

  • If you wish to use multiple single locus alignment files as input to another estimation method, please see the tutorial for more information on updating single locus alignments.

Additional Software

Extensiphy is the primary program of this software package. However, another piece of software is included: Gon_phyling. Gon_phyling is a piece of software for building starting alignments and phylogenies when you only have raw-read fastq files. Gon_phyling isn't the focus of this package, but we provide it in case you might find it useful. Check out the program and README in the gon_phyling directory.

Dependencies

Dependencies (Separate programs you'll need to install):

  1. Python 3
  2. bwa-mem2
  3. RAxMLHPC
  4. Seqtk
  5. Samtools
  6. Bcftools
  7. Fastx toolkit
  8. Dendropy

Reporting Problems

Software will have bugs. We try to address issues with Extensiphy as they arise. If you run into an issue, please report it using Extensiphy's Issue Tracker. You can also search the Issue Tracker for fixes to previously identified issues. Finally, you can contact us at [email protected] to discuss any problems with installing or running Extensiphy.
