author | startdate | lastdate |
---|---|---|
Louis du Plessis | 2018/07/29 | 2018/07/29 |
Steps to extract sequencing data from an XML file, partition it into coding and noncoding regions, and remove columns with too many unknowns.
Makona_1610_cds_ig.xml
: Analysis file from Dudas et al. paper.
Makona_1610_cds_ig.fas
: Sequence data from analysis file from Dudas et al. paper in FASTA format.
Makona_1610_cds.fas
: Coding region of the alignment.
Makona_1610_ig.fas
: Noncoding region of the alignment.
Makona_1610_cds.trimmed.fas
: Coding region of the alignment with columns trimmed (0 removed).
Makona_1610_ig.trimmed.fas
: Noncoding region of the alignment with columns with more than 95% unknowns removed (36 removed).
The raw alignment has coding and noncoding regions interspersed, without metadata about gene starts and ends. Instead of using the GenBank reference alignment, use BEASTGen to extract the alignment from the BEAST XML file used in "Virus genomes reveal factors that spread and sustained the Ebola epidemic", Nature, 2017 (Dudas et al.).
# Run from templates directory (beastgen struggles when template not in path)
java -jar ~/Documents/Projects/BEAST1/beast-mcmc/dist/beastgen.jar ../templates/to_fasta.template ../data/sequence/Analyses/Phylogenetic/Makona_1610_cds_ig.xml ../results/datasets/Makona_1610_cds_ig.fas
Split the alignment manually at position 14518 (coding and noncoding) using AliView.
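The split was done by hand in AliView, but the same operation can be sketched in Python for reproducibility. This is only an illustration: `read_fasta` and `split_alignment` are hypothetical helpers, not part of this repository, and the sketch assumes that the given position is the last column of the left (coding) part, which this README does not state explicitly.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_alignment(records, position):
    """Split each sequence after `position` columns (1-based count of left-hand columns)."""
    left = [(name, seq[:position]) for name, seq in records]
    right = [(name, seq[position:]) for name, seq in records]
    return left, right


# Toy example: split a 6-column alignment after column 4.
toy = read_fasta([">a", "ACGTNN", ">b", "ACG", "TAC"])
cds, ig = split_alignment(toy, 4)
print(cds)  # [('a', 'ACGT'), ('b', 'ACGT')]
print(ig)   # [('a', 'NN'), ('b', 'AC')]
```

For the real data the call would be `split_alignment(records, 14518)`, under the same assumption about which side column 14518 belongs to.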
Remove all columns with more than 95% unknowns (> 1529/1610 sequences).
- None in coding region
- 36 in noncoding region
python msahist.py -i ../results/datasets/Makona_1610_cds.fas -o ../results/datasets/
python msahist.py -i ../results/datasets/Makona_1610_ig.fas -o ../results/datasets/
python trimsequences.py -a ../results/datasets/Makona_1610_cds.fas -H ../results/datasets/Makona_1610_cds.hist.csv -o ../results/datasets/ -c 0.95 -p Makona_1610_cds.trimmed
python trimsequences.py -a ../results/datasets/Makona_1610_ig.fas -H ../results/datasets/Makona_1610_ig.hist.csv -o ../results/datasets/ -c 0.95 -p Makona_1610_ig.trimmed
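As a quick check of the cutoff arithmetic, the snippet below computes the smallest number of unknowns that puts a column strictly above the cutoff. The function name is hypothetical, and the strict "more than" comparison is taken from the wording above, not from the source of `trimsequences.py`.

```python
import math


def min_unknowns_for_removal(n_sequences, cutoff):
    """Smallest count of unknowns that strictly exceeds cutoff * n_sequences."""
    return math.floor(cutoff * n_sequences) + 1


# 0.95 * 1610 = 1529.5, so a column needs at least 1530 unknowns to be removed,
# which matches the "> 1529/1610 sequences" rule quoted above.
print(min_unknowns_for_removal(1610, 0.95))  # 1530
```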
msahist.py
- Create a histogram of characters at each site in an alignment as a .csv file. (Assumes a fixed alphabet of possible characters).
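A minimal sketch of what a per-site histogram could look like is given below. It is not the code of `msahist.py`: the alphabet `ACGTN-?`, the CSV layout (a `site` column followed by one column per character), and the function names are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Assumed alphabet; the alphabet used by the real script is not given in this README.
ALPHABET = "ACGTN-?"


def site_histogram(sequences, alphabet=ALPHABET):
    """Return one row of character counts per alignment column.

    Sequences are assumed to be aligned (equal length); characters outside
    the alphabet are ignored.
    """
    rows = []
    for column in zip(*sequences):
        counts = Counter(ch.upper() for ch in column)
        rows.append([counts.get(ch, 0) for ch in alphabet])
    return rows


def write_histogram_csv(rows, handle, alphabet=ALPHABET):
    """Write a header row, then one row per site (1-based site index first)."""
    writer = csv.writer(handle)
    writer.writerow(["site"] + list(alphabet))
    for site, row in enumerate(rows, start=1):
        writer.writerow([site] + row)


# Toy alignment with 3 sequences and 4 columns.
toy = ["ACN-", "ACNN", "AGN?"]
hist = site_histogram(toy)
buffer = io.StringIO()
write_histogram_csv(hist, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; a real run would pass an open file handle instead.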
trimsequences.py
- Remove positions in the alignment marked in a separate fasta file (usually by "X")
- Remove all sites in the alignment with more than some cutoff of ambiguous characters (requires a histogram file).
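The two ways of choosing sites to remove can be sketched as follows. Both functions are hypothetical and only illustrate the idea; in particular, how `trimsequences.py` reads the mask file or defines an "ambiguous" character is not documented here, so the per-column unknown counts are simply passed in as a list.

```python
def sites_from_mask(mask, marker="X"):
    """0-based indices of columns flagged with `marker` in a mask sequence."""
    return [i for i, ch in enumerate(mask) if ch == marker]


def sites_over_cutoff(unknown_counts, n_sequences, cutoff):
    """0-based indices of columns whose fraction of unknowns strictly exceeds `cutoff`."""
    return [i for i, count in enumerate(unknown_counts) if count > cutoff * n_sequences]


# Toy example with 4 sequences and 5 columns.
mask = "..X.."
unknown_counts = [0, 4, 1, 3, 4]  # unknowns per column, e.g. summed from a histogram file
to_remove = sorted(
    set(sites_from_mask(mask)) | set(sites_over_cutoff(unknown_counts, 4, 0.75))
)
print(to_remove)  # [1, 2, 4]
```

Column 3 (index 3) has exactly 75% unknowns and is kept, because the comparison is strict.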
dropcols.py
- Used by trimsequences.py to remove sites from the alignment. Can also be used independently.
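Removing a set of columns from an alignment can be sketched in a few lines. This is an illustration only, not the interface of `dropcols.py`; the function name and the use of 0-based column indices are assumptions.

```python
def drop_columns(sequences, columns):
    """Return copies of the sequences with the given 0-based columns removed."""
    drop = set(columns)
    return ["".join(ch for i, ch in enumerate(seq) if i not in drop) for seq in sequences]


# Toy example: remove the all-N third column (index 2).
toy = ["ACNGT", "ACNGA", "AGN-T"]
print(drop_columns(toy, [2]))  # ['ACGT', 'ACGA', 'AG-T']
```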