author	startdate	lastdate
Louis du Plessis	2018/07/29	2018/07/29

Summary

Steps to extract the sequence data from a BEAST XML file, partition the alignment into coding and noncoding regions, and remove columns with too many unknowns.

Input

Makona_1610_cds_ig.xml: Analysis file from Dudas et al. paper.

Output

Makona_1610_cds_ig.fas: Sequence data from the Dudas et al. analysis file, in FASTA format.
Makona_1610_cds.fas: Coding region of the alignment.
Makona_1610_ig.fas: Noncoding region of the alignment.
Makona_1610_cds.trimmed.fas: Coding region of the alignment with columns with more than 95% unknowns removed (0 removed).
Makona_1610_ig.trimmed.fas: Noncoding region of the alignment with columns with more than 95% unknowns removed (36 removed).

Extract sequencing data

The raw alignment has coding and noncoding regions interspersed, without metadata about gene starts and ends. Instead of using the GenBank reference alignment, use BEASTGen to extract the alignment from the BEAST XML file used in "Virus genomes reveal factors that spread and sustained the Ebola epidemic", Nature, 2017 (Dudas et al.).

	# Run from templates directory (beastgen struggles when template not in path)

	java -jar ~/Documents/Projects/BEAST1/beast-mcmc/dist/beastgen.jar ../templates/to_fasta.template ../data/sequence/Analyses/Phylogenetic/Makona_1610_cds_ig.xml ../results/datasets/Makona_1610_cds_ig.fas

Split the alignment manually at position 14518 (into coding and noncoding regions) using AliView.
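
The split above was done by hand in AliView. For reference, a scripted equivalent could look like the minimal sketch below. It is not part of the original workflow: the helper names `read_fasta`, `write_fasta` and `split_alignment` are hypothetical, and treating 14518 as the last coding column (so the coding partition has 14518 columns) is an assumption that should be checked against the alignment.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, path):
    """Write (header, sequence) tuples to a FASTA file, one line per sequence."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")


def split_alignment(records, n_coding_cols):
    """Split every sequence into its first n_coding_cols columns and the rest."""
    coding = [(h, s[:n_coding_cols]) for h, s in records]
    noncoding = [(h, s[n_coding_cols:]) for h, s in records]
    return coding, noncoding


# Hypothetical usage (ASSUMPTION: 14518 is the last coding column):
# records = read_fasta("../results/datasets/Makona_1610_cds_ig.fas")
# cds, ig = split_alignment(records, 14518)
# write_fasta(cds, "../results/datasets/Makona_1610_cds.fas")
# write_fasta(ig, "../results/datasets/Makona_1610_ig.fas")
```

Opening both output files in AliView afterwards is a quick way to confirm that the boundary falls where expected.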

Trim sites with too many unknowns

Remove all columns with more than 95% unknowns (> 1529/1610 sequences).

None in coding region
36 in noncoding region
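
The count in "> 1529/1610 sequences" follows from the cutoff: 0.95 × 1610 = 1529.5, so a column needs at least 1530 unknowns to exceed 95%. The short sketch below reproduces that arithmetic. The function name is hypothetical, and the strict "more than" comparison is taken from the wording above, not from the script source.

```python
import math
from fractions import Fraction


def min_unknowns_for_removal(n_sequences, cutoff="0.95"):
    """Smallest number of unknowns in a column whose fraction strictly exceeds cutoff.

    Fraction avoids floating-point rounding at the boundary.
    """
    threshold = Fraction(cutoff) * n_sequences  # 1529.5 for 1610 sequences
    return math.floor(threshold) + 1


print(min_unknowns_for_removal(1610))  # 1530, i.e. more than 1529 of 1610
```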

	python msahist.py -i ../results/datasets/Makona_1610_cds.fas -o ../results/datasets/
	python msahist.py -i ../results/datasets/Makona_1610_ig.fas -o ../results/datasets/

	python trimsequences.py -a ../results/datasets/Makona_1610_cds.fas -H ../results/datasets/Makona_1610_cds.hist.csv -o ../results/datasets/ -c 0.95 -p Makona_1610_cds.trimmed
	python trimsequences.py -a ../results/datasets/Makona_1610_ig.fas -H ../results/datasets/Makona_1610_ig.hist.csv -o ../results/datasets/ -c 0.95 -p Makona_1610_ig.trimmed

msahist.py

Create a histogram of the characters at each site in an alignment and save it as a .csv file (assumes a fixed alphabet of possible characters).
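
To illustrate the idea of a per-site histogram, here is a minimal sketch. It is not the code of msahist.py: the alphabet, the function names and the CSV layout (a `site` column followed by one column per character) are all assumptions made for this example.

```python
from collections import Counter

# ASSUMED alphabet; the alphabet used by the real script may differ.
ALPHABET = "ACGTN-?"


def site_histogram(sequences, alphabet=ALPHABET):
    """Return one dict per alignment column mapping each character to its count.

    Expects a non-empty list of equal-length sequences. Characters outside
    the alphabet are silently ignored.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    rows = []
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column)
        rows.append({ch: counts.get(ch, 0) for ch in alphabet})
    return rows


def histogram_to_csv(rows, alphabet=ALPHABET):
    """Render the histogram as CSV text with 1-based site numbers."""
    lines = ["site," + ",".join(alphabet)]
    for i, row in enumerate(rows, start=1):
        lines.append(f"{i}," + ",".join(str(row[ch]) for ch in alphabet))
    return "\n".join(lines) + "\n"
```

A fixed alphabet keeps every row the same width, which makes the CSV easy to load downstream.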

trimsequences.py

Remove positions in the alignment that are marked in a separate FASTA file (usually by "X").
Remove all sites in the alignment where the fraction of ambiguous characters exceeds a given cutoff (requires a histogram file).
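
The two modes described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in trimsequences.py or dropcols.py: all function names are hypothetical, the mask is assumed to be a single sequence as long as the alignment, and the per-site unknown counts are assumed to have already been read from the histogram file.

```python
def columns_from_mask(mask, marker="X"):
    """Return 0-based indices of columns flagged with marker in a mask sequence."""
    return [i for i, ch in enumerate(mask) if ch == marker]


def columns_over_cutoff(unknown_counts, n_sequences, cutoff=0.95):
    """Return 0-based indices of columns whose unknown fraction exceeds cutoff."""
    return [i for i, n in enumerate(unknown_counts) if n / n_sequences > cutoff]


def drop_columns(sequences, columns):
    """Return copies of the sequences with the given 0-based columns removed."""
    drop = set(columns)
    return ["".join(c for i, c in enumerate(s) if i not in drop) for s in sequences]


# Toy example: three sequences, per-column unknown counts taken from a histogram.
alignment = ["ANGT", "ANNT", "CNNT"]
unknowns = [0, 3, 2, 0]
to_drop = columns_over_cutoff(unknowns, n_sequences=3, cutoff=0.95)
trimmed = drop_columns(alignment, to_drop)  # only the all-N column is removed
```

Separating "which columns" from "remove columns" mirrors the way the removal step can be reused on its own.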

dropcols.py

Used by trimsequences.py to remove sites from the alignment. Can also be used independently.
