Lars Nauheimer edited this page Sep 6, 2023 · 18 revisions

Welcome to the HybPhaser wiki!

Here you will find instructions on how HybPhaser works and how to install and run it.

The workflow is also published in APPS, where its application is demonstrated (link to APPS).

Overview

In the first part (SNPs assessment), HybPhaser is used to measure and assess heterozygous sites in the dataset to detect hybrid accessions as well as putative paralogous genes. It can further be used to optimize the dataset by reducing missing data and removing those putative paralogs. Finally, consensus sequences (HybPhaser) or de novo contigs (HybPiper) can be collated to generate sequence lists ready for alignment and phylogenetic analyses.

In the second part (Clade association), the association of reads with divergent clades is assessed to detect hybrid accessions that can be phased into multiple accessions. The framework phylogeny is used to select suitable references for all major clades across the studied group. The software BBSplit is then used to simultaneously map sequence reads to all clade references, recording the proportion of reads that match unambiguously to a single reference. HybPhaser R scripts facilitate running the BBSplit analysis and collate the results into a summary table, from which suitable accessions for phasing can be selected.

In the third part (Phasing), BBSplit is used to simultaneously map the sequence reads of selected accessions to the relevant clade references and distribute the reads into new read files representing accessions of phased haplotypes. HybPhaser R scripts facilitate running BBSplit and generate a summary table of the mapping stats.
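As a rough illustration of what a phasing run involves, the sketch below assembles a BBSplit command line for one accession. It is not taken from the HybPhaser scripts: the file names and the helper `bbsplit_command` are hypothetical, and the `ambiguous2=all` setting is an illustrative choice that may differ from what HybPhaser uses. The parameters `in=`, `ref=`, `basename=`, `refstats=` and `ambiguous2=` are standard BBSplit options.

```python
import shlex


def bbsplit_command(reads, references, out_prefix):
    """Build (but do not run) a BBSplit call that splits reads by reference.

    reads       -- path to the read file of one accession (hypothetical name)
    references  -- list of clade reference FASTA files (hypothetical names)
    out_prefix  -- prefix for the output read files and the stats file
    """
    return [
        "bbsplit.sh",
        f"in={reads}",
        # All references are supplied together so reads are mapped simultaneously.
        "ref=" + ",".join(references),
        # BBSplit replaces % with the reference name: one read file per reference.
        f"basename={out_prefix}_%.fastq",
        # Per-reference mapping statistics, the raw material for a summary table.
        f"refstats={out_prefix}_refstats.txt",
        # Assumption: reads matching several references equally go to all of them.
        "ambiguous2=all",
    ]


cmd = bbsplit_command(
    "hybrid01.fastq",
    ["cladeA_ref.fasta", "cladeB_ref.fasta"],
    "hybrid01_phased",
)
print(shlex.join(cmd))
```

The command is returned as a list so that it could be passed to `subprocess.run` without shell quoting issues. Here it is only printed.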

Finally, the newly generated read files can be assembled using HybPiper and HybPhaser in the same way as the original read files. HybPhaser R scripts can then be used to combine sequence lists and generate a dataset of phased and non-phased accessions. Phylogenetic analysis of this combined dataset can reveal the origin of hybrid taxa and reduce phylogenetic conflict, leading to better resolved phylogenies.

Main Principles of the approach

Mapping reads to de novo contigs to detect heterozygosity

Target capture sequence assembly generally relies on de novo assembly, which can lead to chimeric sequences in the presence of divergence between gene variants. While reference mapping can capture heterozygous sites by assigning ambiguity codes, it relies on a reference for sequence assembly. HybPhaser maps the sequence reads back onto the de novo contig and generates a consensus sequence that codes heterozygous sites (SNPs) with ambiguity codes. The assessment of SNPs across samples and genes can then be used to detect hybrids (high proportions of SNPs in most genes compared to other samples) and paralogous genes (higher proportion of SNPs in a single gene compared to other genes).
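The SNP assessment described above can be sketched in a few lines. This is a minimal illustration, not the HybPhaser R implementation: the function names and the toy sequences are made up, and it assumes that heterozygous sites are coded with two- or three-base IUPAC ambiguity codes while `N`, `-` and `?` denote missing data.

```python
# IUPAC codes standing for two or three possible bases. N is left out because
# it is treated here as missing data rather than as a heterozygous site.
AMBIGUITY_CODES = set("RYSWKMBDHV")
MISSING = set("N-?")


def snp_proportion(consensus):
    """Fraction of called (non-missing) bases that are ambiguity codes."""
    called = [base for base in consensus.upper() if base not in MISSING]
    if not called:
        return 0.0
    return sum(base in AMBIGUITY_CODES for base in called) / len(called)


def mean_snp_proportion(consensus_by_sample):
    """Mean SNP proportion across genes for every sample.

    consensus_by_sample maps sample name -> {gene name -> consensus sequence}.
    """
    means = {}
    for sample, genes in consensus_by_sample.items():
        values = [snp_proportion(seq) for seq in genes.values()]
        means[sample] = sum(values) / len(values) if values else 0.0
    return means


# Toy data: sampleB carries ambiguity codes in both genes, sampleA in none.
toy = {
    "sampleA": {"gene1": "ACGTACGT", "gene2": "ACGTACGT"},
    "sampleB": {"gene1": "ACRTAYGT", "gene2": "ASGTACKT"},
}
print(mean_snp_proportion(toy))
```

A sample with a high mean across most genes, relative to the other samples, would be a hybrid candidate. The same per-gene values compared across samples would point to putative paralogs.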

Phasing of reads by mapping to multiple references simultaneously

Hybrid accessions contain sequence reads from both parental lineages. If there is sufficient divergence between the parental lineages, the reads can be associated with different clades by mapping them simultaneously to multiple references and recording which reference each read maps to best. HybPhaser applies this principle in two steps. First, the association of reads with clades across the studied group is assessed by mapping reads from all accessions to several references representing clades. Second, samples that have reads associated with more than one clade can be phased by mapping only to the relevant references and separating the read files accordingly.
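The first step, clade association, boils down to comparing the proportions of reads that map unambiguously to each clade reference. The sketch below shows that arithmetic on made-up read counts. The function names and the 5% cut-off are assumptions for illustration only. In practice the choice of accessions to phase is made by inspecting the summary table.

```python
def clade_association(unambiguous_reads, total_reads):
    """Proportion of a sample's reads mapping unambiguously to each clade.

    unambiguous_reads maps clade name -> number of reads that matched only
    that clade's reference. total_reads is the number of reads that were mapped.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {clade: n / total_reads for clade, n in unambiguous_reads.items()}


def associated_clades(proportions, min_proportion=0.05):
    """Clades whose read proportion reaches the (hypothetical) cut-off."""
    return sorted(c for c, p in proportions.items() if p >= min_proportion)


# Made-up counts for one accession mapped against three clade references.
counts = {"cladeA": 400, "cladeB": 350, "cladeC": 10}
proportions = clade_association(counts, total_reads=1000)
clades = associated_clades(proportions)

# An accession associated with more than one clade is a candidate for phasing.
print(proportions, clades, len(clades) > 1)
```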

Webinar

More explanations of the concepts used and the workflow itself can be found in this recording of a webinar: BioCommons Webinar on HybPhaser 10/06/21 (apologies for the poor sound quality).
