jthlab/bim-paper

Results for β-Imbalance (BIM): Robust detection of natural selection using a probabilistic model of tree imbalance

  • Here is the paper of this work.
  • To use the software from this paper, see here.
  • Cite this paper here.

Introduction

We developed a fast likelihood-based method to infer natural selection from whole-genome sequences. Unlike classical neutrality tests, our method can account for other factors in the evolutionary process, such as population size histories. Our main contribution is a model of the distribution of split sizes in a genealogical tree, which measures how imbalanced the tree is. A homozygote beneficial mutation causes unbalanced splits in a tree; conversely, a heterozygote beneficial mutation causes balanced splits. We model the distribution of these splits by linking it to the effect size and the type of the mutation.

A rough sketch of the idea is as follows. Suppose each elephant symbol represents a gene on an elephant's chromosome, so that the tree is a gene genealogy. Just after the top split, a mutation arises in the left elephant (red). After many generations, both types (gray and red) survive to the present. Our aim is to model the tree topologies by looking at today's elephant population. Since we cannot observe the actual gene genealogies, we do not know where the mutations are located, so we search through the possible tree topologies that could have given rise to the current sample. Naively, if the sizes of the two types are quite different, we might guess that this location is under directional selection, because environmental factors favor one type and most of the sample descends from that type's lineage. If the sizes of the two types are close to each other, we might guess that the location is under balancing selection.
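To make the notion of a split-size distribution concrete, here is a minimal sketch of the classical Aldous beta-splitting distribution, where a single parameter `beta` controls how balanced the splits are. This is an illustration only: we assume this textbook form for the example, the function name `beta_splitting_pmf` is hypothetical, and it is not the likelihood implemented in our software.

```python
from math import exp, lgamma


def beta_splitting_pmf(n, beta):
    """Return [P(left clade has i of n leaves) for i = 1..n-1] under Aldous' model.

    Illustrative only; the parametrisation is the textbook one and is assumed here.
    """
    if n < 2:
        raise ValueError("need at least two leaves to split")
    if beta <= -2:
        raise ValueError("beta must be greater than -2")
    # Unnormalised log-weights:
    # Gamma(beta + i + 1) * Gamma(beta + n - i + 1) / (i! * (n - i)!)
    log_w = [
        lgamma(beta + i + 1) + lgamma(beta + n - i + 1)
        - lgamma(i + 1) - lgamma(n - i + 1)
        for i in range(1, n)
    ]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [exp(x - shift) for x in log_w]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for beta in (-1.5, 0.0, 10.0):
        pmf = beta_splitting_pmf(10, beta)
        print(beta, [round(p, 3) for p in pmf])
```

With `beta = 0` every split size is equally likely. Smaller values put more mass on extreme (unbalanced) splits, and larger values concentrate mass on even (balanced) splits.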

Examples

These examples demonstrate simple uses of our method.

SLiM Simulations

SLiM allows us to simulate sequences under selection. We considered these 4 settings:

You can also check the plots folder to see our results.

1000 Genomes Project

How to replicate the results?

We applied our model to the 1000 Genomes Project, specifically to the tree sequence data. All results in this project can be reproduced with the Analysis notebook and the median centered Analysis notebook. Running them generates ~30G of result data and took ~3 hours to complete on an HPC cluster. Replication does not require an HPC cluster, but it would take much longer without one. I use this module to send jobs to the cluster, but every srun.run(<terminal job>) call can be replaced by ! <terminal job>. This repository contains only a small portion of those results. Here is a sketch of our data analysis pipeline:

  1. First, we estimated the population size histories of 26 populations using a piecewise constant population size model. You can access the notebook here.
  2. Then we calculated the statistics and estimated our splitting parameters using this module. We used 10kb windows with a 5kb stride.
  3. For the median centered versions (useful for detecting population-specific selection), we calculated the average statistic for each window to understand which populations diverge from the others. We use this to eliminate signals shared among human populations. Later we call these variables <stat>Cmedi or <stat>Cmean.
  4. For a statistic calculated on a chromosome and a population, we apply change point detection to isolate the spikes. This reduces the noise and helps us understand the length of the region under selection.
  5. We estimate the average statistic for each of the segments found by change point detection. The variance of this average also has an autocovariance component because of linkage disequilibrium. To account for this, we calculate the autocovariance function for each statistic-population pair.
  6. Then we computed the genome-scan p-values for each segment for a population. Segments are coded as p<pop_id>c<chrno>.<segment_order>. They can be accessed from here, along with the p-value plots. The plots are named <tail of the distribution>_<pop_id>_<stat>.jpg.
  7. To compare our beta-splitting parameters with results from other selection papers, we also calculated rank p-values.
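Steps 2, 3 and 5 above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in our notebooks: all function names are hypothetical, windows are assumed to be half-open and full-length only, and the variance formula is the standard one for the mean of a stationary series, which we assume here for the example.

```python
from statistics import median


def sliding_windows(chrom_length, size=10_000, stride=5_000):
    """Return half-open (start, end) windows; only full-length windows are kept."""
    return [(s, s + size) for s in range(0, chrom_length - size + 1, stride)]


def median_center(stat_by_pop):
    """Subtract the across-population median from each window's statistic.

    stat_by_pop maps a population label to a list of per-window values;
    all lists are assumed to have the same length.
    """
    pops = list(stat_by_pop)
    n_windows = len(stat_by_pop[pops[0]])
    centers = [median(stat_by_pop[p][w] for p in pops) for w in range(n_windows)]
    return {
        p: [stat_by_pop[p][w] - centers[w] for w in range(n_windows)]
        for p in pops
    }


def autocovariance(x, max_lag):
    """Sample autocovariance of x at lags 0..max_lag (divides by len(x))."""
    n = len(x)
    mu = sum(x) / n
    return [
        sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n
        for k in range(max_lag + 1)
    ]


def variance_of_segment_mean(acov, m):
    """Variance of the mean of m consecutive windows.

    Uses Var(mean) = (1/m) * [acov[0] + 2 * sum_{k=1}^{m-1} (1 - k/m) * acov[k]],
    treating lags beyond len(acov) - 1 as zero.
    """
    total = acov[0]
    for k in range(1, min(m, len(acov))):
        total += 2 * (1 - k / m) * acov[k]
    return total / m


if __name__ == "__main__":
    print(sliding_windows(25_000))
    print(median_center({"A": [1.0, 5.0], "B": [2.0, 5.0], "C": [9.0, 8.0]}))
    print(variance_of_segment_mean(autocovariance([1.0, 2.0, 3.0, 4.0], 1), 2))
```

Positive autocovariance between neighbouring windows inflates the variance of a segment mean relative to the independent case, which is why ignoring linkage disequilibrium would make segments look more significant than they are.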

Highlighted Results

p-values of the genome scan segments around the specified gene:



Z scores of the same segments:



How to browse the results?

  1. See the Segmented genome scans notebook to plot the genome-scan p-values for our method. You can specify either the gene or the position on the genome.
  2. We also provide a command-line program to browse the results. To use it, change into the 1000GenomesProject folder in your terminal, then run the script with Python:
cd 1000GenomesProject
python browser.py

Here is a demo: If you want to search for significant segments, go to the p-value plots. Locate the statistic and the significant tail along with the population, and then enter the segment into the browser. For our beta-splitting parameters, significant lower tails imply directional selection and upper tails imply balancing selection.
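If you want to handle segment codes in your own scripts, the p<pop_id>c<chrno>.<segment_order> format can be split with a small regular expression. The sketch below is not part of browser.py; the helper name `parse_segment_code` is hypothetical, and it assumes that the population id and segment order are integers and that the chromosome label is alphanumeric.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    pop_id: int
    chrom: str
    order: int


# Assumed field shapes: integer pop_id, alphanumeric chromosome, integer order.
_SEGMENT_RE = re.compile(r"p(\d+)c(\w+)\.(\d+)")


def parse_segment_code(code):
    """Split a code such as 'p3c2.17' into its three fields."""
    match = _SEGMENT_RE.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a segment code: {code!r}")
    pop_id, chrom, order = match.groups()
    return Segment(int(pop_id), chrom, int(order))


if __name__ == "__main__":
    print(parse_segment_code("p3c2.17"))
```

If the real codes use a different shape for any field, only the regular expression needs to change.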
