This repository contains a snakemake pipeline to benchmark enhancer-gene (E2G) predictions using information from genome-wide association studies (GWAS).
There are two benchmarking methods:
- Overlapping fine-mapped GWAS variants with predicted enhancers
- Using E2G predictions to link GWAS credible sets to causal genes
This README will explain the steps required to run this benchmarking pipeline.
To run the pipeline, first clone the repository. Then, in the config/config_e2g.yaml file, specify the path to your E2G predictions. The file should contain 5 columns: chrom, start, end, gene, and score. The pipeline requires snakemake and conda. Once set up, all the required GWAS data files should also be downloaded through GitHub. The pipeline can be run using the following command:
snakemake --use-conda all
Note: it might be quite slow to run on just 1 core, so it would be good to parallelize it based on your cluster settings.
As input, this pipeline takes enhancer-gene predictions generated from your method of choice. The required input file is a BED file containing the following columns:
- chromosome (ie: chr1, chr2, etc.)
- start coordinate
- end coordinate
- gene name (ie: SAMD11, NOC2L, etc.)
- score (how confident is the model in the enhancer-gene prediction)
For overlapping fine-mapped GWAS variants with predicted enhancers, we are only interested in the enhancer regions themselves, and not their gene links. First, we sort the enhancer regions by genomic position. Then, since there may be overlapping enhancer regions (ie: a region linked to 2 genes, two slightly offset regions from different biosamples, etc.), we merge overlapping enhancer regions together.
The next step is to take every common variant in hg38 (from the 1000 Genomes Project), and mark whether that variant overlaps a predicted enhancer region. For runtime, this is done separately for each chromosome.
The next step is to compute enrichment, precision, and recall values for the presence of fine-mapped variants within enhancer regions. The definitions are as follows:
- Enrichment: (# finemapped variants overlapping enhancers / # common variants overlapping enhancers) / (# finemapped variants / # common variants). In other words, do the enhancer annotations contain a greater proportion of finemapped variants than the genome?
- Precision: (# finemapped variants overlapping enhancers / # common variants overlapping enhancers). In other words, what fraction of the variants overlapping enhancers are also finemapped for a GWAS trait?
- Recall: (# finemapped variants overlapping enhancers / # finemapped variants). In other words, what fraction of finemapped variants are recovered by the enhancer annotations?
This can be done for different sets of traits. In this pipeline, we have currently implemented this for K562-related traits (red blood cell-related traits) and GM1287-related traits (white blood cell-related). For each of these, there are different lists of finemapped variants per trait, and the enrichment, precision, and recall are computed on the level of each list. For each value, standard deviations are computed by performing 100 bootstrap iterations, bootstrapping across chromosomes. P-values for each are computed by using permutation testing, shuffling the overlaps between common variants and enhancer regions. The 100 permutations are used to compute a normal null distribution, from which p-values are computed.
Another way to benchmark enhancer-gene predictions using GWAS is to benchmark how well enhancer-gene predictions can link GWAS credible sets to their target genes. In the ENCODE-E2G paper, one approach presented to do this is to intersect predictions from the ENCODE-rE2G enhancer-gene model with PoPS scores. PoPS scores specify whether a gene is likely to be relevant for a disease. One way to link credible sets to genes with PoPS would then be to take the gene with the highest PoPS score in the vicinity (within 2 MB) of a credible set. Here, be default, we use an approach that intersects the top E2G gene with the top 2 PoPS genes to establish credible set-gene links. We then compute the precision and recall against a silver-standard dataset curated by the ENCODE consortium (see ENCODE-rE2G paper).
All of the enhancer-gene predictions from various methods are avaiable on Synapse using the link provided in the paper. First, this data needs to be downloaded from Synapse to the resources/
folder. (warning: the entire data for this is ~76GB).