This code can conduct statistical colocalization between GWAS-significant SNPs and the corresponding eQTLs (for a given tissue) on a gene-by-gene basis. This is ideally suited for situations in which we have a set of genes and tissues of interest (e.g. obtained from running Transcriptome wide Association Studies) and corresponding gene-expression and GWAS summary statistics.
There is no required version of R, but R (>=3.5.0) is preferred.
The following libraries are required to run this sofware:
- coloc
- simsalapar
- optparse
- data.table
- Installed GCTA v1.26 or higher
- Seee https://github.com/hakyimlab/MetaXcan/wiki/Tutorial:-GTEx-v8-MASH-models-integration-with-a-Coronary-Artery-Disease-GWAS for basic steps on harmonizing between GWAS and GTEx v8 datasets so that they are on the same genome build (run_harmonization code in the example also shows the basic protocol for harmonization between a GLGC lipid dataset and GTEx v8 eQTL dataset)
- Note that the following protocol is applied to each combination of gene, trait (quantitative or case/control for a given GWAS dataset), and tissue. We run statistical colocalization between GWAS summary statistics and gene expression summary statistics obtained from GTEx v8.
- For running colocalization, we first identify all the SNPs in the GWAS dataset that are within a specified window (default = 1Mb) from the TSS and TES of the selected gene for a given tissue. Of these, we consider as “lead SNPs” all SNPs in the GWAS dataset with a p-value < a chosen threshold (default = 0.0001) that are at least a specified distance (default = 200 KB) apart from each other.
- For each lead SNP, we collect all the SNPs within a specified distance (default = 200KB on either side of the SNP) in both GWAS and eQTL datasets.
- We subsequently estimate the coloc probability of H3 (alternative hypothesis that eQTL and GWAS associations correspond to independent signals) and H4 (alternative hypothesis that eQTL and GWAS associations correspond to the same signal) for the lead SNP-eGene pair.
- We assume a prior probability that a SNP is associated with (1) lipid phenotype (default=1E-04), (2) gene expression (default=1E=0-4), and (3) both GWAS and gene expression (default=1E-06) for all coloc analyses.
- OPTIONAL: We use the following GCTA-COJO/coloc protocol in the GWAS and eQTL datasets for each lead SNP to identify putative independent secondary associations at the locus.
- For each lead SNP, we obtain p-values conditional on the top-eQTL (p-value < chosen threshold; default = 0.0001) in the eQTL dataset using --cojo-cond option to perform stepwise regression at that locus. We repeat this in the GWAS dataset as well if the top-eQTL also has GWAS p-value < chosen threshold (default = 0.0001).
- In our example dataset, we use 1000 genome EUR as reference dataset to calculate pairwise LD. In reality, it is recommended to use at least 5K individuals in the reference dataset.
- We run coloc between GWAS and eQTL datasets (same parameters as before) at the locus using conditional probabilities obtained from COJO.
- We select the lead SNP with P[H3]<0.5 that has the highest P[H4] for a given eGene and use this as the corresponding P[H3] and P[H4] for the gene.
- We can subsequently filter out all genes whose P[H3]>0.5 for a given tissue (these could be LD contaminated; i.e. GWAS causal variant and eQTL are different but in LD).
#sample GWAS file (pre-harmonization)
SNP BP CHR A1 A2 BETA SE P FREQ N
rs1172982 100230111 1 t c 0.0043 0.0055 0.4689 0.3219 89888
rs1172981 100230197 1 t c 0.0057 0.0103 0.7688 0.06069 89888
rs11166327 100230867 1 c t 0.0053 0.0106 0.7725 0.06069 89888
rs4908018 100234367 1 a c 0.0058 0.0103 0.7887 0.05937 89841
rs2392072 100234743 1 a g 0.0039 0.0058 0.4052 0.3153 86877
#sample eQTL file
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000261456.5 chr10_11501_C_A_b38 -62662 124 124 0.116105 0.0400165 -0.187313 0.0909802
ENSG00000261456.5 chr10_11553_G_C_b38 -62610 108 108 0.105058 0.0216029 -0.221475 0.0961113
ENSG00000261456.5 chr10_18924_A_C_b38 -55239 86 86 0.0776173 0.283069 -0.114165 0.106242
ENSG00000261456.5 chr10_44215_T_TTCTG_b38 -29948 13 13 0.0122411 0.608588 0.130235 0.254164
ENSG00000261456.5 chr10_45349_G_A_b38 -28814 21 21 0.0180723 0.579648 0.124365 0.224383
#sample Input dataset
SYPL2 TC Liver GLGC ENSG00000143028 1
ANGPTL3 LDL Liver GLGC ENSG00000132855 1
PSRC1 TC Adipose_Visceral_Omentum GLGC ENSG00000134222 1
PSRC1 LDL Liver GLGC ENSG00000134222 1
PLTP TG Adipose_Visceral_Omentum GLGC ENSG00000100979 20
NBEAL1 TC Adipose_Visceral_Omentum GLGC ENSG00000144426 2
## Pre-saved files needed to run this code (these are downloaded with the code):
#eQTL sample size file (GTEX v8)
Tissue X..RNASeq.and.Genotyped.samples
Adipose_Subcutaneous 581
Adipose_Visceral_Omentum 469
Adrenal_Gland 233
Artery_Aorta 387
Artery_Coronary 213
Artery_Tibial 584
Brain_Amygdala 129
Brain_Anterior_cingulate_cortex_BA24 147
#Gene start stop file (GRCh38)
ENSG00000265692 82290046 82292814 LINC01970 17
ENSG00000260563 82293716 82294910 AC132872.1 17
ENSG00000173762 82314868 82317608 CD7 17
ENSG00000141574 82321024 82334074 SECTM1 17
ENSG00000182459 82359247 82363775 TEX19 17
ENSG00000278964 82362349 82363196 AC132938.4 17
ENSG00000181408 82371400 82375586 UTS2R 17
ENSG00000260011 82381110 82382690 AC132938.1 17
ENSG00000181396 82389210 82418637 OGFOD3 17
ENSG00000264812 82400703 82401382 AC132938.2 17
ENSG00000201239 82417226 82417321 Y_RNA 17
ENSG00000169660 82418318 82442645 HEXD 17
ENSG00000279066 82425498 82427310 HEXD-IT1 17
#Chromosome-wise reference files for COJO in {chr}.bim/.bed/.fam format (1000 Genomes EUR (b38) available in the example dataset)
Trait TC TC TC TC
Tissue Adipose_Visceral_Omentum Adipose_Visceral_Omentum Adipose_Visceral_Omentum Adipose_Visceral_Omentum
GWAS.Data GLGC GLGC GLGC GLGC
CHR 2 2 2 2
GWAS.Lead.SNP rs7581601 rs11694172 rs3769676 rs1174604
GWAS.BP 202633063 202667581 203401620 203556561
GWAS.P 4.494987e-09 1.417578e-11 7.515972e-08 3.270255e-08
GWAS.Beta -0.0217 0.0277 0.0199 0.0210
GWAS.SE 0.0037 0.0041 0.0037 0.0038
GWAS.N 187300 187092 185112 187263
GWAS.A1 C G C C
GWAS.A2 A A T T
GWAS.Freq 0.3338 0.2164 0.3443 0.3496
GWAS.MAF 0.3338 0.2164 0.3443 0.3496
ENSG_gene ENSG00000144426 ENSG00000144426 ENSG00000144426 ENSG00000144426
Gene NBEAL1 NBEAL1 NBEAL1 NBEAL1
Region 202437904:202832555 202469849:202867547 203202545:203600396 203357100:203747335
Gene.Start 203014879 203014879 203014879 203014879
Gene.End 203226378 203226378 203226378 203226378
Gene.Num.SNPs 1584 1584 1584 1584
Gene.Num.GWAS.Sig.SNPs 7 7 7 7
Best.Causal.SNP 2:202797474 2:202797474 2:203206233 2:203556561
TopSNP_eQTL 2:202791831 2:202791831 2:203206233 2:203425314
eQTL.Beta -0.13291200 0.00157705 0.01767270 0.01257750
eQTL.SE 0.0327190 0.0342433 0.0306130 0.0302590
eQTL.P 5.85398e-05 9.63290e-01 5.64066e-01 6.77882e-01
eQTL.N 469 469 469 469
eQTL.A1 A A T T
eQTL.A2 C G C C
eQTL.MAF 0.260684 0.250533 0.357602 0.352878
Coloc.Ratio 0.0012040 0.0007724 0.0179100 0.0522000
Coloc.H0 1.476e-07 9.218e-08 4.173e-04 4.350e-02
Coloc.H1 0.004493 0.002836 0.008311 0.865400
Coloc.H2 3.267e-05 3.239e-05 4.659e-02 4.143e-03
Coloc.H3 0.99430 0.99640 0.92780 0.08239
Coloc.H4 0.0011980 0.0007702 0.0169200 0.0045380
Coloc2.Ratio 0.0012910 0.0007955 0.2025000 0.1898000
Coloc2.H0 4.699e-08 2.820e-08 4.549e-02 4.616e-02
Coloc2.H1 0.0014300 0.0008674 0.9058000 0.9185000
Coloc2.H2 3.277e-05 3.245e-05 1.879e-03 1.385e-03
Coloc2.H3 0.99720 0.99830 0.03733 0.02749
Coloc2.H4 0.0012890 0.0007948 0.0094800 0.0064410
- Note that the output file has been transposed for easy viewing of all columns (here rows).
- Clone the repository.
git clone https://github.com/RitchieLab/Gene-level-statistical-colocalization
- Go to the cloned folder (set as working directory).
cd Gene-level-statistical-colocalization
-
Download and unzip example data https://ritchielab.org/files/Lipid_Pleiotropy_project/gene_level_coloc_example_data.tar.gz and save downloaded data under the cloned folder.
-
tar -xvzf gene_level_coloc_example_data.tar.gz
-
run harmonization (see
run_harmonization
example code underharmonized_gwas
folder) -
Add
gcta64
to the folder path. -
Run
sh example
.
Rscript run_gene_level_coloc.R \
--gwas_data_name="GLGC" \
--trait="LDL" \
--tissue="Liver" \
--gene_of_interest="ENSG00000144426" \
--run_cojo=FALSE \
--cojo_maf=0.01 \
--chr=2 \
--coloc_p1=1e-04 \
--coloc_p2=1e-04 \
--coloc_p12=1e-06 \
--lead_snp_window_size=200000 \
--gene_boundary_window_size=1000000 \
--gwas_p_threshold=0.0001 \
--eqtl_p_threshold=0.0001 \
--gwas_response_type="quant" \
--gwas_file=GLGC_LDL_GWAS_harmonized.txt.gz \
--genes_file=${genes_file} \
--output_folder=${output_folder} \
--reference_folder=${reference_folder} \
--eqtl_folder=${path_gtex} \
--core=10 \
--eqtl_sample_size_file=${eqtl_sample_size_file}
- Note that it is not recommended to run a for loop over genes specified in Input file in practice; one can parallelize runs in an HPC environment.
Following is an explanation of the listed parameters:
- --chromosome The chromosome to which the given gene(s) correspond(s)
- --trait Name of the trait
- --tissue Name of tissue corresponding to the eQTL dataset
- --gwas_data_name Name of GWAS dataset (e.g. GLGC, GIANT)
- --gene_of_interest ENSG_gene(s) of interest from genes_file (ignore decimal point)
- --gene_boundary_window_size The window around the chosen gene from the TSS and TES of the gene (default = 1 Mb)
- --lead_snp_window_size The window around the lead SNP used for colocalization analysis
- --gwas_p_threshold The threshold for GWAS p-value for SNP variants corresponding to the chosen gene(s) (default = 1E-03)
- --eqtl_p_threshold The threshold for eQTL p-value for SNP variants mapping to the chosen eGene(s) (default = 1E-02)
- --gwas_response_type The response variable type for GWAS (quant or cc)
- --coloc_p1 Prior probability a SNP is associated with GWAS trait (default = 1E-04)
- --coloc_p2 Prior probability a SNP is associated with gene expression (default = 0.001)
- --coloc_p12 Prior probability a SNP is associated with GWAS trait and gene expression (default = 1E-06)
- --gwas_file File for tab separated harmomized GWAS summary statistics data (with header) for chosen trait
- --genes_file File (with path) for tab separated list of chosen genes with column names = ENSG_gene, gene_start_position, gene_stop_position, chromosome
- --snps_file External Input file of GWAS varID to be used for analyses, i.e. chr:bp
- --run_cojo Flag to run GCTA-COJO for identifying secondary signals; TRUE/FALSE (default=FALSE)
- --cojo_maf MAF threshold used in GCTA-COJO(default=0.01)
- --cojo_p P-value threshold for gcta-cojo (default = 1E-03)
- --reference_folder Path of the folder with plink files for LD calculation in gcta (should have chromosome number in the filename in .chromosome.bim/bed/fam format)
- --eqtl_folder Path of the GTEX v8 datasets split by chromosome
- --output_folder Path of the output folder
- --core Number of cores to run parallel tasks (default = 10)
- --ld_folder Path of the folder with plink files for LD calculation in gcta (should have chromosome number in the filename in .chromosome.bim/bed/fam format)
- --eqtl_sample_size_file Filename (with path) of sample sizes for eQTL datasets across different tissues; has two columns corresponding to tissue name and sample size
- Run
remove_non_colocalized_genes.R
Rscript remove_non_colocalized_genes.R \
--file_name chr1_GLGC_LDL_Adipose_Subcutaneous_colocProbs.txt \
--coloc_p_h3_threshold 0.5 \
--coloc_p_h4_threshold 0.01 \
--output_folder "output"
Following is an explanation of the listed parameters:
- --file_name Load coloc output file obtained after running run_gcta_and_coloc.R
- --coloc_p_h3_threshold Threshold for minP[H3] per gene across all lead SNPs in the gene (default=0.5)
- --coloc_p_h4_threshold Threshold for maxP[H4] per gene across all lead SNPs in the gene (default=0.01)
- --output_folder Folder where list of genes after filtering out ones with no evidence of colocalization with GWAS SNPs is saved
The manuscript "Unified framework identifies novel replicating links between plasma lipids and diseases from Electronic Health Records across large-scale cohorts" is currently under review.