Code accompanying the manuscript "Long read proteogenomics to connect disease-associated sQTLs to the protein isoform effectors in disease"
The full text can be found in Abood et al. 2024, AJHG.
We present a novel, generalizable approach that integrates information from GWAS, splicing QTLs (sQTLs), and PacBio long-read RNA-seq in a disease-relevant model to infer the effects of sQTLs on the protein isoforms they ultimately encode.
- Processed and input data is found in
- Raw long-read sequencing data are available under GEO accession GSE224588
- Use setup_r_env.R to set up the R environment with all the needed packages.
- The repo is broken down into three major sections:
  - sQTL_colocalization_analysis: This directory contains the code needed to replicate the Bayesian colocalization analysis with Coloc. Please refer to the README.md within the directory for further information
  - Reference_transcriptome_generation: This directory contains the code to generate the reference transcriptome from long-read RNA-seq data. Please refer to the README.md within the directory for further information
    - Iso-Seq analysis: from raw reads to isoform classification
    - Step 1: Perform analyses on outputs from SQANTI and cDNA_cupcake
  - sQTL_to_isoform_mapping
    - Step 2: Characterize full-length isoforms (known and novel) containing the colocalized junctions
    - Step 3: Add effect size and direction of effect to colocalized junctions
    - Step 4: Annotate lead sQTLs and their proxies, followed by positional and enrichment analyses
    - Step 5: Differential analyses (DE and DIU) using tappAS
    - Step 6: Integrate multiple datasets from the literature and from our own analyses to prioritize isoforms for experimental validation
    - Step 7: ORF analyses, including NMD and truncation analysis, performed using a beta version of Biosurfer
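
To illustrate the idea behind Step 2 (finding the full-length isoforms that contain a colocalized junction), here is a minimal sketch in Python. It is not the code used in this repo. The function names, the toy isoforms, and the coordinate convention (a junction is written as the last base of the upstream exon and the first base of the downstream exon) are all assumptions made for the example; junction callers differ in how they report coordinates, so adjust accordingly.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]      # (start, end), assumed 1-based inclusive
Junction = Tuple[int, int]  # (last base of upstream exon, first base of downstream exon)


def isoform_junctions(exons: List[Exon]) -> List[Junction]:
    """Return the splice junctions implied by an isoform's exons."""
    ordered = sorted(exons)  # order exons by genomic start
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def isoforms_with_junction(
    isoforms: Dict[str, List[Exon]], junction: Junction
) -> List[str]:
    """Return the names of isoforms whose exon chain contains the junction."""
    return sorted(
        name
        for name, exons in isoforms.items()
        if junction in isoform_junctions(exons)
    )


# Hypothetical toy data: two isoforms of one gene on the same strand.
toy_isoforms = {
    "known_iso": [(100, 200), (300, 400), (500, 600)],
    "novel_iso": [(100, 200), (500, 600)],  # skips the middle exon
}

# A hypothetical colocalized junction joining the first exon to the last.
print(isoforms_with_junction(toy_isoforms, (200, 500)))
```

Exact coordinate matching is the simplest choice here; a real analysis would also need to account for chromosome, strand, and any offset between the junction caller and the transcript annotation.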