Some useful functions for gene expression analyses
A function for normalising gene expression datasets in a comparative manner over different sequencing libraries and experiments. It builds on FPKM normalisation and needs FPKM normalised expression values as input. zFPKM has been developed by Hart and colleagues [1]. I here implemented this function for R.
Usage: Input values must have NA's in stead of 0 (or log2(0) = -Inf) and be log2 transformed. On a gene (rows) by library (columns) matrix, use then as in
> zFPKM_data <- apply(FPKM_data, 2, z_fpkm)
Note that this function also applies the cutoff suggested in the study¹ by removing genes below an expression value of log2(0.125) = -3.
[1] Hart et al. 2013: Finding the active genes in deep RNA-seq gene expression studies. BMC Genomics 14:778.
PST is a measure for differentiation in phenotypic quantitative traits similar to Fisher's fixation index FST for genetic differentiation. PST is an approximation to QST, when the additive genetic variance cannot be estimated from breeding experiments [2]. As gene expression constitutes a quantitative trait, quantitative trait framework can be applied to gene expression levels when one wants to compare expression differences between populations or species. In essence, PST compares within- (vW) and between-population variance (vB). The function as implemented here returns PST, vW, vB. It is calculated as in Antoniazza and colleagues [3], but for two populations only as of now.
Usage: Input data should be normalised (e.g., using zFPKM). This assumes a gene (rows) by individual (columns) matrix. First, fixed effects (the two populations to be compared; here called species) and an optional number of random effects (here, sex and tissue) need to be defined like this:
> spec <- rep(c("A","B"),each=10,times=2)
> tis <- rep(c("tis\_1","tis\_2"),each=20)
> sex <- rep(c("f","m","f","m"),each=5,times=2)
The function is then called as follows:
> pst_data <- apply(dataset, 1, Pst, fixed=spec [, tis=tis, sex=sex])
PST, vW, and vB are then retrieved as the three dataframe rows:
> pst <- pst_data[1,]
> vW <- pst_data[2,]
> vB <- pst_data[3,]
Note that this formula allows for negative values in both PST and variance. It is possible to constrain PST at 0 depending on downstream analyses. I chose to not constrain PST, because correlations are better done without such constraint.
[2] Leinonen et al. 2006: Contrasting patterns of body shape and neutral genetic divergence in marine and lake populations of threespine sticklebacks. Journal of Evolutionary Biology 19:1803-1812.
[3] Antoniazza et al. 2010: Local adaptation maintains clinal variation in melanin-based coloration of European barn owls (Tyto alba). Evolution 64:1944-1954.
tau is a measure for tissue specificity of gene expression as described by Yanai and colleagues [4].
Usage: Summarise tissue replicates using mean() or similar. For it to function correctly, NA's must be 0. For a gene (rows) by tissue (columns) matrix, apply then as:
> tau <- apply(dataset, 1, tau)
[4] Yanai et al. 2004: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21:650-659.