Instructions for developing new pages for the MarineOmics webpage
If you are interested in submitting new content to the MarineOmics.github.io website, this is the guide for you! While the website managers will work with you to help prepare and publish your new page, this guide can be a useful resource as you develop your content. Below we provide general instructions as well as examples of how to build specific features (e.g., tables, code chunks). Prior to developing new pages, please contact the current Website Manager Jason Johns (jasonjohns[at]ucsb.edu). Do not hesitate to reach out with any other questions, comments, or features you'd like added to this guide.
The MarineOmics website is published through GitHub Pages directly from the MarineOmics GitHub repository. In short, each page of the website is rendered as an .html from an RMarkdown (.Rmd) document. The two most important files for publishing the general website are the _site.yml and the index.Rmd/index.html.
**_site.yml**: the 'under the hood' backbone of the website, where things like the visual appearance of the website and the navigation bar are specified

**index.html**: rendered from index.Rmd - the home page from which all other pages branch

* The website manager will modify the index.html and _site.yml files for your page when you are ready to publish it, so you don't need to know exactly how they work, but it's good to know what they are at a basic level.
The rendered .Rmd and .html files for each page on this site can be seen in the MarineOmics GitHub repository. It is possible to use .Rmd files in base R, but we recommend using RStudio. If you don't have much experience with RStudio, it's quite intuitive, with a shallow learning curve and many existing online resources for using it.
Note: it's a good idea to look at existing .Rmd scripts (including the one that codes for this page) to see examples of how a particular feature can be accomplished, such as citing references, linking a website, building a table, incorporating figures, etc. We will explicitly go over some of those features below, but there are many other bells and whistles that can be explored through online resources.

Here are some general RMarkdown tutorials that we like to reference:

- Chapter 10 from R Markdown: The Definitive Guide (Xie, Allaire, & Grolemund 2022)
- R Markdown Websites (Garrett Grolemund)
One of the strengths of .Rmd documents is their versatility. They can hold everything from code (not only in R but also in other languages) to formatted text, figure outputs, and more. Code must be enclosed within 'code chunks'; anything outside of code chunks will be read as text, similar to a basic text file.
Below is a basic R code chunk, which can be inserted by clicking the green +C button on the RStudio toolbar or typing the keyboard shortcut (Ctrl + Alt + I; Cmd + Option + I on macOS):
```{r, echo = TRUE, include = TRUE}
# Here's a simple linear model in R as an example
my_lm <- lm(y ~ x, data = df)
```
Whether code chunks and/or their output are included in the knitted .html can be controlled with the logical chunk options echo and include.
* Other options exist for code chunk output which can be explored -online and in other .Rmd scripts in this repository.
The setup chunk is where we establish global settings for the entire document. This is done with the knitr command opts_chunk$set, as shown in the example below. This command is typically included in the first code chunk, where we also load our page formatting packages and adjust the settings for citations. When a new .Rmd is made, this command is automatically included in the first chunk, which is named "setup" by default. The name of a code chunk can be specified after the r in ```{r ____}.
```{r setup}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE)
# load page formatting packages
library(knitr)
library(knitcitations)
library(kableExtra)
# set the dimensions for figure outputs in a separate line of code for cleanliness
knitr::opts_chunk$set(fig.width = 10, fig.height = 5)
# set the format for in-text citations in the references section at the bottom of the page
cite_options(citation_format = "pandoc", max.names = 3, style = "html",
             hyperlink = "to.doc")
```
* Note the cache option. When set to TRUE, R will cache all of the output created from knitting the RMarkdown document. This is especially useful for pages with heavy computation that take a long time to knit.
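For example, a named chunk that caches a slow step might look like the sketch below (the chunk name and its contents are placeholders):

```{r heavy-model-fit, cache = TRUE}
# This slow step is re-run only when the chunk's code changes
my_lm <- lm(y ~ x, data = df)
```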
In addition to R code, we can also run code such as bash shell script in the R Markdown file by modifying the code chunk header as below:
```{bash, eval = TRUE}
# Here's a simple line of bash as an example
echo "Hi world"
```
Options abound for formatting text in .Rmd documents using Pandoc Markdown, and many of these can be found with a quick search, but we will go over a few basics here.
In-text citations can be made throughout the document that will automatically add a reference to your reference section via the following steps:
First, list your page's .bib file in its YAML header:

---
title: "Instructions for developing new pages for the MarineOmics webpage"
bibliography: common-bib_01.bib
---
After following these steps, you can cite papers in the RMarkdown document. Let's say you have a paper you want to cite by LastName et al. 2023, and the paper's ID in your .bib file is 'LastName2023-xy'. We can reference this paper in-text by adding [@LastName2023-xy] to the end of a sentence. This will produce a reference such as the one below for Tripp et al. 2017.
syntax in .Rmd text (not in a code chunk):

RADseq can produce insightful analyses even in the absence of data, if missing values are handled properly [@Tripp2017-cp].

knitted output:

RADseq can produce insightful analyses even in the absence of data, if missing values are handled properly (Tripp et al. 2017).
If you will be using any R packages or external software for your content, it's a good idea to highlight them toward the beginning of your page, including code for installing them. Here's an example from the RADseq tutorial:

If you would like to run the R code examples that are scattered throughout the guide (recommended but not required!), you will need to install some R packages. You only need to run this code once:
install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# the BiocManager package facilitates the installation of packages from the Bioconductor project
BiocManager::install("SeqArray")
BiocManager::install("SNPRelate")
These external software packages are referenced in the tutorial:

If your page includes any code that takes a long time to run, please set cache = TRUE in the knitr::opts_chunk$set command in your global code chunk.
text in .Rmd (not in a code chunk):

| Term A | Term B | Term C | Term D | Term E | Term F | Term G |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| xxx | xxx | xxx | xxx | xxx | xxx | xxx |
| yyy | yyy | yyy | yyy | yyy | yyy | yyy |
| zzz | zzz | zzz | zzz | zzz | zzz | zzz |
-knitted output:
| Term A | Term B | Term C | Term D | Term E | Term F | Term G |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| xxx | xxx | xxx | xxx | xxx | xxx | xxx |
| yyy | yyy | yyy | yyy | yyy | yyy | yyy |
| zzz | zzz | zzz | zzz | zzz | zzz | zzz |
You can generate a Markdown-formatted table using a free online editor, such as this one. The kableExtra package has a lot of functionality for fancier tables.
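For instance, a minimal kableExtra sketch (the data frame here is only an illustration) could style a table like this:

```r
library(knitr)
library(kableExtra)

# Render a small data frame as a styled HTML table
kable(head(mtcars[, 1:4])) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```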
To insert an image, you need the image file to also be uploaded to the GitHub repository. We store the images hosted on the website in the images/ folder. For any image, please make sure you have permission to share it and that it is appropriately cited.
code chunk in .Rmd:

```{r, echo = TRUE, include = TRUE, out.height = "40%", fig.align = "center", fig.cap="Figure from Rivera et al. 2021"}
knitr::include_graphics("ADMIN_01_submissions_instructions_files/Rivera_etal_fig.png")
```
knitted output: the echoed code above, followed by the embedded figure (from Rivera et al. 2021).
Equations follow LaTeX style formatting. For illustrative purposes, here is a simple and famous equation.

syntax in .Rmd:
$$
E = mc^2
$$

knitted output: \[ E = mc^2 \]
As an example of a more complex equation, here is an equation for the t-statistic:

syntax in .Rmd:

$$
t = \frac{\overline{x_{1}}-\overline{x_{2}}}{\sqrt{s^2\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}}
$$

knitted output: \[ t = \frac{\overline{x_{1}}-\overline{x_{2}}}{\sqrt{s^2\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}} \]
After you have finished editing your R Markdown file, you can "knit" it to an HTML file by selecting the Knit button at the top of RStudio. You can review this HTML file to get an idea of what the page will look like on the website (minus the navigation bar and other site-specific formatting). You will need to upload both the R Markdown file and the HTML file to GitHub prior to your page being added to the website.
If you are interested in submitting new content to the MarineOmics.github.io website, we have a guide for you! While the website managers will work with you to help prepare and publish your new page, this guide can be a useful resource as you develop your content. The guide provides general instructions as well as examples of how to build specific features (e.g., tables, code chunks). Prior to developing new pages, please contact the current Website Manager Jason Johns (jasonjohns[at]ucsb.edu). Do not hesitate to reach out with any other questions, comments, or features you'd like added to the guide.
To facilitate discussions about practices and pitfalls in applying 'omics to non-model systems, we hosted Q&A panel seminars in May-June 2021 with experts on topics ranging from RADseq to population genomics with WGS to transcriptomics. Three to five expert panelists discussed each topic, with questions coming from the working group and the audience. Videos are available on YouTube!
| Date | Topic | Panelists | Video |
| :--- | :--- | :--- | :--- |
| May 7 | GBS/RADseq genotyping pipelines | Melanie LaCava, Alicia Mastretta-Yanes, Isaac Overcast, Jon Puritz | View video |
| May 21 | WGS genotyping (low and high coverage) | Jeffrey Good, Mikhail "Misha" Matz, Noah Reid, Tim Sackton, Nina Overgaard Therkildsen | View video |
| June 4 | Population genomic analyses and filtering data | Zach Gompert, Morgan Kelly, Katie Lotterhos, Jeff Ross-Ibarra | View video |
| June 18 | RNAseq (experimental design through differential expression) | Ana Conesa, Sarah Davies, Martin Hölzer, Joanna Kelley, Harold Pimentel | View video |
The purpose of this tutorial is to provide an example of taking bisulfite-treated DNA sequence data through identification of methylation at the locus and gene level. The principle behind creating bisulfite-treated DNA libraries is that unmethylated cytosines are converted to uracil, which ultimately results in thymine after amplification, while methylated cytosines are protected from conversion. When aligning to a reference genome, the percent methylation at a given CpG locus is determined by examining the ratio of cytosines to thymines. For example, if 30% of the reads aligning to a given CpG locus carry a thymine where the reference has a cytosine, that locus would be considered 70% methylated.
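As a minimal sketch of that arithmetic (the counts here are hypothetical):

```r
# Percent methylation at a CpG locus: reads retaining cytosine (methylated)
# over all reads covering the locus (cytosine + converted thymine)
count_c <- 21  # hypothetical reads showing cytosine
count_t <- 9   # hypothetical reads showing thymine (converted)
percent_methylation <- 100 * count_c / (count_c + count_t)
percent_methylation
#> 70
```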
-For a good review of epigenetic mechanisms including DNA methylation -in marine organisms see Eirin-Lopez and Putnam -(2019). Trigg et al. (2022) offers -a comparison of three methods for quantifying DNA methylation at single -base-pair resolution using whole genome bisulfite sequencing (WGBS), -reduced representation bisulfite sequencing (RRBS), and methyl-CpG -binding domain bisulfite sequencing (MBDBS). The tutorial below is based -on WGBS, though the general workflow would be consistent.
An example of what raw data might look like from a marine invertebrate is shown. Note that DNA methylation patterns are diverse across taxa, with most marine invertebrates demonstrating a mosaic pattern in which global methylation at CpG loci is approximately 15%.
Following trimming, one would expect to see something similar to this. This particular data set was trimmed with the following parameters (note that --fastqc_args takes its FastQC options as a single quoted string):

/gscratch/srlab/programs/TrimGalore-0.4.5/trim_galore \
--output_dir /gscratch/scrubbed/strigg/analyses/20200311/WGBS_MBD \
--paired \
--fastqc_args "--outdir /gscratch/scrubbed/strigg/analyses/20200311/WGBS_MBD/FASTQC --threads 28" \
--illumina \
--clip_R1 10 \
--clip_R2 10 \
--three_prime_clip_R1 10 \
--three_prime_clip_R2 10 \
--path_to_cutadapt /gscratch/srlab/programs/miniconda3/bin/cutadapt \
/gscratch/scrubbed/sr320/froger-raw/00_fastq/Meth17_R1_001.fastq.gz \
The software used as part of this tutorial is the Bismark Bisulfite Mapper. As with any software, it is best to be familiar with the User Manual. The content below is provided with the assumption that the reader has read the manual, and is meant to serve as guidance based on experience working with marine invertebrates.

The first step in the process is preparing the genome. Example code:
# Directories and programs
bismark_dir="/programs/Bismark-0.21.0"
bowtie2_dir="/programs/bowtie2-2.3.4.1-linux-x86_64/"
genome_folder="/where/the/fastafile/lives/"

${bismark_dir}/bismark_genome_preparation \
--verbose \
--parallel 28 \
--path_to_aligner ${bowtie2_dir} \
${genome_folder}
You should expect a prepared genome with new directories including:

./Bisulfite_Genome
./Bisulfite_Genome/GA_conversion
./Bisulfite_Genome/CT_conversion
To align the trimmed reads to the genome, the following code structure is used:
find ${reads_dir}*_R1_001_val_1.fq.gz \
| xargs basename -s _R1_001_val_1.fq.gz | xargs -I{} ${bismark_dir}/bismark \
--path_to_bowtie ${bowtie2_dir} \
-genome ${genome_folder} \
-p 4 \
-score_min L,0,-0.6 \
--non_directional \
-1 ${reads_dir}{}_R1_001_val_1.fq.gz \
-2 ${reads_dir}{}_R2_001_val_2.fq.gz \
-o Mcap_tg
This will create BAM files (sequence alignment files).

Files are then deduplicated if they are whole-genome bisulfite-treated samples. This command will deduplicate the Bismark alignment BAM file and remove all reads but one which align to the very same position and in the same orientation. This step is recommended for whole-genome bisulfite samples, but should not be used for reduced-representation libraries such as RRBS, amplicon, or target enrichment libraries.
find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${bismark_dir}/deduplicate_bismark \
--bam \
--paired \
{}.bam
Methylation levels are then extracted using bismark_methylation_extractor. For example:
${bismark_dir}/bismark_methylation_extractor \
--bedGraph \
--counts \
--comprehensive \
--merge_non_CpG \
--multicore 28 \
--buffer_size 75% \
*deduplicated.bam
This will create a file with the suffix deduplicated.bismark.cov.gz (example shown below in uncompressed format). While a bedGraph file is also generated at this step, this workflow generates a bedGraph later that uses information from merged strands (as opposed to this output).
NC_035784.1 141 141 37.5 3 5
NC_035784.1 142 142 100 2 0
NC_035784.1 155 155 70 7 3
NC_035784.1 156 156 100 2 0
NC_035784.1 291 291 0 0 2
NC_035784.1 292 292 0 0 3
NC_035784.1 313 313 0 0 1
NC_035784.1 314 314 66.6666666666667 2 1
NC_035784.1 470 470 66.6666666666667 4 2
NC_035784.1 611 611 0 0 4
Column organization of the file:

<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
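As a quick sanity check (a sketch; the file name is an example), the methylation percentage in column 4 can be recomputed from the counts in columns 5 and 6:

```bash
# Recompute % methylation from the methylated/unmethylated counts and print
# it next to the reported value; the two percentages should agree
awk '($5 + $6) > 0 { printf "%s\t%s\t%s\t%.2f\n", $1, $2, $4, 100 * $5 / ($5 + $6) }' \
sample.deduplicated.bismark.cov | head
```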
A genome-wide cytosine report is generated, with this code as an example:
find *deduplicated.bismark.cov.gz \
| xargs basename -s _trimmed_bismark_bt2.deduplicated.bismark.cov.gz \
| xargs -I{} ${bismark_dir}/coverage2cytosine \
--genome_folder ${genome_folder} \
-o {} \
--merge_CpG \
--zero_based \
{}_trimmed_bismark_bt2.deduplicated.bismark.cov.gz
This generates a file with the suffix .CpG_report.merged_CpG_evidence.cov:
NC_035785.1 217 219 100.000000 17 0
NC_035785.1 523 525 87.500000 7 1
NC_035785.1 556 558 50.000000 5 5
NC_035785.1 727 729 100.000000 16 0
NC_035785.1 1330 1332 0.000000 0 2
NC_035785.1 1403 1405 0.000000 0 2
NC_035785.1 1494 1496 66.666667 2 1
NC_035785.1 1747 1749 100.000000 8 0
NC_035785.1 2024 2026 100.000000 24 0
NC_035785.1 2054 2056 93.333333 14 1
From here it could be useful to do some simple file reorganization to obtain bedgraph or tab files for downstream analysis (e.g., bedtools, GLMs).
#creating bedgraphs post merge
for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x.bedgraph
done

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4}}' \
  > "${STEM}"_5x.bedgraph
done

#creating tab files with raw count for glms
for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_10x.tab
done

for f in *merged_CpG_evidence.cov
do
  STEM=$(basename "${f}" .CpG_report.merged_CpG_evidence.cov)
  cat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 5) {print $1, $2, $3, $4, $5, $6}}' \
  > "${STEM}"_5x.tab
done
It is sometimes useful to sort BAM files for downstream analysis (e.g., methylKit, IGV):
# Sort files for methylKit and IGV
find *.bam | \
xargs basename -s .bam | \
xargs -I{} ${samtools} \
sort --threads 28 {}.bam \
-o {}.sorted.bam

# Index sorted files for IGV
find *.sorted.bam | \
xargs basename -s .sorted.bam | \
xargs -I{} ${samtools} \
index -@ 28 {}.sorted.bam
Finding your subset of genes of interest
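One common approach is intersecting the filtered bedgraphs with gene coordinates using bedtools (mentioned above); a sketch, where the input file names are hypothetical:

```bash
# Keep CpG loci (with their % methylation) that fall within genes of interest,
# appending the overlapping gene record to each line
bedtools intersect \
-wb \
-a sample_10x.bedgraph \
-b genes_of_interest.gff \
> sample_10x_genes_of_interest.txt
```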
Functional enrichment tests applied to mRNA analyses often impose significant biases (Bleazard, Lamb, and Griffiths-Jones 2015).
```{r setup, include = FALSE}
# Set root directory... in this draft, ensure that you have the same path
knitr::opts_knit$set(root.dir = '~/Documents/GitHub/marineomics.github.io/')
```
Below is an example of what should be the first, visible code chunk:

## Uncomment and run the code below if you believe you may need to install the packages loaded below
# Edit code below to list packages that are used in this draft... at present these are just examples

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("DESeq2","edgeR","arrayQualityMetrics"))

# Load packages
library(tidyverse)
Below are the markdown-formatted references for our document, derived from the .bib file.
Studies of molecular responses to environmental change increasingly employ multifactorial experimental designs incorporating different developmental stages, stressors, populations, or non-linear dynamics that resolve interactions between variables that shape expression, physiology, and performance. After an experiment during which a study organism was exposed to abiotic variation and perhaps other variables, researchers interested in how gene expression correlates with these variables will often extract RNA across treatment groups, prepare cDNA libraries for sequencing, and conduct an analysis of differential expression (DE) across variables of interest after mapping and counting RNA-seq reads. DE analysis is achieved by fitting RNA-seq read count data to statistical models, often using tried-and-true DE packages.

Most packages do not have the flexibility to accommodate a number of common experimental design components, and it can be difficult to navigate their functionality for fitting multivariate models. When these packages fall short, determining how to fit customized multifactorial models yourself can be an additional challenge. Many great walkthroughs exist for conducting DE analysis using single-variable experimental designs, but there is a lack of guidance on multifactorial approaches for testing for DE and modelling gene expression. Regardless of whether you have RNA-seq data on hand or you're in the process of planning your experiment, this walkthrough will help you navigate what DE packages and model fitting approaches are best suited to your experimental design.
-This walkthrough is for both intermediate and expert practitioners of -RNA-seq analyses, and most of all, those new to multifactorial RNA-seq. -We have written this page with the assumption that readers have a basic -familiarity with model fitting and statistical analysis, as well as next -generation sequencing technologies, RNA-seq workflows, and the R -statistical environment.
This walkthrough can be broken down into different types of predictor variables that can be incorporated in models of gene expression. In the order we address them, these variables include continuous fixed effects, random effects (intercepts and slopes), interactive effects, and non-linear effects.
Two important considerations when choosing a DE package to analyze RNA-seq data are the distribution family used by the package and the effect types it can incorporate into a model. The DE packages we will cover assume the following read count distributions and are capable of fitting the following effect types:
| Program | Distribution | Continuous fixed eff. | Random intercepts | Random slopes | Interactive eff. | Non-linear eff. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| edgeR | Negative binomial generalized linear model | ✔ | ✖ | ✖ | ✔ | ✔ |
| DESeq2 | Negative binomial generalized linear model | ✔ | ✖ | ✖ | ✔ | ✔ |
| Voom | Mean-variance linear model | ✔ | ✔ | ✖ | ✔ | ✔ |
We will walk you through (i) how to conduct essential checks of your -data such as principal components analysis and observing the -distribution of filtered read counts, (ii) evaluating each package’s -functionality for multifactorial model fitting, and (iii) how to fit -each type of predictor by performing differential expression tests using -real data. These data come from an experiment that exposed the Antarctic -pteropod (a free-swimming planktonic snail) Limacina helicina -antarctica to three continuous levels of ocean acidification (OA) -for either 0.5 or 7 days (Bogan, Johnson, and -Hofmann 2020). These RNA-seq counts are derived from reads mapped -and counted via RSEM (Li and Dewey 2011) -using a de novo reference transcriptome that includes -transcript isoforms assembled by Johnson and -Hofmann (2016).
Throughout this walkthrough, we will provide examples of custom scripts for fitting linear models, linear mixed models, and generalized linear models to read count data using for loops, in order to provide functionality where it cannot be met by most differential expression packages.
-At the end of the walkthrough, we demonstrate how to compare model -predictions by different packages for different effect types commonly -used in multifactorial RNAseq analyses.
-Let’s start by loading our required packages…
## Uncomment and run the code below if you believe you may need to install the packages loaded below
-
-#invisible(lapply(c( "tidyverse", "ape", "vegan", "GGally",
- #"rgl", "adegenet", "MASS",
- #"data.table", "plyr", "lmtest", "reshape2", "Rmisc", "lmerTest","statmod"),
- #function(p){
- #if(! p %in% rownames(installed.packages())) {
- #install.packages(p)
- #}
- #library(p, character.only=TRUE)
- #}))
-
-#if (!require("BiocManager", quietly = TRUE))
- #install.packages("BiocManager")
-
-#BiocManager::install(c("DESeq2","edgeR","arrayQualityMetrics"))
-
-# Load packages
-library(DESeq2)
-library(edgeR)
-library(tidyverse)
-library(ape)
-library(vegan)
-library(GGally)
-library(arrayQualityMetrics)
-library(rgl)
-library(adegenet)
-library(MASS)
-library(data.table)
-library(plyr)
-library(lmtest)
-library(reshape2)
-library(Rmisc)
-library(lmerTest)
Before model fitting and testing for DE, it is important to visually inspect read counts pre- and post-filtering. In this walkthrough, we will plot the distributions of read counts and sample loadings on multidimensional scaling axes (e.g., plotting principal components or principal coordinates analyses). Looking at these plots, we can (i) determine whether the distribution of our data matches the assumptions of model families used by different packages and (ii) determine whether one or multiple variables within a dataset explain a significant degree of variation in gene expression across replicates. Depending on the distribution of your data and the variables that appear important (or variables you would like to incorporate into your models), we will then demonstrate how to choose an appropriate package to test for DE.
Most DE packages assume that read counts possess a negative binomial distribution. The negative binomial distribution is an extension of the Poisson distribution for count data that allows for "overdispersion": greater-than-expected variation in expression attributed to biological variability (a Poisson model assumes "equidispersion", with variance equal to the mean). However, RNA-seq datasets can exhibit poor fits with the negative binomial distribution (Hawinkel et al. 2020). It is well worth visualizing and testing the distributions of RNA-seq read counts before and after filtering. Keep in mind that the distribution applies not to the raw data, but to the residuals of a model. Additionally, individual models are fit to the read counts of each gene. Thus, the most accurate representation of how appropriate a negative binomial distribution is for read counts would come from plotting the distribution of residuals across all genes following model fitting. Below is a visual example of negative binomial distributions that possess different means and variances.
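If you would like to generate such a plot yourself, here is a minimal sketch simulating negative binomial counts with the same mean but different dispersions (all values are illustrative):

```r
# rnbinom's 'size' is the inverse of the dispersion: smaller size = more spread
set.seed(1)
counts_low_disp  <- rnbinom(10000, mu = 100, size = 10)
counts_high_disp <- rnbinom(10000, mu = 100, size = 1)

hist(counts_low_disp, breaks = 50, col = rgb(0, 0, 1, 0.4), xlim = c(0, 600),
     main = "Negative binomial counts: equal means, different dispersions",
     xlab = "Read counts")
hist(counts_high_disp, breaks = 100, col = rgb(1, 0, 1, 0.4), add = TRUE)
```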
Now let's first plot the distribution of unfiltered reads from our dataset:
-# Read in matrix of RSEM expected read counts
-data <- read.delim("FUN_01_DGE_comparison_v2/RNA_data/GE2_expected_counts_gene.matrix", sep = "\t", header = T, row.names = 1)
-
# Peek at the data to check that it looks okay
head(data)
-## B7a.genes.results B7b.genes.results B7c.genes.results
-## TR100110|c0_g1_i1 4.00 10.00 3.00
-## TR101578|c0_g1_i1 0.00 0.00 0.00
-## TR1047|c0_g1_i1 14.00 19.00 11.00
-## TR105096|c0_g1_i1 14.00 12.00 9.00
-## TR107626|c1_g1_i1 4530.93 28139.91 14141.83
-## TR11301|c0_g1_i1 0.00 0.00 0.00
-## B12a.genes.results B12b.genes.results B12c.genes.results
-## TR100110|c0_g1_i1 6.00 9.0 8.00
-## TR101578|c0_g1_i1 0.00 0.0 0.00
-## TR1047|c0_g1_i1 15.00 14.0 9.00
-## TR105096|c0_g1_i1 15.00 5.0 10.00
-## TR107626|c1_g1_i1 80846.09 5649.4 30756.08
-## TR11301|c0_g1_i1 0.00 0.0 0.00
-## R7a.genes.results R7b.genes.results R7c.genes.results
-## TR100110|c0_g1_i1 18.0 15.00 6.00
-## TR101578|c0_g1_i1 0.0 0.00 0.00
-## TR1047|c0_g1_i1 16.0 16.00 27.00
-## TR105096|c0_g1_i1 25.0 15.00 6.00
-## TR107626|c1_g1_i1 137592.5 45299.57 14843.47
-## TR11301|c0_g1_i1 0.0 0.00 0.00
-## R12a.genes.results R12b.genes.results R12c.genes.results
-## TR100110|c0_g1_i1 11.00 13.0 9
-## TR101578|c0_g1_i1 0.00 0.0 0
-## TR1047|c0_g1_i1 23.00 14.0 11
-## TR105096|c0_g1_i1 14.00 45.0 12
-## TR107626|c1_g1_i1 21210.52 230436.5 0
-## TR11301|c0_g1_i1 0.00 0.0 0
-## Y7a.genes.results Y7b.genes.results Y7c.genes.results
-## TR100110|c0_g1_i1 6 4 12.0
-## TR101578|c0_g1_i1 0 0 0.0
-## TR1047|c0_g1_i1 20 3 18.0
-## TR105096|c0_g1_i1 9 1 13.0
-## TR107626|c1_g1_i1 0 0 112214.6
-## TR11301|c0_g1_i1 0 0 0.0
-## Y12a.genes.results Y12b.genes.results Y12c.genes.results
-## TR100110|c0_g1_i1 21.00 12 12.00
-## TR101578|c0_g1_i1 0.00 0 0.00
-## TR1047|c0_g1_i1 25.00 14 9.00
-## TR105096|c0_g1_i1 14.00 19 24.00
-## TR107626|c1_g1_i1 86243.14 148100 74153.54
-## TR11301|c0_g1_i1 0.00 0 0.00
-The table of read counts above has rows representing individual -transcripts and columns representing different replicate samples across -treatments.
# Name experimental samples: names correspond to pCO2 treatment (300, 600, 900) + days (12 h vs 7 days) + replicate ID
-colnames(data) <- c("300.7.a", "300.7.b", "300.7.c","300.12.a", "300.12.b", "300.12.c",
- "900.7.a", "900.7.b", "900.7.c", "900.12.a", "900.12.b", "900.12.c",
- "600.7.a", "600.7.b", "600.7.c", "600.12.a", "600.12.b", "600.12.c")
-
## Create 'targets' and 'Group' dataframes, expressing experimental variables for DEG analysis
-pCO2 <- as.numeric(c( 255, 255, 255, 255, 255, 255,
- 530, 530, 530, 530, 530, 530,
- 918, 918, 918, 918, 918, 918))
-
-treatment <- c("B","B", "B", "B", "B", "B",
- "R", "R", "R", "R", "R", "R",
- "Y", "Y", "Y", "Y", "Y", "Y")
-
-day <- as.numeric(c(7, 7, 7, .5, .5, .5,
- 7, 7, 7, .5, .5, .5,
- 7, 7, 7, .5, .5, .5))
-
-targets <- data.frame(pCO2, day, treatment)
-targets$grouping <- paste(targets$pCO2, targets$day, sep = ".")
-
-# The group factor represents the combined levels of an experimental replicate across all variables
-Group <- factor(paste(targets$day, targets$pCO2, sep = "_"))
-
-# Data must be rounded to nearest integer in order to be fit for negative binomial distribution
-data_input <- round(data)
-
# Peek at the rounded data
head(data_input)
-## 300.7.a 300.7.b 300.7.c 300.12.a 300.12.b 300.12.c 900.7.a
-## TR100110|c0_g1_i1 4 10 3 6 9 8 18
-## TR101578|c0_g1_i1 0 0 0 0 0 0 0
-## TR1047|c0_g1_i1 14 19 11 15 14 9 16
-## TR105096|c0_g1_i1 14 12 9 15 5 10 25
-## TR107626|c1_g1_i1 4531 28140 14142 80846 5649 30756 137592
-## TR11301|c0_g1_i1 0 0 0 0 0 0 0
-## 900.7.b 900.7.c 900.12.a 900.12.b 900.12.c 600.7.a 600.7.b
-## TR100110|c0_g1_i1 15 6 11 13 9 6 4
-## TR101578|c0_g1_i1 0 0 0 0 0 0 0
-## TR1047|c0_g1_i1 16 27 23 14 11 20 3
-## TR105096|c0_g1_i1 15 6 14 45 12 9 1
-## TR107626|c1_g1_i1 45300 14843 21211 230436 0 0 0
-## TR11301|c0_g1_i1 0 0 0 0 0 0 0
-## 600.7.c 600.12.a 600.12.b 600.12.c
-## TR100110|c0_g1_i1 12 21 12 12
-## TR101578|c0_g1_i1 0 0 0 0
-## TR1047|c0_g1_i1 18 25 14 9
-## TR105096|c0_g1_i1 13 14 19 24
-## TR107626|c1_g1_i1 112215 86243 148100 74154
-## TR11301|c0_g1_i1 0 0 0 0
-# Plot distribution of unfiltered read counts across all samples
-ggplot(data = data.frame(rowMeans(data_input)),
- aes(x = rowMeans.data_input.)) +
- geom_histogram(fill = "grey") +
- xlim(0, 500) +
- theme_classic() +
- labs(title = "Distribution of unfiltered reads") +
- labs(y = "Density", x = "Raw read counts",
- title = "Read count distribution: untransformed, unnormalized, unfiltered")
-
As you can see in the above plot, the raw distribution of all read counts takes on a right-skewed negative binomial distribution similar to the purple distribution in the example above. While looking at the distribution of your raw reads is useful, these are not in fact the data you will be inputting to tests of differential expression. Let's plot the distribution of filtered reads normalized by library size, expressed as log2 counts per million reads (logCPM). These are the reads we will use in our test, and after we plot their distribution, we will conduct one more, slightly more robust, test of our data's fit to the negative binomial distribution using residuals from fitted models.
-# Make a DGEList object for edgeR
-y <- DGEList(counts = data_input, remove.zeros = TRUE)
-
# Keep genes with > 0.5 CPM (~10 counts in the count file) in at least 9 samples
keep <- rowSums(cpm(y) > .5) >= 9
-
-table(keep)
-## keep
-## FALSE TRUE
-## 18871 62579
-# Set keep.lib.sizes = F and recalculate library sizes after filtering
-y <- y[keep, keep.lib.sizes = FALSE]
-
-y <- calcNormFactors(y)
-
-# Calculate logCPM
-df_log <- cpm(y, log = TRUE, prior.count = 2)
-
-# Plot distribution of filtered logCPM values
-ggplot(data = data.frame(rowMeans(df_log)),
- aes(x = rowMeans.df_log.) ) +
- geom_histogram(fill = "grey") +
- theme_classic() +
- labs(y = "Density", x = "Filtered read counts (logCPM)",
- title = "Distribution of normalized, filtered read counts")
-
-Our raw data appear to follow a strong negative binomial -distribution! Later, we will see whether this holds for residuals from -fitted negative binomial GLMs.
-Before analyzing our data, it is essential that we look at the -multivariate relationships between our samples based on -transcriptome-wide expression levels. Below is example code and output -for a principal coordinates analysis (PCOA) plot that visualizes -multifactorial RNA-seq replicates according to two predictor variables -across major and minor latent variables or PCOA axes. These predictor -variables, as discussed above, are pCO\({_2}\) and time-in-treatment.
-# Export pcoa loadings
-dds.pcoa = pcoa(vegdist(t(df_log <- cpm(y, log = TRUE, prior.count = 2)),
- method = "euclidean") / 1000)
-
-# Create df of MDS vector loading
-scores <- dds.pcoa$vectors
-
## Plot pcoa loadings of each sample, grouped by time point and pCO2 treatment
-
-# Calculate % variation explained by each eigenvector
-percent <- dds.pcoa$values$Eigenvalues
-cumulative_percent_variance <- (percent / sum( percent)) * 100
-
-# Prepare information for pcoa plot, then plot
-color <- c("steelblue1", "tomato1", "goldenrod1")
-par(mfrow = c(1, 1))
-plot(
- scores[, 1],
- scores[, 2],
- cex = .5,
- cex.axis = 1,
- cex.lab = 1.25,
- xlab = paste("PC1, ", round(cumulative_percent_variance[1], 2), "%"),
- ylab = paste("PC2, ", round(cumulative_percent_variance[2], 2), "%")
- )
-
-# Add visual groupings to pcoa plot
-ordihull(
- scores,
- as.factor(targets$treatment),
- border = NULL,
- lty = 2,
- lwd = .5,
- label = F,
- col = color,
- draw = "polygon",
- alpha = 100,
- cex = .5
- )
-
-ordispider(scores, as.factor(targets$grouping), label = F) # Vectors connecting samples in same pCO2 x time group
-
-ordilabel(scores, cex = 0.5) # Label sample IDs
-
-In the PCOA above, blue = low p\(CO_{2}\), yellow = moderate p\(CO_{2}\), and red = high p\(CO_{2}\). From this plot, we can see that -treatment and time both influence multivariate gene expression across -the RNA-seq samples. By and large, samples cluster according to these -two predictors in a manner consistent with what we would expect from our -experimental design if p\(CO_{2}\) and time were to affect gene -expression.
-logCPM.pca <- prcomp(t (df_log))
-logCPM.pca.proportionvariances <-
-((logCPM.pca$sdev ^ 2) / (sum(logCPM.pca$sdev ^ 2))) * 100
-
-## Do treatment groups fully segregate? Wrap samples by pCO2 x time, not just pCO2
-# Replot using logCPM.pca
-plot(
- logCPM.pca$x,
- type = "n",
- main = NA,
- xlab = paste("PC1, ", round(logCPM.pca.proportionvariances[1], 2), "%"),
- ylab = paste("PC2, ", round(logCPM.pca.proportionvariances[2], 2), "%")
- )
-
-points(logCPM.pca$x,
- col = "black",
- pch = 16,
- cex = 1)
- colors2 <-
- c("steelblue1",
- "dodgerblue2",
- "tomato1",
- "coral",
- "goldenrod1",
- "goldenrod3")
-
- ordihull(
- logCPM.pca$x,
- targets$grouping,
- border = NULL,
- lty = 2,
- lwd = .5,
- col = colors2,
- draw = "polygon",
- alpha = 75,
- cex = .5,
- label = T
- )
-
-This plot is only a slight adjustment of the first PCOA. If you’d -like to wrap each multifactorial group and get a better idea of -potential overlap between clusters, the plot and code above can be -applied to do so.
-Now that we have conducted QC analyses and plots of the read count -data, it is time to begin model fitting and testing for differential -expression.
-Gene expression traits can exhibit non-linear performance curves -across a continuous variable much like other physiological traits. -Studies that incorporate more than two levels of a continuous predictor -provide an opportunity to test for non-linear effects on gene -expression. Similar to performance traits such as growth or metabolic -rates, gene expression can be modelled across a continuous variable by -fitting non-linear ‘performance curves’ to read count data. Rivera -et al. 2021 recently discussed the benefits of fitting -non-linear reaction norms to gene expression data in order to better -model acclimation by tolerant and sensitive genotypes to an -environmental stressor. In a conceptual figure of theirs, shown below, -they depict non-linear variation in expression before, during, and after -exposure to a stressor for four reaction norm phenotypes: (i) a tolerant -conspecific that frontloads (e.g., constitutively upregulates) -stress-induced transcripts, (ii) a tolerant conspecific that exhibits -reduced overall expression of stress-responsive transcripts, (iii) a -sensitive conspecific that upregulates inducible transcripts and cannot -recover post-stress, and (iv) a resilient conspecific that mounts a -transcriptional stress response followed by recovery and downregulation -(Rivera et al. 2021).
--From Rivera et al. 2021 - “Transcriptomic resilience and -timing. (a) Gene expression reaction norms of four strategies during -recovery after a stressor. We use triangles again for patterns that may -confer tolerance and circles for patterns associated with stress -sensitivity. While all triangle paths show a return to baseline -(resilience) the pink (frontloading) and yellow (dampening) are also -depicting differences in baseline and plasticity and are therefore -labelled differently. (b) Adapted from the rolling ball analogy commonly -used for ecological resilience and depicted in Hodgson et al. (2015). -Each ball represents a gene showing a color-matched expression pattern -in (a). Landscapes represent expression possibilities during a stress -event. In the absence of stress, the ball will settle in a trough, -representing baseline expression levels. Elasticity (rate of return to -the baseline) is represented by the size of the arrow (i.e., larger -arrows have faster rates of return). Pink dotted line is the expression -landscape for the frontloaded ball. (c) Using Torres et al. (2016) loops -through disease space as an alternative framework of an organism’s path -through stress response and recovery. The color gradient represents the -resulting phenotype for a given path through stress and recovery space, -though x-and y-axis can denote any two parameters that are correlated -but with a time lag.”
-Non-linear reaction norms can be modelled across time, such as Rivera -et al. have presented, or across multiple levels of an abiotic -variable such as temperature, pH, etc. in order to fit a performance -curve. Performance curves are a fundamental tool in ecological -physiology, and enable more robust hypothesis testing in RNA-seq studies -of environmental acclimation. For example, testing for variation in the -shape of gene expression performance curves between early and long-term -exposure timepoints can provide critical information about the role of -gene expression in acclimation. Below we outline how to fit and test for -non-linear gene expression performance curves in multifactorial RNA-seq -experiments using examples in edgeR, DESeq2, and custom code.
One of the simplest non-linear relationships that can be fitted to the expression of a transcript across a continuous variable is a second-order polynomial, otherwise known as a quadratic function, which can be expressed as \(y_{i} = \mu + \beta_{1}x^2 + \beta_{2}x\) where \(y\) is the abundance of a given transcript (\(i\)), \(\mu\) is the intercept, and \(x\) is the continuous variable. For the parabola generated by fitting a second-order polynomial, \(\beta_{1}\) > 0 opens the parabola upwards while \(\beta_{1}\) < 0 opens the parabola downwards. The vertex of the parabola is controlled by \(\beta_{2}\) such that when \(\beta_{1}\) is negative, greater \(\beta_{2}\) values result in the vertex falling at higher values of \(x\).
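The vertex itself falls at \(x = -\beta_{2}/(2\beta_{1})\); a tiny sketch (the coefficient values are illustrative, similar in magnitude to those fitted later in this walkthrough):

```r
# Vertex of y = mu + b1*x^2 + b2*x; with b1 < 0 the parabola opens downward,
# so the vertex is the pCO2 at which predicted expression peaks
b1 <- -3.8e-05  # illustrative beta_1 (curvature)
b2 <-  4.0e-02  # illustrative beta_2 (slope at x = 0)
vertex_x <- -b2 / (2 * b1)
vertex_x
#> 526.3158
```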
Quadratic polynomials applied to phenotypic performance curves commonly possess negative \(\beta_{1}\) values with positive \(\beta_{2}\) values: a downward-opening parabola with a positive vertex. However, quadratic curves fitted to gene expression data can take on a variety of positive or negative forms similar to exponential curves, saturating curves, and parabolas. For instance, the expression of a gene may peak at an intermediate level of an environmental variable before crashing, or it may exponentially decline across that variable. Such trends may better model changes in the transcription of a gene compared to a linear model. To get started, we will fit non-linear second-order polynomials before testing whether model predictions for a given gene are significantly improved by a non-linear model.
Let's fit a second-order polynomial for the effect of pCO\(_{2}\) using edgeR. Using differential expression tests, we will then determine whether pCO\(_{2}\) affected a gene's rate of change in expression and expression maximum by applying differential expression tests to the \(\beta_{1}\) and \(\beta_{2}\) parameters. By testing for differential expression attributed to interactions between time and \(\beta_{1}\) or \(\beta_{2}\), we will then test whether these parameters differed across exposure times, i.e., whether 0.5 days versus 7 days of acclimation to pCO\(_{2}\) altered the rate of change in expression across pCO\(_{2}\) (\(\beta_{1}\)) or the maximum of expression (\(\beta_{2}\)).
-Since this is the first time we’re fitting models to our data in this -walkthrough, we will also output some important plots to help us explore -and QC model predictions as we work toward testing for differential -expression attributed to non-linear effects.
# Square pCO2 variable
pCO2_2 <- pCO2 ^ 2

# Estimate dispersion coefficients
y1 <- estimateDisp(y, robust = TRUE) # Estimate common, trended, and tagwise dispersion

# Plot tagwise dispersion (BCV) with the trended and common estimates overlaid
plotBCV(y1)
-
The above figure is an output of the plotBCV() function in edgeR, which visualises the biological coefficient of variation (the square root of the dispersion) fitted to individual genes (black circles), fitted across gene expression level (blue trendline), and averaged across the entire transcriptome (red horizontal line). The dispersion (symbolized as \(\phi\) in edgeR's model specification) is a necessary parameter that must be estimated in negative binomial GLMs. It represents the 'overdispersion', or the shape, of a negative binomial distribution and is thus important for defining the distribution of read count data for a given gene during model fitting. Dispersion can be input during model fitting in edgeR using the three estimates visualized in the above regression.
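The three estimates can be pulled directly from the estimateDisp() output if you want to inspect them (a sketch continuing from the y1 object above):

```r
# Dispersion estimates underlying plotBCV(); BCV is their square root
y1$common.dispersion         # single transcriptome-wide value (red line)
head(y1$trended.dispersion)  # abundance-dependent trend (blue line)
head(y1$tagwise.dispersion)  # per-gene estimates (black points)
sqrt(y1$common.dispersion)   # the common BCV itself
```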
-Now that we’ve estimated dispersion, we will use these estimates to -fit negative binomial GLMs to our read count data using the glmQLFit() -function in edgeR, one of the more robust model fitting functions -offered by edgeR.
-# Fit multifactorial design matrix
-design_nl <-
- model.matrix(~ 1 + pCO2_2 + pCO2 + pCO2_2:day + pCO2:day) # Generate multivariate edgeR glm
-
-# Fit quasi-likelihood, neg binom linear regression
-nl_fit <-
- glmQLFit(y1, design_nl) # Fit multivariate model to counts
-
-## Test for effect of pCO2 and pCO2^2
-nl_pCO2_2 <-
- glmQLFTest(nl_fit,
- coef = 2,
- contrast = NULL,
- poisson.bound = FALSE) # Estimate significant DEGs
-nl_pCO2 <-
- glmQLFTest(nl_fit,
- coef = 3,
- contrast = NULL,
- poisson.bound = FALSE) # Estimate significant DEGs
-
-# Make contrasts
-is.de_nl_pCO2 <-
- decideTestsDGE(nl_pCO2, adjust.method = "fdr", p.value = 0.05)
-is.de_nl_pCO2_2 <-
- decideTestsDGE(nl_pCO2_2, adjust.method = "fdr", p.value = 0.05)
-
-# Summarize differential expression attributed to pCO2 and pCO2^2
-summary(is.de_nl_pCO2)
-## pCO2
-## Down 14166
-## NotSig 35037
-## Up 13376
-summary(is.de_nl_pCO2_2)
-## pCO2_2
-## Down 11091
-## NotSig 39573
-## Up 11915
-We have just fit our first GLM to our read count data and have tested -for differential expression across pCO\(_{2}\). At this stage, it is important to -output a few diagnostic plots. For example, edgeR has the function -‘plotMD()’ which, when input with a differential expression test object -such as a glmQLFTest result, will produce a plot of differential -expression log2 fold change values across gene expression level (log2 -counts per million or logCPM). logCPM is a major component of gene -function and statistical power, and is a useful variable to plot in -order to make initial assessments of differential expression -results.
-Visualizations of differential expression (logFC) across baseline -logCPM can be produced by the plotMD() (short for “mean-difference”) -function in edgeR. To do so, plotMD() is input with the results of -glmQLFTest, which we used above to test for differential expression -attributable to the parameters pCO\(_{2}\) and pCO\(_{2}\)^2. Let’s produce two such plots for -both parameters where “pCO2_2” represents the pCO\(_{2}\)^2 parameter:
-# Plot differential expression due to pCO2 and pCO2^2
-plotMD(nl_pCO2)
-
-plotMD(nl_pCO2_2)
-
In the two mean-difference plots above, large colored points represent genes exhibiting significant differential expression attributed to pCO\(_{2}\) or pCO\(_{2}\)^2. As you can see, there is a lot of differential expression in this experiment resulting from both of these parameters. Part of the reason for this large number of significant differentially expressed genes (DEGs) is that we have not applied a cutoff value for logFC. For example, a logFC cutoff of 2.0 would not designate any gene with an absolute logFC value less than 2.0 as a significant DEG. However, deciding on a logFC cutoff is very tricky! Since we used a continuous predictor for pCO\(_{2}\) during model fitting, the values in the y-axis of this plot are slopes representing the rate of change in expression per unit pCO\(_{2}\) (\(\mu\)atm). Determining what slope represents a significant change in expression requires informed reasoning and a strong body of prior data. For that reason, we are not applying a logFC cutoff in this walkthrough.
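If a threshold were justified in your system, note that edgeR's glmTreat() tests against a minimum logFC rather than filtering DEGs post hoc. A sketch (the lfc value is purely illustrative, and with a continuous predictor it applies to the per-unit slope):

```r
# Test H0: |logFC per unit pCO2| <= log2(1.01) instead of H0: logFC = 0
nl_pCO2_treat <- glmTreat(nl_fit, coef = 3, lfc = log2(1.01))
summary(decideTestsDGE(nl_pCO2_treat, adjust.method = "fdr", p.value = 0.05))
```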
-Next, we will use glmQLFTest() to test for differential expression -attributed to interactions between time and pCO\(_{2}\) or time and pCO\(_{2}\)^2 before producing mean-difference -plots visualizing DEGs derived from these two interactions:
-## Test for interactions between time and pCO2 or pCO2^2
-nl_pCO2_int <-
- glmQLFTest(nl_fit,
- coef = 4,
- contrast = NULL,
- poisson.bound = FALSE) # Estimate significant DEGs
-nl_pCO2_2_int <-
- glmQLFTest(nl_fit,
- coef = 5,
- contrast = NULL,
- poisson.bound = FALSE) # Estimate significant DEGs
-
-# Make contrasts
-is.de_nl_pCO2_int <-
- decideTestsDGE(nl_pCO2_int, adjust.method = "fdr", p.value = 0.05) # Make contrasts
-is.de_nl_pCO2_2_int <-
- decideTestsDGE(nl_pCO2_2_int, adjust.method = "fdr", p.value = 0.05)
-
-# Summarize differential expression attributed to pCO2 and pCO2^2
-summary(is.de_nl_pCO2_int)
-## pCO2_2:day
-## Down 2
-## NotSig 62573
-## Up 4
-summary(is.de_nl_pCO2_2_int)
-## pCO2:day
-## Down 390
-## NotSig 61918
-## Up 271
-# Plot differential expression due to pCO2 and pCO2^2
-plotMD(nl_pCO2_int)
-
-plotMD(nl_pCO2_2_int)
-
In the mean-difference plots above, it is clear that there is remarkably less differential expression associated with interactions between time and pCO\(_{2}\) or pCO\(_{2}\)^2 compared to the direct effects of pCO\(_{2}\) and pCO\(_{2}\)^2. This is expected, as the statistical power for detecting such interactions is lower, and there are grounds to expect that interactions would be less frequent during our experiment.
Taking a detour from tests for DEGs attributed to non-linear interactions, we can now output the residuals of our GLMs in order to better test whether our data fit the assumptions of the negative binomial distribution family. If the distribution of residuals from our GLMs is approximately normal, this indicates that our data meet the assumptions of the negative binomial model. We will use the equation for estimating Pearson residuals: \[ \text{residual} = \frac{\text{observed} - \text{fitted}}{\sqrt{\text{fitted} \times (1 + \text{dispersion} \times \text{fitted})}} \]
-# Output observed
-y_nl <- nl_fit$counts
-
-# Output fitted
-mu_nl <- nl_fit$fitted.values
-
-# Output dispersion or coefficient of variation
-phi_nl <- nl_fit$dispersion
-
-# Calculate denominator
-v_nl <- mu_nl*(1+phi_nl*mu_nl)
-
-# Calculate Pearson residual
-resid.pearson <- (y_nl-mu_nl) / sqrt(v_nl)
-
-# Plot distribution of Pearson residuals
-ggplot(data = melt(as.data.frame(resid.pearson)), aes(x = value)) +
- geom_histogram(fill = "grey") +
- xlim(-2.5, 5.0) +
- theme_classic() +
- labs(title = "Distribution of negative binomial GLM residuals",
- x = "Pearson residuals",
- y = "Density")
-
-Our residuals appear to be normally distributed, indicating that our -data fit the negative binomial distribution assumed by the GLM.
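As one quick complementary check (a sketch using the resid.pearson object computed above), a normal Q-Q plot should fall near the reference line if the fit is adequate:

```r
# Q-Q plot of Pearson residuals against a normal distribution
resid_vec <- as.vector(resid.pearson)
qqnorm(resid_vec, cex = 0.3, main = "Q-Q plot of Pearson residuals")
qqline(resid_vec, col = "red")
```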
-Now let’s get back to analyzing non-linear effects on gene -expression!
A variety of non-linear patterns may underlie the significant effects we've observed above. For example, interactions between the vertex of expression and time may be attributed to variation in expression peaks across convex curves, or to the rate of decline in an exponentially decreasing curve, among other patterns. Below we will assign transcripts exhibiting significant non-linear variation across pCO\(_{2}\) to different groups based on the values of parameters such as \(\beta_{1}\) and \(\beta_{2}\). Then we will plot the expression of these grouped transcripts across pCO\(_{2}\) and time to visualize different sources of non-linear variation.
-## Bin transcripts based on (i) whether they have a significant positive or negative vertex and then (ii) whether they showed significant interactions between beta1 (vertex value) and time.
-
-# Export diff expression data for transcripts with significant DE associated with PCO2^2 parameter
-nl_pCO2_2_sig <-
- topTags(
- nl_pCO2_2,
- n = (11091 + 11915),
- adjust.method = "BH",
- p.value = 0.05
- )
-nl_pCO2_2_sig_geneids <- row.names(nl_pCO2_2_sig) #Output a list of geneids associated with sig PCO2^2 effect
-
-nl_pCO2_sig <-
- topTags(
- nl_pCO2,
- n = (14166 + 13376),
- adjust.method = "BH",
- p.value = 0.05
- )
-nl_pCO2_sig_geneids <- row.names(nl_pCO2_sig) #Output a list of geneids associated with sig PCO2 effect
-
-# Create tabulated dataframe of mean expression across each pCO2 level with metadata for transcript ID and timepoint
-logCPM_df <- as.data.frame(df_log)
-
-# Create tabularized df containing all replicates using 'melt' function in reshape2
-logCPM_df$geneid <- row.names(logCPM_df)
-tab_exp_df <- melt(logCPM_df,
- id = c("geneid"))
-
-# Add covariate information for time and pCO2
-tab_exp_df$pCO2 <- substr(tab_exp_df$variable, 1, 3)
-tab_exp_df$time <- as.numeric(substr(tab_exp_df$variable, 5, 5))
-
-# Correct pCO2s to exact values
-tab_exp_df$pCO2 <- as.numeric(ifelse(
- tab_exp_df$pCO2 == "300",
- 255,
- ifelse(tab_exp_df$pCO2 == "900", 930,
- 518)
- ))
-
-# Correct time to exact values
-tab_exp_df$time <- as.numeric(ifelse(tab_exp_df$time == "1", 0.5, 7))
-
-# Create binary variable in df_all_log for significant non-linear expression
-tab_exp_df$pCO2_2_sig <-
- ifelse(tab_exp_df$geneid %in% nl_pCO2_2_sig_geneids, "Yes", "No")
-tab_exp_df$pCO2_sig <-
- ifelse(tab_exp_df$geneid %in% nl_pCO2_sig_geneids, "Yes", "No")
-
-# Create a binary variable related to up or down-regulation
-up_genes <- filter(nl_pCO2_sig$table, logFC > 0)
-tab_exp_df$logFC_dir <-
- ifelse(tab_exp_df$geneid %in% row.names(up_genes), "Up", "Down")
-
-# Add geneid to nl_pCO2_int$coefficients
-nl_pCO2_int$coefficients$geneid <- row.names(nl_pCO2_int$coefficients)
-
-# Estimate average logCPM per gene per timepoint
-tab_exp_avg <- summarySE(
- measurevar = "value",
- groupvars = c("pCO2", "time", "geneid",
- "pCO2_sig", "pCO2_2_sig", "logFC_dir"),
- data = tab_exp_df
- )
-
-# First exploratory plot of non-linear expression grouping by exposure time and direction of differential expression
-ggplot(data = filter(tab_exp_avg, pCO2_2_sig == "Yes"),
- aes(x = pCO2, y = value)) +
- geom_path(
- alpha = 0.01,
- size = 0.25,
- stat = "identity",
- aes(group = as.factor(geneid))) +
- facet_grid(logFC_dir ~ time) +
- theme_classic() +
- theme(strip.background = element_blank()) +
- labs(y = "Avg logCPM", title = "Non-linear changes in GE output by edgeR")
-
Our plot of gene expression across tens of thousands of transcripts exhibiting significant non-linear changes across pCO\(_{2}\) appears to include many trends that appear… well, linear. This is a pervasive issue in modeling non-linear regressions, and one potential pitfall of using outputs from packages such as edgeR or DESeq2 alone in testing for non-linear effects. The FDR-adjusted \(p\)-values we have used to determine significance of non-linear effects essentially tell us the probability that a parameter value equal to or greater than what we have fitted could be generated given a random distribution of read counts. The \(p\)-value is not a representation of the strength of a non-linear effect relative to a linear effect. Numerous genes that nominally show significant non-linear effects of pCO\(_{2}\) may be only weakly affected, and a linear effect may in fact be more probable than a non-linear one despite what our p-values tell us. Instead of asking "for what genes may there be significant, non-linear effects of pCO\(_{2}\)?", we should ask "for what genes should we test for significant, non-linear effects?".
-One of the best ways to determine whether a non-linear model is appropriate for a transcript is to determine whether its expression is more probably linear or non-linear relative to a continuous predictor. We can calculate this relative probability using a likelihood ratio test (LRT). In the code chunk below, we will fit linear and 2nd-order non-linear models to the expression of each gene before applying LRTs to each transcript. We will then further filter our edgeR dataset based on (i) significant differential expression attributed to non-linear effects and (ii) a significant LRT supporting non-linear effects. Then, we will replot the expression levels of this re-filtered set. The code below fits gaussian linear models to log2-transformed CPM values, but it can be adjusted to fit negative binomial GLMs to untransformed CPM, similar to edgeR and DESeq2, by using the MASS package to set ‘family = negative.binomial(theta = \(\theta\))’; note that in MASS’s parameterization \(\theta\) is the inverse of the edgeR-style dispersion estimate (i.e., \(\theta\) = 1/BCV\(^2\) for a given transcript).
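-For reference, here is a minimal, not-run sketch of that negative binomial variant. The names counts_df and theta_lookup are hypothetical: counts_df would mirror the tabularized tab_exp_df built below but hold untransformed counts/CPM, and theta_lookup would hold per-gene \(\theta\) estimates (e.g., 1/BCV\(^2\) from edgeR).
-
-library(MASS) # provides the negative.binomial() GLM family
-library(plyr) # provides dlply()
-
-# Hypothetical sketch: fit a NB GLM per gene instead of a gaussian lm.
-# theta_lookup is an assumed named vector of per-gene theta estimates.
-nb_glms <- dlply(counts_df, c("geneid"), function(df)
-  glm(value ~ pCO2 + pCO2_2 + time + pCO2:time + pCO2_2:time,
-      family = negative.binomial(theta = theta_lookup[unique(df$geneid)]),
-      data = df))
-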
-## Using dlply, fit linear and non-linear models to each gene
-# Create pCO2^2 variable in df_all_log
-tab_exp_df$pCO2_2 <- tab_exp_df$pCO2^2
-
-# Fit linear models - should take about 4 minutes
-lms <- dlply(tab_exp_df, c("geneid"), function(df)
-lm(value ~ pCO2 + time + pCO2:time, data = df))
-
-# Fit non-linear models - should take about 2 minutes
-nlms <- dlply(tab_exp_df, c("geneid"), function(df)
-lm(value ~ pCO2 + pCO2_2 + time + pCO2:time + pCO2_2:time, data = df))
-
-# Output nlm coefficients into dataframe
-nlms_coeff <- ldply(nlms, coef)
-head(nlms_coeff)
-## geneid (Intercept) pCO2 pCO2_2 time
-## 1 TR107626|c1_g1_i1 1.9361224 3.995760e-02 -3.819357e-05 3.39371568
-## 2 TR141909|c0_g1_i1 2.5171022 -2.413737e-03 1.468135e-06 -0.18785174
-## 3 TR141946|c0_g1_i1 -1.0829071 2.491456e-03 -3.180644e-06 -0.27604730
-## 4 TR141946|c0_g1_i2 0.6669886 5.415472e-03 -3.724832e-06 0.11726202
-## 5 TR141951|c0_g1_i1 2.0588461 -7.240810e-04 1.457408e-06 0.01615581
-## 6 TR141972|c0_g1_i1 0.3458156 -5.983124e-05 -5.072419e-06 0.16036294
-## pCO2:time pCO2_2:time
-## 1 -0.0180847725 1.627584e-05
-## 2 0.0007717623 -5.931927e-07
-## 3 0.0008763072 -5.730126e-07
-## 4 -0.0006632481 5.727356e-07
-## 5 0.0000993039 -1.247330e-07
-## 6 -0.0001885367 3.957829e-07
-## Apply LRTs to lm's and nlm's for each transcript - should take about 2 minutes
-lrts <- list() # Create list to add LRT results to
-
-for (i in 1:length(lms)) {
-  lrts[[i]] <- lrtest(lms[[i]], nlms[[i]]) # Apply LRTs with for loop; lrtest() is from the lmtest package
-}
-
-## Filter lrt results for transcripts with significantly higher likelihoods of nl model
-lrt_dfs <- list()
-
-# Turn list of LRT outputs into list of dataframes containing output info
-for (i in 1:length(lrts)) {
- lrt_dfs[[i]] <- data.frame(lrts[i])
-}
-
-# Create singular dataframe with geneids and model outputs for chi-squared and LRT p-value
-lrt_coeff_df <- na.omit(bind_rows(lrt_dfs, .id = "column_label")) # na.omit removes first row of each df, which lacks these data
-
-# Add geneid based on element number from original list of LRT outputs
-lrt_coeff_df <- merge(lrt_coeff_df,
- data.frame(geneid = names(nlms),
- column_label = as.character(seq(length(
- nlms
- )))),
- by = "column_label")
-
-# Apply FDR adjustment to LRT p-values before filtering for sig non-linear effects
-lrt_coeff_df$FDR <- p.adjust(lrt_coeff_df$Pr..Chisq., method = "fdr")
-
-# Filter LRT results for sig FDR coeff... produces 162 genes
-lrt_filt <- filter(lrt_coeff_df, FDR < 0.05)
-
-## Plot sig nl genes according to LRT, grouped by timepoint and direction of beta 1 coefficient
-# Add beta coefficients to logCPM df
-pCO2_pos <- filter(nlms_coeff, pCO2 > 0)
-pCO2_2_pos <- filter(nlms_coeff, pCO2_2 > 0)
-
-# Bin genes based on positive or negative pCO2 and pCO2^2 betas
-tab_exp_avg$pCO2_binom <- ifelse(tab_exp_avg$geneid %in% pCO2_pos$geneid, "Positive", "Negative")
-tab_exp_avg$pCO2_2_binom <- ifelse(tab_exp_avg$geneid %in% pCO2_2_pos$geneid, "Concave", "Convex")
-
-# Filter for how many gene id's with significant likelihood of nl effect in LRT
-LRT_filt_df <- filter(tab_exp_avg, geneid %in% lrt_filt$geneid)
-
-# Plot
-ggplot(data = LRT_filt_df,
- aes(x = pCO2, y = value)) +
- geom_path(
- alpha = .25,
- size = 0.25,
- stat = "identity",
- aes(group = as.factor(geneid))
- ) +
- facet_grid(pCO2_2_binom ~ time) +
- geom_smooth(method = "loess", se = TRUE, span = 1) +
- theme_classic() +
- theme(strip.background = element_blank()) +
- labs(y = "Avg logCPM", title = "Non-linear changes in GE output by LRTs")
-
-# Count how many gene id's with significant likelihood of nl effect in LRT... 162 genes
-nrow(as.data.frame(unique(LRT_filt_df$geneid)))
-## [1] 162
-The above plot, which visualizes the expression of the 162 genes identified by LRTs as having significantly higher likelihoods of non-linear variation in logCPM across pCO\(_{2}\), appears distinct from our first plot of non-linear expression deemed significant by edgeR. The overall trends appear more convex or concave, consistent with second-order polynomial regressions. This speaks to the strength of LRTs in identifying probable non-linear effects or other multifactorial effects.
-Let’s see how many non-linear transcripts identified in LRTs exhibited significant non-linear effects within our edgeR models. To do this, we’ll filter down the significant DEGs output by edgeR to include gene IDs within our significant LRT output. Then, we’ll plot the expression of this filtered set across pCO\(_{2}\).
-# Filter down df for gene id's exhibit pCO2 significant effect in edgeR and significant likelihood of nl effect in LRT
-edgeR_LRT_df <- filter(tab_exp_avg, geneid %in% lrt_filt$geneid & pCO2_sig == "Yes" |
- geneid %in% lrt_filt$geneid & pCO2_2_sig == "Yes")
-
-# Plot
-ggplot(data = edgeR_LRT_df,
- aes(x = pCO2, y = value)) +
- geom_path(
- alpha = .25,
- size = 0.25,
- stat = "identity",
- aes(group = as.factor(geneid))
- ) +
- facet_grid(pCO2_2_binom~time) +
- geom_smooth(method = "loess", se = TRUE, span = 1) +
- theme_classic() +
- theme(strip.background = element_blank()) +
- labs(y = "Avg logCPM", title = "Non-linear changes in GE output by edgeR & LRTs")
-
-# Count how many gene id's exhibit pCO2 significant effect in edgeR and significant likelihood of nl effect in LRT... 89 genes
-nrow(as.data.frame(unique(edgeR_LRT_df$geneid)))
-## [1] 89
-89 genes exhibited a significant effect of pCO\(_{2}\) or pCO\(_{2}\)^2 and a high likelihood of a non-linear effect as identified by an LRT. This number is substantially lower than the tens of thousands of genes identified by edgeR alone and the 162 identified via LRT. Additionally, there appears to be only partial overlap between edgeR and LRTs, speaking to the difference in the information you can extract from a significant effect identified in a generalized linear model (e.g., edgeR) vs. likelihood-based approaches such as an LRT. Because these two tests provide distinct and valuable information, the most robust approach we have laid out for estimating non-linear effects of pCO\(_{2}\) on gene expression comes from filtering for significant DEGs in edgeR and significant LRTs.
-What combination of approaches is most informative ultimately depends on your experimental design and the question you are addressing. For example, we can explore an additional question that our previous graphs did not touch on: how does gene expression vary between exposure times for genes exhibiting significant interactions between pCO\(_{2}\)^2 and time? Let’s produce an exploratory plot of non-linear expression across 0.5 and 7 days of exposure for such genes identified using edgeR alone.
-# Export diff expression data for transcripts with significant DE associated with interaction between PCO2^2 and time
-nl_pCO2_2_int_sig <- topTags(nl_pCO2_2_int, n = (390 + 271), adjust.method = "BH",p.value = 0.05)
-nl_pCO2_2_int_sig_geneids <- row.names(nl_pCO2_2_int_sig) #Output a list of geneids associated with sig PCO2^2 x time interaction
-
-# Filter down df for gene id's exhibit pCO2 significant effect in edgeR and significant likelihood of nl effect in LRT
-edgeR_interaction_df <- filter(tab_exp_avg, geneid %in% nl_pCO2_2_int_sig_geneids )
-edgeR_interaction_df$gene_id_time <- paste(edgeR_interaction_df$geneid,
- edgeR_interaction_df$time,
- sep = "_")
-
-# Average logCPM across different groups according to pCO2^2 estimate and time
-edgeR_interaction_avg <- summarySE(measurevar = "value",
- groupvars = c("time", "pCO2", "pCO2_2_binom"),
- data = edgeR_interaction_df)
-
-# Plot
-ggplot(data = edgeR_interaction_avg,
- aes(
- x = pCO2,
- y = value,
- color = as.factor(time),
- group = as.factor(time)
- )) +
- geom_path(stat = "identity") +
- geom_errorbar(aes(ymin = value - se, ymax = value + se), width = 0) +
- geom_point() +
- facet_wrap(~pCO2_2_binom) +
- theme_classic() +
- theme(strip.background = element_blank()) +
- labs(y = "logCPM", color = "Time (days)", title = "Interactions between pCO2^2 and exposure time")
-
-After visualizing variation in expression across pCO\(_{2}\) for genes identified in edgeR to be affected by interactions between pCO\(_{2}\)^2 and exposure time, we can see interesting changes in gene expression curves across time. For example, genes that exhibited a convex expression curve at 0.5 days of exposure generally modified their curve such that their expression became positively exponential relative to pCO\(_{2}\). One potential explanation for this change is that the upregulation of some genes under intermediate pCO\(_{2}\) was reduced following acclimation, while less acclimation may have occurred at the higher pCO\(_{2}\) level after 7 days. Genes with a concave expression curve across pCO\(_{2}\) at 0.5 days of exposure exhibited a marked decrease in expression as the experiment progressed, regardless of pCO\(_{2}\) level. Thus, the interaction between pCO\(_{2}\)^2 and time may affect ‘convex’ and ‘concave’ genes in entirely different ways, underscoring the importance of partitioning or binning your data by parameters such as pCO\(_{2}\)^2. Exploring your data in this way is important for understanding interactive effects in RNA-seq analyses, regardless of whether you are fitting non-linear or linear continuous effects. Below we discuss further considerations for modeling interactions between linear continuous predictors.
-Interactive effects shaping gene expression are common in nature and are becoming increasingly prevalent in models of gene expression derived from experimental studies. Below, we go a little deeper on interactive effects by outlining methods for fitting them using categorical and continuous variables in models of expression. We provide examples in edgeR, DESeq2, and Voom and point out how the specification of interactive effects differs between these packages. Lastly, we compare correlations between these programs’ fold change (logFC) predictions and test statistics. This final comparison of model predictions for interaction parameters is as much a tutorial on interactive effects as it is a tutorial on how to visualize differences between DE packages’ results. Please use this section of the code for any comparisons between DE packages you may want to make for other effect types, not just interactive effects!
-edgeR and Voom both take the same syntax for interactive effects, which we define below using the model.matrix() function as ‘design_multi <- model.matrix( ~1 + pCO2 + pCO2:day )’. Then, we will use this multifactorial model design in both edgeR and Voom as specified below:
-# Fit multifactorial design matrix that includes an interaction term for pCO2 x day
-design_multi <-
- model.matrix(~ 1 + pCO2 + pCO2:day) #Generate multivariate edgeR glm
-
-# Fit quasi-likelihood, neg binom linear regression
-multi_fit <-
- glmQLFit(y1, design_multi) # Fit multivariate model to counts
-
-# Test for effect of pCO2
-tr_pCO2 <-
- glmQLFTest(
- multi_fit,
- coef = 2,
- contrast = NULL,
- poisson.bound = FALSE
- ) # Estimate significant DEGs
-
-is.de_tr_pCO2 <-
- decideTestsDGE(tr_pCO2, adjust.method = "fdr", p.value = 0.05) # Make contrasts
-
-summary(is.de_tr_pCO2)
-## pCO2
-## Down 0
-## NotSig 62579
-## Up 0
-# Test for interaction between pCO2 and time
-tr_int <-
- glmQLFTest(multi_fit, coef = 3, poisson.bound = FALSE) # Estimate significant DEGs
-
-is.de_int <-
- decideTestsDGE(tr_int, adjust.method = "fdr", p.value = 0.05) # Make contrasts
-
-summary(is.de_int)
-## pCO2:day
-## Down 1021
-## NotSig 60821
-## Up 737
-Below we will fit the same ‘design_multi’ model (y ~ 1 + pCO\(_{2}\) + pCO\(_{2}\):days) to our read counts using Voom rather than edgeR. As stated earlier, Voom fits interactions between fixed effects using the same syntax as edgeR, so the ‘design_multi’ object can be used by both programs.
-What is different between edgeR and Voom is the manner by which Voom accounts for variation between replicates. Rather than using gene-wise or averaged estimations of biological coefficients of variation and inputting these values as \(\theta\) in a negative binomial GLM, Voom models what is referred to as the mean-variance relationship and incorporates this relationship within a linear model rather than a GLM by “weighting” the accuracy of a given observation based on its level of expression and modelled variance. This mean-variance relationship can be plotted by Voom after model fitting, as we show below:
-# Perform Voom transformation
-Voom <- voom(y, design_multi, plot = T)
-
-When deciding between packages for testing DGE, it can be helpful to compare model assumptions regarding variance by plotting graphs like Voom’s mean-variance relationship above and the output of edgeR’s ‘plotBCV()’ function, which we provided an example of earlier and will replot below for the sake of making this visual comparison. Take note that the y-axes of these variance plots in Voom (square root of standard deviation) and edgeR (biological coefficient of variation) are different:
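-A minimal sketch of that replot, assuming dispersions were already estimated on the DGEList y1 used in the edgeR section above:
-
-# Replot edgeR's biological coefficient of variation (BCV) trend
-# (assumes estimateDisp() or similar was run on y1 earlier)
-plotBCV(y1)
-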
-The mean-variance relationships modelled in edgeR and Voom are generally similar, but it appears that Voom assumes even less variance in expression among genes at the highest expression levels. One consequence of this may be that Voom grants greater statistical power to tests of differential expression for genes with high expression relative to genes with low expression. Keep this in mind as we continue to move forward with fitting parameters for interactions between pCO\(_{2}\) and time in Voom.
-# Fit using Voom
-lm_Voom_fit <- lmFit(Voom, design_multi)
-
-# Create a contrast across continuous pCO2 variable
-cont_pCO2 <- contrasts.fit(lm_Voom_fit, coef = "pCO2")
-
-# Create a contrast across interaction between continuous pCO2 and time variables
-cont_pCO2_day <- contrasts.fit(lm_Voom_fit, coef = "pCO2:day")
-
-# Perform empirical Bayes smoothing of standard errors
-cont_pCO2 <- eBayes(cont_pCO2)
-cont_pCO2_day <- eBayes(cont_pCO2_day)
-
-# Output test statistics
-pCO2_results <-
- topTable(cont_pCO2,
- coef = "pCO2",
- adjust.method = "fdr",
- n = Inf)
-pCO2_day_results <-
- topTable(cont_pCO2_day,
- coef = "pCO2:day",
- adjust.method = "fdr",
- n = Inf)
-
-# How many DEGs are associated with pCO2
-length(which(pCO2_results$adj.P.Val < 0.05))
-## [1] 0
-# How many DEGs are associated with pCO2:day?
-length(which(pCO2_day_results$adj.P.Val < 0.05))
-## [1] 1628
-Voom has identified 1628 genes whose expression is affected by an interaction between pCO\(_{2}\) and time, compared to 1758 (1021 down + 737 up) identified by edgeR.
-DESeq2 requires a slightly different syntax for specifying interactive effects compared to edgeR and Voom.
-gcounts <- as.data.frame(data_input)
-
-totalCounts <- colSums(gcounts)
-
-### REMOVE GENES WITH LOW MEAN COUNTS ###
-
-# Make a DGEList object for edgeR
-y <- DGEList(counts = data_input, remove.zeros = TRUE)
-
-# Keep genes with more than 0.5 CPM (this is ~10 counts in the count file) in at least 9 samples
-keep_g <- rowSums(cpm(gcounts) > .5) >= 9
-
-table(keep_g)
-## keep_g
-## FALSE TRUE
-## 20632 62579
-# To apply the filter and recalculate library sizes, subset the DGEList with keep.lib.sizes = FALSE
-# y <- y[keep_g, , keep.lib.sizes = FALSE]
-
-### BUILD A DATAFRAME ASSOCIATING SAMPLE NAMES WITH TREATMENT CONDITIONS ###
-targets
-## pCO2 day treatment grouping
-## 1 255 7.0 B 255.7
-## 2 255 7.0 B 255.7
-## 3 255 7.0 B 255.7
-## 4 255 0.5 B 255.0.5
-## 5 255 0.5 B 255.0.5
-## 6 255 0.5 B 255.0.5
-## 7 530 7.0 R 530.7
-## 8 530 7.0 R 530.7
-## 9 530 7.0 R 530.7
-## 10 530 0.5 R 530.0.5
-## 11 530 0.5 R 530.0.5
-## 12 530 0.5 R 530.0.5
-## 13 918 7.0 Y 918.7
-## 14 918 7.0 Y 918.7
-## 15 918 7.0 Y 918.7
-## 16 918 0.5 Y 918.0.5
-## 17 918 0.5 Y 918.0.5
-## 18 918 0.5 Y 918.0.5
-### WALD TEST - FULL MODEL ###
-
-dds <- DESeqDataSetFromMatrix(gcounts,
- colData = targets,
- design = formula( ~ 1 + pCO2 + day : pCO2))
-
-rld <- rlog(dds)
-rld.df <- assay(rld)
-
-# Wald test for pCO2:day
-dds_int <- DESeq(dds, minReplicatesForReplace = Inf)
-
-design <- design(dds_int)
-
-DESeq2_int_result_names <- resultsNames(dds_int)
-
-# Count DEGs due to interaction
-DESeq2_int_results <- results(dds_int, name = "pCO2.day", lfcThreshold = 0, alpha = 0.05)
-
-summary(DESeq2_int_results)
-##
-## out of 81450 with nonzero total read count
-## adjusted p-value < 0.05
-## LFC > 0 (up) : 2491, 3.1%
-## LFC < 0 (down) : 2494, 3.1%
-## outliers [1] : 1373, 1.7%
-## low counts [2] : 2724, 3.3%
-## (mean count < 5)
-## [1] see 'cooksCutoff' argument of ?results
-## [2] see 'independentFiltering' argument of ?results
-DESeq2 has identified 4985 genes whose expression is affected by an interaction between pCO\(_{2}\) and time, compared to 1758 identified by edgeR and 1628 by Voom. This is a ~3-fold greater number of DEGs predicted by DESeq2 compared to edgeR and Voom.
-Differential expression packages can make dramatically different predictions for both the fold change and probability of differential expression. If you find yourself deciding between different packages or wanting to compare how conservative different approaches are, it helps to run regressions of the model predictions made by different packages. Below we apply the ggpairs() function from GGally (Schloerke et al. 2018) to plot a correlation matrix of logFC values and negative log-transformed p-values attributed to interactive effects between time and pCO\(_{2}\) predicted by edgeR, DESeq2, and Voom:
-# Merge logFC and pval data from each program
-Voom_edgeR_deseq_int_comp <- merge(
- merge(
- data.frame(geneid = row.names(pCO2_day_results),
- Voom_logFC = pCO2_day_results$logFC,
- Voom_pval = pCO2_day_results$P.Value),
- data.frame(geneid = row.names(tr_int$table),
- edgeR_logFC = tr_int$table$logFC,
- edgeR_pval = tr_int$table$PValue),
- by = "geneid" ),
- data.frame(geneid = row.names(DESeq2_int_results),
- DESeq2_logFC = DESeq2_int_results$log2FoldChange,
- DESeq2_pval = DESeq2_int_results$pvalue),
- by = "geneid")
-
-# Create neg log pvalues
-Voom_edgeR_deseq_int_comp$Voom_neglogp <- -log(Voom_edgeR_deseq_int_comp$Voom_pval)
-Voom_edgeR_deseq_int_comp$edgeR_neglogp <- -log(Voom_edgeR_deseq_int_comp$edgeR_pval)
-Voom_edgeR_deseq_int_comp$DESeq2_neglogp <- -log(Voom_edgeR_deseq_int_comp$DESeq2_pval)
-
-# Create function for adding 1:1 trendline on ggpairs plot and plotting geom_hex instead of geom_point
-my_fn <- function(data, mapping, ...){
- p <- ggplot(data = data, mapping = mapping) +
- geom_hex( bins = 100,
- aes(fill = stat(log(count))), alpha = 1 ) +
- scale_fill_viridis_c() +
- geom_abline(
- slope = 1,
- intercept = 0,
- color = "red",
- lty = 2,
- size = 1,
- alpha = 0.5
- )
- p
-}
-
-# Correlation matrix of pvalues
-pval_pairs <- ggpairs(data = Voom_edgeR_deseq_int_comp,
- columns = c(8, 9, 10),
- mapping = aes(alpha = 0.001),
- lower = list(continuous = my_fn)) +
- labs(title = "Correlation matrix: interaction p-values")
-
-pval_pairs +
- theme_classic(base_rect_size = 0)
-
-The above plot is the output of ggpairs() showing correlations between negative log-transformed p-values attributed to interactive effects of time x pCO\(_{2}\) estimated by each of the three packages, as well as the distributions of this parameter for each package and correlation statistics. Red dashed lines represent a slope equal to 1. Yellow colors depict regions of each linear regression with more observations.
-Voom and edgeR share the largest correlation between p-values estimated for an interactive effect of pCO\(_{2}\) : days on differential expression (R\(^2\) = 0.965). The bottom row of this ggpairs plot shows transformed p-values from DESeq2 on the y-axis, which appear to be skewed toward lower p-values/higher -log p-values. This is consistent with the fact that we observed ~3-fold more DEGs attributed to the pCO\(_{2}\) : days interaction when running DESeq2 compared to edgeR and Voom.
-Next, let’s create a ggpairs() plot contrasting logFC estimates of DE attributed to pCO\(_{2}\) : days in edgeR, Voom, and DESeq2. One might think that logFCs should demonstrate stronger correlations between programs than p-values do, but due to the manner in which read counts are normalized and transformed in different packages, this is not always the case!
-# Correlation matrix of logFC's
-logFC_pairs <- ggpairs(data = Voom_edgeR_deseq_int_comp,
- columns = c( 2, 4, 6 ),
- mapping = aes( alpha = 0.001 ),
- lower = list(continuous = my_fn)) +
- labs(title = "Correlation matrix: interaction logFC's")
-
-logFC_pairs +
- theme_classic(base_rect_size = 0)
-
-In this pairs plot of logFC estimates across packages, you can see that R\(^2\) values are, on average, lower than those of the p-value correlations between packages. Once again, the strongest correlation exists between edgeR and Voom. However, the slope of this correlation is less than 1, indicating that absolute edgeR logFCs are lower on average than absolute Voom logFCs. The correlation pair that shows a slope closest to 1.0 is edgeR vs. DESeq2. edgeR and DESeq2 normalize read counts in a similar manner, and thus a value derived from differences in read counts across continuous predictors, such as logFC, is likely to be more similar between these packages.
-For this final section on random effects, we will skip edgeR and DESeq2, since neither package can fit them.
-Random intercepts (often referred to as random effects) can be an essential component of estimating differential expression in experiments that incorporate non-independent observations (e.g., repeated measures and nested designs: individuals within populations or families) or randomly sampled groups. Most differential expression packages do not possess functionality for fitting random intercepts, extending GLMs to generalized linear mixed models or GLMMs, but Voom does! Additional packages that we don’t cover here also possess random intercept functionality, including dream (Hoffman and Roussos 2021) and glmmSeq.
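-For reference, here is a minimal sketch of the dream workflow under the same design, assuming the DGEList y and the targets metadata used throughout this walkthrough (see the variancePartition vignette for current syntax):
-
-library(variancePartition) # provides dream() and voomWithDreamWeights()
-
-# Mixed-model formula: fixed effect of pCO2, random intercept for day
-form <- ~ pCO2 + (1 | day)
-
-# Compute voom-style precision weights under the mixed model
-vobj <- voomWithDreamWeights(y, form, targets)
-
-# Fit the mixed model to each gene and extract moderated statistics
-fit_dream <- eBayes(dream(vobj, form, targets))
-head(topTable(fit_dream, coef = "pCO2"))
-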
-Below is an example of how to fit a random intercept associated with the time variable of our dataset using Voom:
-# Fit design matrix with an lme4-style random intercept term for day
-# Note: model.matrix() does not parse (1 | day) as a random effect, which is
-# why Voom returns the 'Coefficients not estimable' warning below
-design_rand <- model.matrix(~ 1 + pCO2 + (1 | day))
-
-# Perform Voom transformation
-Voom_rand <- voom(y, design_rand, plot = T)
-## Coefficients not estimable: 1 | dayTRUE
-
-If you compare the mean-variance plot from our mixed model above with our first mean-variance plot from the Voom model that did not include a random effect, you will notice that the modelled relationship between variance and mean expression is considerably different between the two. While the linear model’s mean-variance plot took on a negative exponential shape, by which early increases in expression level associate with a precipitous decline in variance, our mixed model’s fit takes on a more logarithmic shape and the raw points appear parabolic. Here, variance is modelled as remaining somewhat constant during early increases in expression level before declining more rapidly across intermediate expression levels. Additionally, Voom predicted a maximum variance of 2.0 in our linear model, while the mixed model above predicts a maximum of 1.5. This reduction in overall variance may partially be explained by incorporating a random intercept for time rather than modelling it as a continuous fixed effect. As we progress through the DE test derived from this mixed model, we will anticipate that a reduction in variance will influence our results, likely by increasing the statistical power of genewise DE tests, and that modelling time as a random effect will also modify what and how many genes are identified as differentially expressed.
-# Fit using Voom
-lm_Voom_fit_rand <- lmFit(Voom_rand, design_rand)
-## Coefficients not estimable: 1 | dayTRUE
-# Create a contrast across continuous pCO2 variable
-cont_rand_day <- contrasts.fit(lm_Voom_fit_rand, coef = "pCO2")
-
-# Perform empirical Bayes smoothing of standard errors
-cont_rand_day <- eBayes(cont_rand_day)
-
-# Output test statistics
-rand_results <- topTable(cont_rand_day,
- coef = "pCO2",
- adjust.method = "fdr",
- n = Inf)
-
-# How many DEG are associated with pCO2 after incorporating a random effect for day?
-length(which(rand_results$adj.P.Val < 0.05)) # number of DE genes
-## [1] 3
-While the linear Voom model’s DE test output 0 predicted DEGs associated with pCO\(_{2}\), the mixed Voom model predicted 3 genes. As we saw earlier when fitting linear models, the majority of DEGs in this study were derived from an interaction between time and pCO\(_{2}\). We modelled this as an interaction between two continuous fixed effects, but random effects and continuous effects can also jointly influence outcome variables. Such effects can be fit in mixed models using ‘random slopes’, which we describe in the next section.
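-As an aside: because model.matrix() cannot parse the (1 | day) term (note the ‘Coefficients not estimable’ warnings above), limma’s documented route to a random intercept is duplicateCorrelation(). A minimal sketch, again assuming the DGEList y and targets metadata from earlier:
-
-# Fixed-effect-only design; day enters as a blocking variable instead
-design_fixed <- model.matrix(~ 1 + pCO2)
-v <- voom(y, design_fixed)
-
-# Estimate the within-day correlation, then re-weight and re-fit with it
-dupcor <- duplicateCorrelation(v, design_fixed, block = targets$day)
-v <- voom(y, design_fixed, block = targets$day,
-          correlation = dupcor$consensus.correlation)
-fit_block <- lmFit(v, design_fixed, block = targets$day,
-                   correlation = dupcor$consensus.correlation)
-fit_block <- eBayes(fit_block)
-head(topTable(fit_block, coef = "pCO2"))
-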
-Distinct from the random intercept, testing for variation in gene expression attributed to a random slope effect asks, “Within a set of groups, does the reaction norm of gene expression across a continuous predictor vary randomly by group?”. This question is relevant to experimental designs testing for differences in the plasticity of gene expression across randomly selected demographics such as genotypes or families. For these kinds of groups, modelling an interaction between the continuous predictor and a categorical effect representing the different groups would be inappropriate, as the use of a categorical effect should be reserved for groups of interest specifically selected in an experiment (e.g., different sexes, populations inhabiting distinct environments, etc.). To our knowledge, however, there is no well-documented differential expression package that enables the fitting of random slopes.
-Below we detail custom scripts for fitting gaussian linear mixed models to read count data using lme4 (the same approach can be extended to negative binomial GLMs). For practical purposes, we will treat exposure time as a random variable, even though timepoints were not randomly selected in the experiment that produced our data. We will apply linear mixed models using lmer(), which is computationally intensive, so the models fitted below are input with a highly filtered and reduced set of read count data. After fitting an LMM to each gene with a random slope parameter, by which the effect of pCO\(_{2}\) varies randomly across time, we will compare the likelihoods of these models to those of a ‘null’ LMM that only possesses a random intercept associated with time. We will then count and report the number of genes that show significantly high likelihoods for random slopes.
-## For practical purposes, reduce size of input data in order to run lmer in for loop without using up memory
-# Re-filter data
-keep_red <- rowSums(cpm(y) > 3 & cpm(y) < 10) >= 12
-
-# Apply read filter to tab_exp_df
-tab_exp_df_filt <- filter(tab_exp_df, geneid %in%
- row.names(filter(as.data.frame(keep_red), keep_red == TRUE)))
-
-# Using dlply, fit linear mixed model to tabularized df of log2-transformed CPM values for each transcript
-tab_exp_df_filt$time <- as.factor(tab_exp_df_filt$time)
-# (1 + pCO2 | time) fits a random intercept and a random slope for pCO2 by time;
-# adding a separate (1 | time) term would redundantly duplicate the intercept
-rs_lmes <- dlply(tab_exp_df_filt, c("geneid"), function(df)
-  lmer(value ~ pCO2 + (1 + pCO2 | time), data = df))
-
-# Fit null model without random slope
-null_lmes <- dlply(tab_exp_df_filt, c("geneid"), function(df)
-lmer(value ~ pCO2 + (1 | time), data = df))
-
-# Apply LRT to each gene to test for effect of random slope
-lmer_lrts <- list() # Create list to add LRT results to
-
-for (i in 1:length(rs_lmes)) {
- lmer_lrts[[i]] <- lrtest(rs_lmes[[i]], null_lmes[[i]]) # Apply LRTs with for loop
-}
-
-## Filter lrt results for transcripts with significantly higher likelihoods than null model
-lmer_lrt_dfs <- list()
-
-# Turn list of LRT outputs into list of dataframes containing output info
-for (i in 1:length(lmer_lrts)) {
- lmer_lrt_dfs[[i]] <- data.frame(lmer_lrts[i])
-}
-
-# Create singular dataframe with geneids and model outputs for chi-squared and LRT p-value
-lmer_lrt_coeff_df <- na.omit(bind_rows(lmer_lrt_dfs, .id = "column_label")) # na.omit removes first row of each df, which lacks these data
-
-# Add geneid based on element number from original list of LRT outputs
-lmer_lrt_coeff_df <- merge(lmer_lrt_coeff_df,
- data.frame(geneid = names(rs_lmes),
- column_label = as.character(seq(length(rs_lmes)))),
- by = "column_label")
-
-# Apply FDR adjustment to LRT p-values before filtering for sig non-linear effects
-lmer_lrt_coeff_df$FDR <- p.adjust(lmer_lrt_coeff_df$Pr..Chisq., method = "fdr")
-
-# Filter LRT results for sig FDR coeff...
-lmer_lrt_filt <- filter(lmer_lrt_coeff_df, FDR < 0.05)
-
-# How many genes showed significant LRT?... none!
-count(lmer_lrt_filt)
-## [1] column_label X.Df LogLik Df Chisq
-## [6] Pr..Chisq. geneid FDR freq
-## <0 rows> (or 0-length row.names)
-If you’ve made it this far, you have probably noticed a common theme: popular DE packages do not provide all of the flexibility necessary for fitting models to data from multifactorial experimental designs, even some of the more standard designs. It should be restated that our walkthrough only scratches the surface of (i) the available packages that can facilitate multifactorial DE tests and (ii) the opportunities that custom model-fitting scripts provide. For example, our example code for fitting linear and linear mixed models before testing for effects using likelihood ratio tests can easily be edited to fit negative binomial GLMs, such as those used by edgeR and DESeq2, supplied with \(\theta\) (dispersion) estimates from edgeR or custom estimates, and we encourage readers to try out this strategy! One useful resource for doing so is (Rocke et al. 2015), a preprint describing transformations of RNA-seq data that enable model fitting using the ‘MASS’ function glm.nb(). If you have personal code that you believe would make a valuable addition to this walkthrough, don’t hesitate to post a GitHub issue or submit a pull request. We will review the suggestions and add your name to the list of this page’s authors! We hope that this page and others on the website can serve as living documents that keep pace with advances in multifactorial RNA-seq analyses, lending fidelity to the experimental designs necessary for generating reproducible and predictive results.
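-To make that suggestion concrete, here is a minimal sketch complementing the fixed-\(\theta\) negative.binomial() example earlier; glm.nb() instead estimates \(\theta\) itself, and counts_df is again a hypothetical tabularized dataframe of raw counts analogous to tab_exp_df:
-
-library(MASS) # provides glm.nb()
-library(plyr) # provides dlply()
-
-# Hypothetical sketch: per-gene NB GLMs with theta estimated by glm.nb()
-nb_fits <- dlply(counts_df, c("geneid"), function(df)
-  glm.nb(count ~ pCO2 + pCO2_2 + time + pCO2:time + pCO2_2:time, data = df))
-
-# These fits can then be compared against reduced (linear-only) counterparts
-# with lrtest(), exactly as in the gaussian LRT workflow above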
-