diff --git a/Makefile b/Makefile index e3dab54..a55bf92 100644 --- a/Makefile +++ b/Makefile @@ -3,7 +3,7 @@ all: Workflow.pdf IMAGE=alanocallaghan/basicsworkflow2020-docker -VERSION=0.4.0 +VERSION=0.5.3 %.pdf: %.Rmd figure/Workflow-Overview.pdf figure/Workflow-Schematic.pdf docker run -v $(shell pwd):/home/rstudio/mycode \ diff --git a/Workflow.Rmd b/Workflow.Rmd index 6805bb7..13f2bca 100644 --- a/Workflow.Rmd +++ b/Workflow.Rmd @@ -6,7 +6,7 @@ author: - &MRC MRC Human Genetics Unit, Institute of Genetics \& Cancer, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK - email: "a.b.o'callaghan@sms.ed.ac.uk" + email: "alan.ocallaghan@outlook.com" - name: Nils Eling affiliation: - &UZH Department of Quantitative Biomedicine, University of Zurich, @@ -174,6 +174,9 @@ The latter enables the quantification of a gene-level *residual over-dispersion* as a measure of transcriptional noise that is not confounded by differences in gene expression. Furthermore, when available, `r Biocpkg("BASiCS")` can also leverage extrinsic spike-in molecules to aid data normalisation. +A previous study has shown that BASiCS, when used as a generative +model, is well-tuned to capture and recapitulate the main properties of +typical scRNAseq data using posterior predictive checks [@Zappia2017]. This article complements existing scRNA-seq workflows based on the Bioconductor @@ -299,7 +302,8 @@ where less information is available. A high-level overview for the model implemented in `r Biocpkg("BASiCS")` is shown in Figure \@ref(fig:basics-schematic), -and a more detailed description is provided in [TODO: REPO LINK]. +and a more detailed description is provided in the `ExtendedMethods` document +in the Zenodo repository for this manuscript (see [Data availability](#data-availability)). Model parameters are designed to capture different aspects of scRNAseq data. First, *nuisance* parameters are used to capture technical artifacts. This includes cell-specific parameters (indexed by @@ -333,8 +337,12 @@ more reliable estimates. The global trend is then used to derive gene-specific *residual over-dispersion* parameters $\epsilon_i$ that are not confounded by mean expression. Similar to the DM values implemented in `r Biocpkg("scran")`, these are defined as deviations with respect to the overall trend. +These residual over-dispersion values can be used to quantify transcriptional +heterogeneity without any confounding with mean expression, though +it should be noted that they may also represent structural heterogeneity +resulting from heterogeneous cell types within the population being studied. -```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="TODO A schematic representation of the features of the BASiCS model."} +```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="A schematic representation of the features of the BASiCS model. BASiCS accounts for cell-to-cell and batch-to-batch variability using nuisance parameters. Accounting for cell-to-cell and batch-to-batch technical variability allows BASiCS to robustly perform gene-level inference of mean and variability. Furthermore, by accounting for the association between mean and variability in scRNAseq, BASiCS can also infer transcriptional noise using the residual variability parameter $\\epsilon_i$."} knitr::include_graphics("figure/Workflow-Schematic.pdf", auto_pdf = TRUE) ``` @@ -392,7 +400,7 @@ To illustrate the use of `r Biocpkg("BASiCS")` in this context, we use droplet-based scRNAseq generated using the 10X microfluidics platform. An example using a dataset generated using a plate-based protocol including spike-ins is provided in the "AnalysisCD4T.Rmd" document of the -Zenodo repository [TODO: ADD-REF]. +Zenodo repository for this manuscript (see [Data availability](#data-availability)). In particular, we focus on the data generated by @Ibarra-Soria2018 which contains expression profiles for 20,819 cells collected from E8.25 mouse embryos. These data comprise a highly heterogeneous population of cells with @@ -403,7 +411,7 @@ The authors also provide additional information summarising the experimental design (animal of source for each cell) and the cluster labels generated by in their analysis. The code required to download these data is provided in the "DataPreparationBASiCSWorkflow.Rmd" document of -the Zenodo repository [TODO: ADD-REF]. +the Zenodo repository for this manuscript (see [Data availability](#data-availability)). As explained above, when analysing a heterogeneous population of cells, a clustering analysis is required to identify the cell types that drive this @@ -421,7 +429,7 @@ While quality control and pre-processing is an important topic in scRNAseq analysis, we feel that it has been covered in detail in other articles. For this analysis, we have performed quality control, pre-processing and exploratory data analysis in the "DataPreparationBASiCSWorkflow.Rmd" document of -the Zenodo repository [TODO: ADD-REF]. +the Zenodo repository for this manuscript (see [Data availability](#data-availability)). The code presented there needs to be run before running the rest of this workflow. Here, we load the pre-processed data. This consists of two separate `SingleCellExperiment` data objects: one for presomitic and one for @@ -429,9 +437,7 @@ somitic mesoderm cells. ```{r SCE-load} # Website were the files are located -chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/" -# chains_website <- "https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/-/raw/master/" -# chains_website <- "https://zenodo.org/record/5243265/files/" +chains_website <- "https://zenodo.org/record/10251224/files/" # To avoid timeout issues as we are downloading large files options(timeout = 1000) # File download @@ -536,6 +542,7 @@ Extra parameters can be used to store the output (`StoreChains`, `StoreDir`, `RunName`) and to monitor the progress of the algorithm (`PrintProgress`). + Here, we run the MCMC sampler separately for somitic and pre-somitic mesoderm cells. We use 20,000 iterations (`N`), discarding the initial 10,000 iterations as burn-in, (`Burn`), and saving parameter values only once in each 10 @@ -582,33 +589,32 @@ chain_psm <- BASiCS_MCMC( This first of these chains takes 140 minutes to complete on a 3.4 GHz Intel Core i7 4770k procesor with 32GB RAM, while the second takes 159 minutes. For convenience, these can be obtained online at -[https://doi.org/10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265). +[https://doi.org/10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224). ```{r download-chain-SM} # Website were the files are located -chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/" -# chains_website <- "https://zenodo.org/record/5243265/files/" +chains_website <- "https://zenodo.org/record/10251224/files/" # To avoid timeout issues as we are downloading large files options(timeout = 1000) # File download ## The code below uses `file.exists` to check if files were previously downloaded ## After download, files are then stored in an `rds` sub-folder -if (!file.exists("rds/chain_SM.Rds")) { +if (!file.exists("rds/chain_sm.Rds")) { download.file( - paste0(chains_website, "rds/chain_SM.Rds"), - destfile = "rds/chain_SM.Rds" + paste0(chains_website, "rds/chain_sm.Rds"), + destfile = "rds/chain_sm.Rds" ) } -if (!file.exists("rds/chain_PSM.Rds")) { +if (!file.exists("rds/chain_psm.Rds")) { download.file( - paste0(chains_website, "rds/chain_PSM.Rds"), - destfile = "rds/chain_PSM.Rds" + paste0(chains_website, "rds/chain_psm.Rds"), + destfile = "rds/chain_psm.Rds" ) } ## Loads files after download -chain_sm <- readRDS("rds/chain_SM.Rds") -chain_psm <- readRDS("rds/chain_PSM.Rds") +chain_sm <- readRDS("rds/chain_sm.Rds") +chain_psm <- readRDS("rds/chain_psm.Rds") ``` The output from `BASiCS_MCMC` is a `BASiCS_Chain` object that contains the @@ -659,7 +665,8 @@ For illustration purposes, here we focus on the chains generated for the somatic mesoderm cells. However, it is important to analyse convergence for all MCMC chains; we have omitted this analysis for pre-somitic mesoderm cells in the present manuscript for brevity, but they can be viewed -in the [TODO: ZENODO REPO]. +in the data preparation document shown in the Zenodo repository for +this manuscript (see [Data availability](#data-availability)). Traceplots can be used to visually assess the history of MCMC iterations for a specific parameter (e.g. Figure \@ref(fig:convergence-SM), left panel). @@ -687,7 +694,7 @@ Here, we illustrate usage for two such metrics focusing on the MCMC chain that was obtained for the somitic mesoderm cells (similar results were obtained for pre-somitic mesoderm cells in the "DataPreparationBASiCSWorkflow.Rmd" document of -the Zenodo repository [TODO: ADD-REF]. +the Zenodo repository (see [Data availability](#data-availability)). First, we focus on the diagnostic criterion proposed by Geweke [@Geweke1995]. The latter compares the average of draws obtained during the initial (10\% after burn in, by default) and the final part of the chain (50\% by default) @@ -734,7 +741,6 @@ large Geweke diagnostic values or low effective sample sizes in a certain dataset, then caution should be applied when interpreting the results of the model. These issues can often be addressed by more stringent filtering of genes and cells before performing inference. -TODO: this is worse than the equivalent with EB prior, mention in that doc ```{r mcmc-diag, fig.cap="MCMC diagnostics for gene-specific mean expression parameters; somitic mesoderm cells. A: Geweke Z-score for mean expression parameters is plotted against mean expression estimates. Dashed lines represent absolute Z-scores of 3, outside of which we advise caution when interpreting results. B: Effective sample size (ESS) is plotted against mean expression estimates. A dashed line shows a threshold of 100, below which we advise caution when interpreting results."} diag_p1 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "geweke") + @@ -861,7 +867,7 @@ estimates (Figure \@ref(fig:dispersion-vs-mean)D, Pearson's correlation equal to `r with(parameter_df, cor(DM, residualoverdisp))`) but the residual variability estimates generated for these data are less consistent than what we have observed in the context of less sparse data (see the "AnalysisCD4T.Rmd" -document in the [TODO: ADD-REF] Zenodo repository). +document in the Zenodo repository for this manuscript (see [Data availability](#data-availability)). For the majority of genes (`r with(parameter_df[sel_genes,], sum(sign(DM) == sign(residualoverdisp)))` out of `r sum(sel_genes)` @@ -1033,9 +1039,7 @@ This section highlights the use of `r Biocpkg("BASiCS")` to perform differential expression tests for mean and variability between different pre-specified populations of cells and experimental conditions. Here, we compare the somitic mesoderm cells, analysed in the previous section, -to pre-somitic mesoderm cells analysed in the same study -TODO: ref, explain comparison. - +to pre-somitic mesoderm cells analysed in the same study. Differential expression testing is performed via the `BASiCS_TestDE` function. The main input parameters are @@ -1686,13 +1690,13 @@ sessionInfo() ``` -# Data availability +# Data availability {#data-availability} The data used in this article are available for download on EBI ArrayExpress under accession number [E-MTAB-4888](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4888/). The MCMC chains used to generate this article can be found in Zenodo -under the DOI [10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265). +under the DOI [10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224). # Software availability diff --git a/Workflow.bib b/Workflow.bib index 2a5dadd..1e03f5e 100644 --- a/Workflow.bib +++ b/Workflow.bib @@ -1561,3 +1561,20 @@ @article{Lloyd-Smith2007 langid = {english}, file = {/home/alan/Zotero/storage/FUMMVMLI/Lloyd-Smith - 2007 - Maximum Likelihood Estimation of the Negative Bino.pdf} } +@article{Zappia2017, + title = {Splatter: Simulation of Single-Cell {{RNA}} Sequencing Data}, + shorttitle = {Splatter}, + author = {Zappia, Luke and Phipson, Belinda and Oshlack, Alicia}, + year = {2017}, + month = dec, + journal = {Genome Biology}, + volume = {18}, + number = {1}, + pages = {174}, + issn = {1474-760X}, + doi = {10.1186/s13059-017-1305-0}, + urldate = {2020-02-07}, + abstract = {As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.}, + langid = {english}, + file = {/home/alan/Zotero/storage/YLXBGQVH/Zappia et al. - 2017 - Splatter simulation of single-cell RNA sequencing.pdf} +} diff --git a/figure/Workflow-Schematic.pdf b/figure/Workflow-Schematic.pdf index 78364a5..10dacc5 100644 Binary files a/figure/Workflow-Schematic.pdf and b/figure/Workflow-Schematic.pdf differ