Bump Zenodo link

VallejosGroup · Dec 2, 2023 · b5cc61c · b5cc61c
1 parent d569c8a
commit b5cc61c
Show file tree

Hide file tree

Showing 4 changed files with 51 additions and 30 deletions.
diff --git a/Makefile b/Makefile
@@ -3,7 +3,7 @@
 all: Workflow.pdf
 
 IMAGE=alanocallaghan/basicsworkflow2020-docker
-VERSION=0.4.0
+VERSION=0.5.3
 
 %.pdf: %.Rmd figure/Workflow-Overview.pdf figure/Workflow-Schematic.pdf
 	docker run -v $(shell pwd):/home/rstudio/mycode \

diff --git a/Workflow.Rmd b/Workflow.Rmd
@@ -6,7 +6,7 @@ author:
     - &MRC MRC Human Genetics Unit, Institute of Genetics \& Cancer, 
       University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, 
       EH4 2XU, UK
-    email: "a.b.o'[email protected]"
+    email: "[email protected]"
   - name: Nils Eling
     affiliation: 
     - &UZH Department of Quantitative Biomedicine, University of Zurich,
@@ -174,6 +174,9 @@ The latter enables the quantification of a gene-level *residual over-dispersion*
 as a measure of transcriptional noise that is not confounded by differences in
 gene expression. Furthermore, when available, `r Biocpkg("BASiCS")` can also 
 leverage extrinsic spike-in molecules to aid data normalisation. 
+A previous study has shown that BASiCS, when used as a generative
+model, is well-tuned to capture and recapitulate the main properties of
+typical scRNAseq data using posterior predictive checks [@Zappia2017].
 
 <!--- Describes aims for the workflow and introduces BASiCS ---> 
 This article complements existing scRNA-seq workflows based on the Bioconductor 
@@ -299,7 +302,8 @@ where less information is available.
 
 A high-level overview for the model implemented in `r Biocpkg("BASiCS")` is
 shown in Figure \@ref(fig:basics-schematic),
-and a more detailed description is provided in [TODO: REPO LINK].
+and a more detailed description is provided in the `ExtendedMethods` document
+in the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
 Model parameters are designed to capture different 
 aspects of scRNAseq data. First, *nuisance* parameters are used to 
 capture technical artifacts. This includes cell-specific parameters (indexed by 
@@ -333,8 +337,12 @@ more reliable estimates. The global trend is then used to derive gene-specific
 *residual over-dispersion* parameters $\epsilon_i$ that are not confounded by
 mean expression. Similar to the DM values implemented in `r Biocpkg("scran")`,
 these are defined as deviations with respect to the overall trend.
+These residual over-dispersion values can be used to quantify transcriptional
+heterogeneity without any confounding with mean expression, though
+it should be noted that they may also represent structural heterogeneity
+resulting from heterogeneous cell types within the population being studied.
 
-```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="TODO A schematic representation of the features of the BASiCS model."}
+```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="A schematic representation of the features of the BASiCS model. BASiCS accounts for cell-to-cell and batch-to-batch variability using nuisance parameters. Accounting for cell-to-cell and batch-to-batch technical variability allows BASiCS to robustly perform gene-level inference of mean and variability. Furthermore, by accounting for the association between mean and variability in scRNAseq, BASiCS can also infer transcriptional noise using the residual variability parameter $\\epsilon_i$."}
 knitr::include_graphics("figure/Workflow-Schematic.pdf", auto_pdf = TRUE)
 ```
 
@@ -392,7 +400,7 @@ To illustrate the use of `r Biocpkg("BASiCS")` in this context, we use
 droplet-based scRNAseq generated using the 10X microfluidics platform. An 
 example using a dataset generated using a plate-based protocol including 
 spike-ins is provided in the "AnalysisCD4T.Rmd" document of the
-Zenodo repository [TODO: ADD-REF].
+Zenodo repository for this manuscript (see [Data availability](#data-availability)).
 In particular, we focus on the data generated by @Ibarra-Soria2018 which
 contains expression profiles for 20,819 cells collected from E8.25 mouse 
 embryos. These data comprise a highly heterogeneous population of cells with 
@@ -403,7 +411,7 @@ The authors also provide additional information summarising the experimental
 design (animal of source for each cell) and the cluster labels generated by
 in their analysis. The code required to download these data is provided in 
 the "DataPreparationBASiCSWorkflow.Rmd" document of
-the Zenodo repository [TODO: ADD-REF].
+the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
 
 As explained above, when analysing a heterogeneous population of cells, a 
 clustering analysis is required to identify the cell types that drive this 
@@ -421,17 +429,15 @@ While quality control and pre-processing is an important topic in scRNAseq
 analysis, we feel that it has been covered in detail in other articles.
 For this analysis, we have performed quality control, pre-processing and 
 exploratory data analysis in the "DataPreparationBASiCSWorkflow.Rmd" document of
-the Zenodo repository [TODO: ADD-REF]. 
+the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
 The code presented there needs to be run before running the rest of this 
 workflow. Here, we load the pre-processed data. This consists of two separate 
 `SingleCellExperiment` data objects: one for presomitic and one for
 somitic mesoderm cells. 
 
 ```{r SCE-load}
 # Website were the files are located
-chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/"
-# chains_website <- "https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/-/raw/master/"
-# chains_website <- "https://zenodo.org/record/5243265/files/"
+chains_website <- "https://zenodo.org/record/10251224/files/"
 # To avoid timeout issues as we are downloading large files
 options(timeout = 1000)
 # File download
@@ -536,6 +542,7 @@ Extra parameters can be used to store the output
 (`StoreChains`, `StoreDir`, `RunName`) and to monitor the progress of the 
 algorithm (`PrintProgress`).
 
+
 Here, we run the MCMC sampler separately for somitic and pre-somitic mesoderm
 cells. We use 20,000 iterations (`N`), discarding the initial 10,000 iterations
 as burn-in, (`Burn`), and saving parameter values only once in each 10
@@ -582,33 +589,32 @@ chain_psm <- BASiCS_MCMC(
 This first of these chains takes 140 minutes to complete on a 3.4 GHz Intel
 Core i7 4770k procesor with 32GB RAM, while the second takes 159 minutes.
 For convenience, these can be obtained online at 
-[https://doi.org/10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265).
+[https://doi.org/10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224).
 
 ```{r download-chain-SM}
 # Website were the files are located
-chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/"
-# chains_website <- "https://zenodo.org/record/5243265/files/"
+chains_website <- "https://zenodo.org/record/10251224/files/"
 # To avoid timeout issues as we are downloading large files
 options(timeout = 1000)
 # File download
 ## The code below uses `file.exists` to check if files were previously downloaded
 ## After download, files are then stored in an `rds` sub-folder 
-if (!file.exists("rds/chain_SM.Rds")) {
+if (!file.exists("rds/chain_sm.Rds")) {
   download.file(
-    paste0(chains_website, "rds/chain_SM.Rds"),
-    destfile = "rds/chain_SM.Rds"
+    paste0(chains_website, "rds/chain_sm.Rds"),
+    destfile = "rds/chain_sm.Rds"
   )
 }
-if (!file.exists("rds/chain_PSM.Rds")) {
+if (!file.exists("rds/chain_psm.Rds")) {
   download.file(
-    paste0(chains_website, "rds/chain_PSM.Rds"),
-    destfile = "rds/chain_PSM.Rds"
+    paste0(chains_website, "rds/chain_psm.Rds"),
+    destfile = "rds/chain_psm.Rds"
   )
 }
 
 ## Loads files after download
-chain_sm <- readRDS("rds/chain_SM.Rds")
-chain_psm <- readRDS("rds/chain_PSM.Rds")
+chain_sm <- readRDS("rds/chain_sm.Rds")
+chain_psm <- readRDS("rds/chain_psm.Rds")
 ```
 
 The output from `BASiCS_MCMC` is a `BASiCS_Chain` object that contains the 
@@ -659,7 +665,8 @@ For illustration purposes, here we focus on the chains generated for the
 somatic mesoderm cells. However, it is important to analyse convergence for 
 all MCMC chains; we have omitted this analysis for pre-somitic 
 mesoderm cells in the present manuscript for brevity, but they can be viewed
-in the [TODO: ZENODO REPO].
+in the data preparation document shown in the Zenodo repository for
+this manuscript (see [Data availability](#data-availability)).
 
 Traceplots can be used to visually assess the history of MCMC iterations 
 for a specific parameter (e.g. Figure \@ref(fig:convergence-SM), left panel). 
@@ -687,7 +694,7 @@ Here, we illustrate usage for two such metrics focusing on the MCMC
 chain that was obtained for the somitic mesoderm cells (similar results were 
 obtained for pre-somitic mesoderm cells in the
 "DataPreparationBASiCSWorkflow.Rmd" document of
-the Zenodo repository [TODO: ADD-REF]. 
+the Zenodo repository (see [Data availability](#data-availability)). 
 First, we focus on the diagnostic criterion proposed by Geweke [@Geweke1995].
 The latter compares the average of draws obtained during the initial (10\% after 
 burn in, by default) and the final part of the chain (50\% by default)
@@ -734,7 +741,6 @@ large Geweke diagnostic values or low effective sample sizes in a certain
 dataset, then caution should be applied when interpreting the results of the 
 model. These issues can often be addressed by more stringent 
 filtering of genes and cells before performing inference.
-TODO: this is worse than the equivalent with EB prior, mention in that doc
 
 ```{r mcmc-diag, fig.cap="MCMC diagnostics for gene-specific mean expression parameters; somitic mesoderm cells. A: Geweke Z-score for mean expression parameters is plotted against mean expression estimates. Dashed lines represent absolute Z-scores of 3, outside of which we advise caution when interpreting results. B: Effective sample size (ESS) is plotted against mean expression estimates. A dashed line shows a threshold of 100, below which we advise caution when interpreting results."}
 diag_p1 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "geweke") +
@@ -861,7 +867,7 @@ estimates (Figure \@ref(fig:dispersion-vs-mean)D, Pearson's correlation equal
 to `r with(parameter_df, cor(DM, residualoverdisp))`) but the residual 
 variability estimates generated for these data are less consistent than what we 
 have observed in the context of less sparse data (see the "AnalysisCD4T.Rmd"
-document in the [TODO: ADD-REF] Zenodo repository). 
+document in the Zenodo repository for this manuscript (see [Data availability](#data-availability)). 
 For the majority of genes 
 (`r with(parameter_df[sel_genes,], sum(sign(DM) == sign(residualoverdisp)))`
 out of `r sum(sel_genes)`
@@ -1033,9 +1039,7 @@ This section highlights the use of `r Biocpkg("BASiCS")` to perform differential
 expression tests for mean and variability between different pre-specified 
 populations of cells and experimental conditions. 
 Here, we compare the somitic mesoderm cells, analysed in the previous section,
-to pre-somitic mesoderm cells analysed in the same study
-TODO: ref, explain comparison.
-
+to pre-somitic mesoderm cells analysed in the same study.
 Differential expression testing is performed via the
 `BASiCS_TestDE` function. The main input parameters are
 
@@ -1686,13 +1690,13 @@ sessionInfo()
 ```
 
 
-# Data availability
+# Data availability {#data-availability}
 
 The data used in this article are available for download
 on EBI ArrayExpress under accession number
 [E-MTAB-4888](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4888/).
 The MCMC chains used to generate this article can be found in Zenodo
-under the DOI [10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265).
+under the DOI [10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224).
 
 # Software availability
 

diff --git a/Workflow.bib b/Workflow.bib
@@ -1561,3 +1561,20 @@ @article{Lloyd-Smith2007
   langid = {english},
   file = {/home/alan/Zotero/storage/FUMMVMLI/Lloyd-Smith - 2007 - Maximum Likelihood Estimation of the Negative Bino.pdf}
 }
+@article{Zappia2017,
+  title = {Splatter: Simulation of Single-Cell {{RNA}} Sequencing Data},
+  shorttitle = {Splatter},
+  author = {Zappia, Luke and Phipson, Belinda and Oshlack, Alicia},
+  year = {2017},
+  month = dec,
+  journal = {Genome Biology},
+  volume = {18},
+  number = {1},
+  pages = {174},
+  issn = {1474-760X},
+  doi = {10.1186/s13059-017-1305-0},
+  urldate = {2020-02-07},
+  abstract = {As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.},
+  langid = {english},
+  file = {/home/alan/Zotero/storage/YLXBGQVH/Zappia et al. - 2017 - Splatter simulation of single-cell RNA sequencing.pdf}
+}
diff --git a/figure/Workflow-Schematic.pdf b/figure/Workflow-Schematic.pdf