Skip to content

Commit

Permalink
Bump Zenodo link
Browse files Browse the repository at this point in the history
  • Loading branch information
alanocallaghan committed Dec 2, 2023
1 parent d569c8a commit b5cc61c
Show file tree
Hide file tree
Showing 4 changed files with 51 additions and 30 deletions.
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
all: Workflow.pdf

IMAGE=alanocallaghan/basicsworkflow2020-docker
VERSION=0.4.0
VERSION=0.5.3

%.pdf: %.Rmd figure/Workflow-Overview.pdf figure/Workflow-Schematic.pdf
docker run -v $(shell pwd):/home/rstudio/mycode \
Expand Down
62 changes: 33 additions & 29 deletions Workflow.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ author:
- &MRC MRC Human Genetics Unit, Institute of Genetics \& Cancer,
University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh,
EH4 2XU, UK
email: "a.b.o'[email protected]"
email: "[email protected]"
- name: Nils Eling
affiliation:
- &UZH Department of Quantitative Biomedicine, University of Zurich,
Expand Down Expand Up @@ -174,6 +174,9 @@ The latter enables the quantification of a gene-level *residual over-dispersion*
as a measure of transcriptional noise that is not confounded by differences in
gene expression. Furthermore, when available, `r Biocpkg("BASiCS")` can also
leverage extrinsic spike-in molecules to aid data normalisation.
A previous study has shown that BASiCS, when used as a generative
model, is well-tuned to capture and recapitulate the main properties of
typical scRNAseq data using posterior predictive checks [@Zappia2017].

<!--- Describes aims for the workflow and introduces BASiCS --->
This article complements existing scRNA-seq workflows based on the Bioconductor
Expand Down Expand Up @@ -299,7 +302,8 @@ where less information is available.

A high-level overview for the model implemented in `r Biocpkg("BASiCS")` is
shown in Figure \@ref(fig:basics-schematic),
and a more detailed description is provided in [TODO: REPO LINK].
and a more detailed description is provided in the `ExtendedMethods` document
in the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
Model parameters are designed to capture different
aspects of scRNAseq data. First, *nuisance* parameters are used to
capture technical artifacts. This includes cell-specific parameters (indexed by
Expand Down Expand Up @@ -333,8 +337,12 @@ more reliable estimates. The global trend is then used to derive gene-specific
*residual over-dispersion* parameters $\epsilon_i$ that are not confounded by
mean expression. Similar to the DM values implemented in `r Biocpkg("scran")`,
these are defined as deviations with respect to the overall trend.
These residual over-dispersion values can be used to quantify transcriptional
heterogeneity without any confounding with mean expression, though
it should be noted that they may also represent structural heterogeneity
resulting from heterogeneous cell types within the population being studied.

```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="TODO A schematic representation of the features of the BASiCS model."}
```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="A schematic representation of the features of the BASiCS model. BASiCS accounts for cell-to-cell and batch-to-batch variability using nuisance parameters. Accounting for cell-to-cell and batch-to-batch technical variability allows BASiCS to robustly perform gene-level inference of mean and variability. Furthermore, by accounting for the association between mean and variability in scRNAseq, BASiCS can also infer transcriptional noise using the residual variability parameter $\\epsilon_i$."}
knitr::include_graphics("figure/Workflow-Schematic.pdf", auto_pdf = TRUE)
```

Expand Down Expand Up @@ -392,7 +400,7 @@ To illustrate the use of `r Biocpkg("BASiCS")` in this context, we use
droplet-based scRNAseq generated using the 10X microfluidics platform. An
example using a dataset generated using a plate-based protocol including
spike-ins is provided in the "AnalysisCD4T.Rmd" document of the
Zenodo repository [TODO: ADD-REF].
Zenodo repository for this manuscript (see [Data availability](#data-availability)).
In particular, we focus on the data generated by @Ibarra-Soria2018 which
contains expression profiles for 20,819 cells collected from E8.25 mouse
embryos. These data comprise a highly heterogeneous population of cells with
Expand All @@ -403,7 +411,7 @@ The authors also provide additional information summarising the experimental
design (animal of source for each cell) and the cluster labels generated by
in their analysis. The code required to download these data is provided in
the "DataPreparationBASiCSWorkflow.Rmd" document of
the Zenodo repository [TODO: ADD-REF].
the Zenodo repository for this manuscript (see [Data availability](#data-availability)).

As explained above, when analysing a heterogeneous population of cells, a
clustering analysis is required to identify the cell types that drive this
Expand All @@ -421,17 +429,15 @@ While quality control and pre-processing is an important topic in scRNAseq
analysis, we feel that it has been covered in detail in other articles.
For this analysis, we have performed quality control, pre-processing and
exploratory data analysis in the "DataPreparationBASiCSWorkflow.Rmd" document of
the Zenodo repository [TODO: ADD-REF].
the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
The code presented there needs to be run before running the rest of this
workflow. Here, we load the pre-processed data. This consists of two separate
`SingleCellExperiment` data objects: one for presomitic and one for
somitic mesoderm cells.

```{r SCE-load}
# Website were the files are located
chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/"
# chains_website <- "https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/-/raw/master/"
# chains_website <- "https://zenodo.org/record/5243265/files/"
chains_website <- "https://zenodo.org/record/10251224/files/"
# To avoid timeout issues as we are downloading large files
options(timeout = 1000)
# File download
Expand Down Expand Up @@ -536,6 +542,7 @@ Extra parameters can be used to store the output
(`StoreChains`, `StoreDir`, `RunName`) and to monitor the progress of the
algorithm (`PrintProgress`).


Here, we run the MCMC sampler separately for somitic and pre-somitic mesoderm
cells. We use 20,000 iterations (`N`), discarding the initial 10,000 iterations
as burn-in, (`Burn`), and saving parameter values only once in each 10
Expand Down Expand Up @@ -582,33 +589,32 @@ chain_psm <- BASiCS_MCMC(
This first of these chains takes 140 minutes to complete on a 3.4 GHz Intel
Core i7 4770k procesor with 32GB RAM, while the second takes 159 minutes.
For convenience, these can be obtained online at
[https://doi.org/10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265).
[https://doi.org/10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224).

```{r download-chain-SM}
# Website were the files are located
chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/"
# chains_website <- "https://zenodo.org/record/5243265/files/"
chains_website <- "https://zenodo.org/record/10251224/files/"
# To avoid timeout issues as we are downloading large files
options(timeout = 1000)
# File download
## The code below uses `file.exists` to check if files were previously downloaded
## After download, files are then stored in an `rds` sub-folder
if (!file.exists("rds/chain_SM.Rds")) {
if (!file.exists("rds/chain_sm.Rds")) {
download.file(
paste0(chains_website, "rds/chain_SM.Rds"),
destfile = "rds/chain_SM.Rds"
paste0(chains_website, "rds/chain_sm.Rds"),
destfile = "rds/chain_sm.Rds"
)
}
if (!file.exists("rds/chain_PSM.Rds")) {
if (!file.exists("rds/chain_psm.Rds")) {
download.file(
paste0(chains_website, "rds/chain_PSM.Rds"),
destfile = "rds/chain_PSM.Rds"
paste0(chains_website, "rds/chain_psm.Rds"),
destfile = "rds/chain_psm.Rds"
)
}
## Loads files after download
chain_sm <- readRDS("rds/chain_SM.Rds")
chain_psm <- readRDS("rds/chain_PSM.Rds")
chain_sm <- readRDS("rds/chain_sm.Rds")
chain_psm <- readRDS("rds/chain_psm.Rds")
```

The output from `BASiCS_MCMC` is a `BASiCS_Chain` object that contains the
Expand Down Expand Up @@ -659,7 +665,8 @@ For illustration purposes, here we focus on the chains generated for the
somatic mesoderm cells. However, it is important to analyse convergence for
all MCMC chains; we have omitted this analysis for pre-somitic
mesoderm cells in the present manuscript for brevity, but they can be viewed
in the [TODO: ZENODO REPO].
in the data preparation document shown in the Zenodo repository for
this manuscript (see [Data availability](#data-availability)).

Traceplots can be used to visually assess the history of MCMC iterations
for a specific parameter (e.g. Figure \@ref(fig:convergence-SM), left panel).
Expand Down Expand Up @@ -687,7 +694,7 @@ Here, we illustrate usage for two such metrics focusing on the MCMC
chain that was obtained for the somitic mesoderm cells (similar results were
obtained for pre-somitic mesoderm cells in the
"DataPreparationBASiCSWorkflow.Rmd" document of
the Zenodo repository [TODO: ADD-REF].
the Zenodo repository (see [Data availability](#data-availability)).
First, we focus on the diagnostic criterion proposed by Geweke [@Geweke1995].
The latter compares the average of draws obtained during the initial (10\% after
burn in, by default) and the final part of the chain (50\% by default)
Expand Down Expand Up @@ -734,7 +741,6 @@ large Geweke diagnostic values or low effective sample sizes in a certain
dataset, then caution should be applied when interpreting the results of the
model. These issues can often be addressed by more stringent
filtering of genes and cells before performing inference.
TODO: this is worse than the equivalent with EB prior, mention in that doc

```{r mcmc-diag, fig.cap="MCMC diagnostics for gene-specific mean expression parameters; somitic mesoderm cells. A: Geweke Z-score for mean expression parameters is plotted against mean expression estimates. Dashed lines represent absolute Z-scores of 3, outside of which we advise caution when interpreting results. B: Effective sample size (ESS) is plotted against mean expression estimates. A dashed line shows a threshold of 100, below which we advise caution when interpreting results."}
diag_p1 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "geweke") +
Expand Down Expand Up @@ -861,7 +867,7 @@ estimates (Figure \@ref(fig:dispersion-vs-mean)D, Pearson's correlation equal
to `r with(parameter_df, cor(DM, residualoverdisp))`) but the residual
variability estimates generated for these data are less consistent than what we
have observed in the context of less sparse data (see the "AnalysisCD4T.Rmd"
document in the [TODO: ADD-REF] Zenodo repository).
document in the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
For the majority of genes
(`r with(parameter_df[sel_genes,], sum(sign(DM) == sign(residualoverdisp)))`
out of `r sum(sel_genes)`
Expand Down Expand Up @@ -1033,9 +1039,7 @@ This section highlights the use of `r Biocpkg("BASiCS")` to perform differential
expression tests for mean and variability between different pre-specified
populations of cells and experimental conditions.
Here, we compare the somitic mesoderm cells, analysed in the previous section,
to pre-somitic mesoderm cells analysed in the same study
TODO: ref, explain comparison.

to pre-somitic mesoderm cells analysed in the same study.
Differential expression testing is performed via the
`BASiCS_TestDE` function. The main input parameters are

Expand Down Expand Up @@ -1686,13 +1690,13 @@ sessionInfo()
```


# Data availability
# Data availability {#data-availability}

The data used in this article are available for download
on EBI ArrayExpress under accession number
[E-MTAB-4888](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4888/).
The MCMC chains used to generate this article can be found in Zenodo
under the DOI [10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265).
under the DOI [10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224).

# Software availability

Expand Down
17 changes: 17 additions & 0 deletions Workflow.bib
Original file line number Diff line number Diff line change
Expand Up @@ -1561,3 +1561,20 @@ @article{Lloyd-Smith2007
langid = {english},
file = {/home/alan/Zotero/storage/FUMMVMLI/Lloyd-Smith - 2007 - Maximum Likelihood Estimation of the Negative Bino.pdf}
}
@article{Zappia2017,
title = {Splatter: Simulation of Single-Cell {{RNA}} Sequencing Data},
shorttitle = {Splatter},
author = {Zappia, Luke and Phipson, Belinda and Oshlack, Alicia},
year = {2017},
month = dec,
journal = {Genome Biology},
volume = {18},
number = {1},
pages = {174},
issn = {1474-760X},
doi = {10.1186/s13059-017-1305-0},
urldate = {2020-02-07},
abstract = {As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.},
langid = {english},
file = {/home/alan/Zotero/storage/YLXBGQVH/Zappia et al. - 2017 - Splatter simulation of single-cell RNA sequencing.pdf}
}
Binary file modified figure/Workflow-Schematic.pdf
Binary file not shown.

0 comments on commit b5cc61c

Please sign in to comment.