-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d569c8a
commit b5cc61c
Showing
4 changed files
with
51 additions
and
30 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,7 +6,7 @@ author: | |
- &MRC MRC Human Genetics Unit, Institute of Genetics \& Cancer, | ||
University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, | ||
EH4 2XU, UK | ||
email: "a.b.o'[email protected]" | ||
email: "[email protected]" | ||
- name: Nils Eling | ||
affiliation: | ||
- &UZH Department of Quantitative Biomedicine, University of Zurich, | ||
|
@@ -174,6 +174,9 @@ The latter enables the quantification of a gene-level *residual over-dispersion* | |
as a measure of transcriptional noise that is not confounded by differences in | ||
gene expression. Furthermore, when available, `r Biocpkg("BASiCS")` can also | ||
leverage extrinsic spike-in molecules to aid data normalisation. | ||
A previous study has shown that BASiCS, when used as a generative | ||
model, is well-tuned to capture and recapitulate the main properties of | ||
typical scRNAseq data using posterior predictive checks [@Zappia2017]. | ||
|
||
<!--- Describes aims for the workflow and introduces BASiCS ---> | ||
This article complements existing scRNA-seq workflows based on the Bioconductor | ||
|
@@ -299,7 +302,8 @@ where less information is available. | |
|
||
A high-level overview for the model implemented in `r Biocpkg("BASiCS")` is | ||
shown in Figure \@ref(fig:basics-schematic), | ||
and a more detailed description is provided in [TODO: REPO LINK]. | ||
and a more detailed description is provided in the `ExtendedMethods` document | ||
in the Zenodo repository for this manuscript (see [Data availability](#data-availability)). | ||
Model parameters are designed to capture different | ||
aspects of scRNAseq data. First, *nuisance* parameters are used to | ||
capture technical artifacts. This includes cell-specific parameters (indexed by | ||
|
@@ -333,8 +337,12 @@ more reliable estimates. The global trend is then used to derive gene-specific | |
*residual over-dispersion* parameters $\epsilon_i$ that are not confounded by | ||
mean expression. Similar to the DM values implemented in `r Biocpkg("scran")`, | ||
these are defined as deviations with respect to the overall trend. | ||
These residual over-dispersion values can be used to quantify transcriptional | ||
heterogeneity without any confounding with mean expression, though | ||
it should be noted that they may also represent structural heterogeneity | ||
resulting from heterogeneous cell types within the population being studied. | ||
|
||
```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="TODO A schematic representation of the features of the BASiCS model."} | ||
```{r basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="A schematic representation of the features of the BASiCS model. BASiCS accounts for cell-to-cell and batch-to-batch variability using nuisance parameters. Accounting for cell-to-cell and batch-to-batch technical variability allows BASiCS to robustly perform gene-level inference of mean and variability. Furthermore, by accounting for the association between mean and variability in scRNAseq, BASiCS can also infer transcriptional noise using the residual variability parameter $\\epsilon_i$."} | ||
knitr::include_graphics("figure/Workflow-Schematic.pdf", auto_pdf = TRUE) | ||
``` | ||
|
||
|
@@ -392,7 +400,7 @@ To illustrate the use of `r Biocpkg("BASiCS")` in this context, we use | |
droplet-based scRNAseq generated using the 10X microfluidics platform. An | ||
example using a dataset generated using a plate-based protocol including | ||
spike-ins is provided in the "AnalysisCD4T.Rmd" document of the | ||
Zenodo repository [TODO: ADD-REF]. | ||
Zenodo repository for this manuscript (see [Data availability](#data-availability)). | ||
In particular, we focus on the data generated by @Ibarra-Soria2018 which | ||
contains expression profiles for 20,819 cells collected from E8.25 mouse | ||
embryos. These data comprise a highly heterogeneous population of cells with | ||
|
@@ -403,7 +411,7 @@ The authors also provide additional information summarising the experimental | |
design (animal of source for each cell) and the cluster labels generated by | ||
in their analysis. The code required to download these data is provided in | ||
the "DataPreparationBASiCSWorkflow.Rmd" document of | ||
the Zenodo repository [TODO: ADD-REF]. | ||
the Zenodo repository for this manuscript (see [Data availability](#data-availability)). | ||
|
||
As explained above, when analysing a heterogeneous population of cells, a | ||
clustering analysis is required to identify the cell types that drive this | ||
|
@@ -421,17 +429,15 @@ While quality control and pre-processing is an important topic in scRNAseq | |
analysis, we feel that it has been covered in detail in other articles. | ||
For this analysis, we have performed quality control, pre-processing and | ||
exploratory data analysis in the "DataPreparationBASiCSWorkflow.Rmd" document of | ||
the Zenodo repository [TODO: ADD-REF]. | ||
the Zenodo repository for this manuscript (see [Data availability](#data-availability)). | ||
The code presented there needs to be run before running the rest of this | ||
workflow. Here, we load the pre-processed data. This consists of two separate | ||
`SingleCellExperiment` data objects: one for presomitic and one for | ||
somitic mesoderm cells. | ||
|
||
```{r SCE-load} | ||
# Website were the files are located | ||
chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/" | ||
# chains_website <- "https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/-/raw/master/" | ||
# chains_website <- "https://zenodo.org/record/5243265/files/" | ||
chains_website <- "https://zenodo.org/record/10251224/files/" | ||
# To avoid timeout issues as we are downloading large files | ||
options(timeout = 1000) | ||
# File download | ||
|
@@ -536,6 +542,7 @@ Extra parameters can be used to store the output | |
(`StoreChains`, `StoreDir`, `RunName`) and to monitor the progress of the | ||
algorithm (`PrintProgress`). | ||
|
||
|
||
Here, we run the MCMC sampler separately for somitic and pre-somitic mesoderm | ||
cells. We use 20,000 iterations (`N`), discarding the initial 10,000 iterations | ||
as burn-in, (`Burn`), and saving parameter values only once in each 10 | ||
|
@@ -582,33 +589,32 @@ chain_psm <- BASiCS_MCMC( | |
This first of these chains takes 140 minutes to complete on a 3.4 GHz Intel | ||
Core i7 4770k procesor with 32GB RAM, while the second takes 159 minutes. | ||
For convenience, these can be obtained online at | ||
[https://doi.org/10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265). | ||
[https://doi.org/10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224). | ||
|
||
```{r download-chain-SM} | ||
# Website were the files are located | ||
chains_website <- "https://gitlab.com/alanocallaghan/basicsworkflow2020/-/raw/master/" | ||
# chains_website <- "https://zenodo.org/record/5243265/files/" | ||
chains_website <- "https://zenodo.org/record/10251224/files/" | ||
# To avoid timeout issues as we are downloading large files | ||
options(timeout = 1000) | ||
# File download | ||
## The code below uses `file.exists` to check if files were previously downloaded | ||
## After download, files are then stored in an `rds` sub-folder | ||
if (!file.exists("rds/chain_SM.Rds")) { | ||
if (!file.exists("rds/chain_sm.Rds")) { | ||
download.file( | ||
paste0(chains_website, "rds/chain_SM.Rds"), | ||
destfile = "rds/chain_SM.Rds" | ||
paste0(chains_website, "rds/chain_sm.Rds"), | ||
destfile = "rds/chain_sm.Rds" | ||
) | ||
} | ||
if (!file.exists("rds/chain_PSM.Rds")) { | ||
if (!file.exists("rds/chain_psm.Rds")) { | ||
download.file( | ||
paste0(chains_website, "rds/chain_PSM.Rds"), | ||
destfile = "rds/chain_PSM.Rds" | ||
paste0(chains_website, "rds/chain_psm.Rds"), | ||
destfile = "rds/chain_psm.Rds" | ||
) | ||
} | ||
## Loads files after download | ||
chain_sm <- readRDS("rds/chain_SM.Rds") | ||
chain_psm <- readRDS("rds/chain_PSM.Rds") | ||
chain_sm <- readRDS("rds/chain_sm.Rds") | ||
chain_psm <- readRDS("rds/chain_psm.Rds") | ||
``` | ||
|
||
The output from `BASiCS_MCMC` is a `BASiCS_Chain` object that contains the | ||
|
@@ -659,7 +665,8 @@ For illustration purposes, here we focus on the chains generated for the | |
somatic mesoderm cells. However, it is important to analyse convergence for | ||
all MCMC chains; we have omitted this analysis for pre-somitic | ||
mesoderm cells in the present manuscript for brevity, but they can be viewed | ||
in the [TODO: ZENODO REPO]. | ||
in the data preparation document shown in the Zenodo repository for | ||
this manuscript (see [Data availability](#data-availability)). | ||
|
||
Traceplots can be used to visually assess the history of MCMC iterations | ||
for a specific parameter (e.g. Figure \@ref(fig:convergence-SM), left panel). | ||
|
@@ -687,7 +694,7 @@ Here, we illustrate usage for two such metrics focusing on the MCMC | |
chain that was obtained for the somitic mesoderm cells (similar results were | ||
obtained for pre-somitic mesoderm cells in the | ||
"DataPreparationBASiCSWorkflow.Rmd" document of | ||
the Zenodo repository [TODO: ADD-REF]. | ||
the Zenodo repository (see [Data availability](#data-availability)). | ||
First, we focus on the diagnostic criterion proposed by Geweke [@Geweke1995]. | ||
The latter compares the average of draws obtained during the initial (10\% after | ||
burn in, by default) and the final part of the chain (50\% by default) | ||
|
@@ -734,7 +741,6 @@ large Geweke diagnostic values or low effective sample sizes in a certain | |
dataset, then caution should be applied when interpreting the results of the | ||
model. These issues can often be addressed by more stringent | ||
filtering of genes and cells before performing inference. | ||
TODO: this is worse than the equivalent with EB prior, mention in that doc | ||
|
||
```{r mcmc-diag, fig.cap="MCMC diagnostics for gene-specific mean expression parameters; somitic mesoderm cells. A: Geweke Z-score for mean expression parameters is plotted against mean expression estimates. Dashed lines represent absolute Z-scores of 3, outside of which we advise caution when interpreting results. B: Effective sample size (ESS) is plotted against mean expression estimates. A dashed line shows a threshold of 100, below which we advise caution when interpreting results."} | ||
diag_p1 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "geweke") + | ||
|
@@ -861,7 +867,7 @@ estimates (Figure \@ref(fig:dispersion-vs-mean)D, Pearson's correlation equal | |
to `r with(parameter_df, cor(DM, residualoverdisp))`) but the residual | ||
variability estimates generated for these data are less consistent than what we | ||
have observed in the context of less sparse data (see the "AnalysisCD4T.Rmd" | ||
document in the [TODO: ADD-REF] Zenodo repository). | ||
document in the Zenodo repository for this manuscript (see [Data availability](#data-availability)). | ||
For the majority of genes | ||
(`r with(parameter_df[sel_genes,], sum(sign(DM) == sign(residualoverdisp)))` | ||
out of `r sum(sel_genes)` | ||
|
@@ -1033,9 +1039,7 @@ This section highlights the use of `r Biocpkg("BASiCS")` to perform differential | |
expression tests for mean and variability between different pre-specified | ||
populations of cells and experimental conditions. | ||
Here, we compare the somitic mesoderm cells, analysed in the previous section, | ||
to pre-somitic mesoderm cells analysed in the same study | ||
TODO: ref, explain comparison. | ||
|
||
to pre-somitic mesoderm cells analysed in the same study. | ||
Differential expression testing is performed via the | ||
`BASiCS_TestDE` function. The main input parameters are | ||
|
||
|
@@ -1686,13 +1690,13 @@ sessionInfo() | |
``` | ||
|
||
|
||
# Data availability | ||
# Data availability {#data-availability} | ||
|
||
The data used in this article are available for download | ||
on EBI ArrayExpress under accession number | ||
[E-MTAB-4888](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4888/). | ||
The MCMC chains used to generate this article can be found in Zenodo | ||
under the DOI [10.5281/zenodo.5243265](https://doi.org/10.5281/zenodo.5243265). | ||
under the DOI [10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224). | ||
|
||
# Software availability | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Binary file not shown.