Draft of results section for scpca-nf #34

Merged
merged 8 commits into from
Feb 28, 2024
71 changes: 38 additions & 33 deletions content/03.results.md
@@ -36,39 +36,44 @@ The project card will also indicate the type(s) of sequencing performed, includi

## Uniform processing of data available on the ScPCA Portal

1. Processing data with scpca-nf and alevin-fry
- All data available on the portal was uniformly processed using scpca-nf, an open-source and efficient Nextflow workflow for quantifying single-cell and single-nuclei RNA-seq data.
- The workflow uses `salmon alevin` and `alevin-fry` to quantify gene expression data and outputs both raw and normalized counts stored as `SingleCellExperiment` and `AnnData` objects.
- In building the workflow, we sought a tool that was fast and memory efficient, with results comparable to those of other popular tools like `Cell Ranger`.
- Reads are aligned using the selective alignment option of `salmon alevin` to an index with transcripts corresponding to spliced cDNA and intronic regions, denoted by `alevin-fry` as a `splici` index.
- We compared quantification of single-cell and single-nuclei samples with `alevin-fry` and `Cell Ranger` and observed a decrease in both run time and memory usage in `alevin-fry` compared to `Cell Ranger` (FigS1A).
- When comparing the total UMIs per cell, total genes detected per cell, and mean gene expression, there was no observable difference between `alevin-fry` and `Cell Ranger` (FigS1B-D).
- By utilizing `alevin-fry` in the `scpca-nf` workflow we can process multiple samples at a fraction of the time and cost.

2. Post-processing of quantified gene expression data (Fig 2A)
- In addition to quantification of gene expression, `scpca-nf` also performs filtering, normalization, dimensionality reduction, and cell type annotation.
- The output from `alevin-fry` includes a gene by cell count matrix for all barcodes identified, even those that may not contain true cells. This matrix is stored in a `SingleCellExperiment` and output from the workflow as an `_unfiltered.rds` file.
- The unfiltered gene by cell counts matrices are then filtered using `DropletUtils::emptyDropsCellRanger()` to remove any barcodes that are not likely to contain cells. All cells that pass this filtering are saved to a filtered `SingleCellExperiment` object and `_filtered.rds` file.
- This filtered object is used as input to the post-processing part of the workflow. This includes removal of low-quality cells using `miQC`, normalization, and dimensionality reduction. The final step of the post-processing performed in `scpca-nf` is classification of cell types using automated methods, `SingleR` and `CellAssign`. The results from this analysis are stored in a processed object saved to a `processed.rds`.
- Providing all three files (unfiltered, filtered, and processed) allows users to perform their own filtering and normalization or to skip those steps and use the already processed objects.
- Finally, all `SingleCellExperiment` objects saved as `.rds` files are converted to `AnnData` objects and saved as `.hdf5` files to allow for downstream processing in either R or Python.
- On the Portal, users can choose to download data as either `SingleCellExperiment` or `AnnData` objects and all downloads will contain all three objects output from `scpca-nf`, the unfiltered, filtered, and processed objects (do we include the download illustrations in the figure to display this?)

3. QC report (Fig 2B)
- Along with outputting the uniformly processed data files, `scpca-nf` also includes a step to create a quality control report for each library.
- This report includes a summary of processing information and library statistics, e.g., the total number of mapped reads, total number of cells, and relevant versions of tools used within the workflow like `salmon` and `alevin-fry`.
- Each report also includes summarized plots showing the quality of each library.
- The knee plot shown in the report ranks the total number of UMIs in each droplet and indicates cells that remained after filtering out empty droplets.
- For each cell that passes filtering out empty droplets, the number of total UMIs, genes detected, and mitochondrial reads is calculated. These cell metrics are summarized in a single plot.
- To remove low-quality cells from the counts matrices, `scpca-nf` applies `miQC`, a data driven approach to filtering cells. The `miQC` model and a plot showing which cells are kept and removed when filtering with `miQC` are shown in the QC report.
- Finally, remaining cells are normalized and undergo dimensionality reduction. The QC report includes a single UMAP where cells are colored by the total number of genes detected and a faceted UMAP where cells are colored by the expression of a top highly variable gene.

4. Benefits of scpca-nf/ Nextflow allows for reproducibility and portability (Does this fit here or should it be earlier before describing the workflow?)
- Using Nextflow as the backbone for the `scpca-nf` workflow ensures reproducibility and portability for users on other systems.
- The `scpca-nf` workflow can be run in almost any environment, including Slurm, Torque, AWS Batch, etc. (https://www.nextflow.io/docs/latest/executor.html). This allows users to run the workflow in the environment they are most comfortable with, with minimal set-up of dependencies.
- Nextflow handles all dependencies automatically and set up generally requires only organizing input files and configuring Nextflow to run in your environment.
- Each process in the workflow is run in a docker container, so users only need to install Nextflow and docker to be able to use this workflow.
- Nextflow also handles parallelizing processing based on your environment and will configure processing so that run time is minimal.
All data available on the Portal was uniformly processed using [`scpca-nf`](https://github.com/AlexsLemonade/scpca-nf), an open-source and efficient Nextflow[@url:https://www.nextflow.io/docs/latest/index.html] workflow for quantifying single-cell and single-nuclei RNA-seq data.
Using Nextflow as the backbone for the `scpca-nf` workflow ensures both reproducibility and portability.
All dependencies for the workflow are handled automatically, as each process in the workflow is run in a Docker container.
Nextflow is compatible with various computing environments, including high-performance computing and cloud-based computing, allowing users to run the workflow in their preferred environment.
Setup requires organizing input files and updating a single configuration file for your computing environment after installing Nextflow and either Docker or Singularity.
Nextflow will also handle parallelizing sample processing as allowed by your environment, minimizing run time.
Because a Nextflow workflow can be executed in a wide range of computing environments and each process runs in a Docker container, `scpca-nf` is easily portable for external use.

When building `scpca-nf`, we sought a fast and memory-efficient tool for gene expression quantification to minimize processing costs.
We expected many users of the Portal to have their own single-cell or single-nuclei data processed with Cell Ranger[@url:https://www.10xgenomics.com/support/software/cell-ranger/latest], due to its popularity.
Thus, selecting a tool with comparable results to Cell Ranger was also desirable.
In comparing `alevin-fry` [@doi:10.1038/s41592-022-01408-3] to Cell Ranger, we found `alevin-fry` had a lower run time and memory usage (Supplemental Figure 1A), while retaining comparable mean gene expression for all genes (Supplemental Figure 1B), total UMIs per cell (Supplemental Figure 1C), and total genes detected per cell (Supplemental Figure 1D).
<!--TODO: Do we like including this here or is it not worth mentioning? -->
(All analyses comparing gene expression quantification tools are available in a public analysis repository[@url:https://github.com/AlexsLemonade/alsf-scpca].)
Based on these results, we elected to use `salmon alevin` and `alevin-fry` [@doi:10.1038/s41592-022-01408-3] in `scpca-nf` to quantify gene expression data.

`scpca-nf` takes FASTQ files as input (Figure 2A).
Reads are aligned using the selective alignment option of `salmon alevin` to an index with transcripts corresponding to spliced cDNA and intronic regions, denoted by `alevin-fry` as a `splici` index.
The output from `alevin-fry` includes a gene-by-cell count matrix for all barcodes identified, even those that may not contain true cells.
This unfiltered counts matrix is stored in a `SingleCellExperiment` object[@doi:10.1038/s41592-019-0654-x] and output from the workflow to a `.rds` file with the suffix `_unfiltered.rds`.

`scpca-nf` performs filtering of empty droplets, removal of low-quality cells, normalization, dimensionality reduction, and cell type annotation (Figure 2A).
The unfiltered gene-by-cell counts matrices are filtered using `DropletUtils::emptyDropsCellRanger()`[@doi:10.1186/s13059-019-1662-y] to remove any barcodes that are not likely to contain cells, and all cells that pass are saved to a `SingleCellExperiment` object and `.rds` file with the suffix `_filtered.rds`.
Then, low-quality cells are identified and removed with `miQC` [@doi:10.1371/journal.pcbi.1009290], which jointly models the proportion of mitochondrial reads and detected genes per cell and calculates a probability that each cell is compromised.
The remaining cells are normalized [@doi:10.1186/s13059-016-0947-7] and undergo dimensionality reduction using both principal component analysis (PCA) and UMAP.
Finally, cell types are classified using two automated methods, `SingleR`[@doi:10.1038/s41590-018-0276-y] and `CellAssign`[@doi:10.1038/s41592-019-0529-1].
The results from this analysis are stored in a processed `SingleCellExperiment` object saved to a `.rds` file with the suffix `_processed.rds`.

To make downloading from the Portal convenient for R and Python users, downloads are available as either `SingleCellExperiment` or `AnnData`[@doi:10.1101/2021.12.16.473007] objects.
All `SingleCellExperiment` objects saved as `.rds` files are converted to `AnnData` objects and saved as `.hdf5` files in `scpca-nf` (Figure 2A).
Downloads contain the unfiltered, filtered, and processed objects from `scpca-nf` to allow users to choose to perform their own filtering and normalization or to start their analysis from a processed object.
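
As a rough illustration of how the three objects and two formats described above relate, the following Python sketch enumerates the files one might expect for a single library. The stage suffixes and file extensions come from the text; the `library_id` prefix, the function name `expected_files`, and the assumption that `AnnData` files reuse the same file stem as the `.rds` files are hypothetical and should be checked against an actual download.

```python
from typing import Dict

# Stage suffixes named in the text for the three output objects.
STAGES = ("unfiltered", "filtered", "processed")

# File extensions named in the text for each object type.
EXTENSIONS = {"SingleCellExperiment": ".rds", "AnnData": ".hdf5"}


def expected_files(library_id: str, object_type: str) -> Dict[str, str]:
    """Map each processing stage to a file name for one library.

    Assumptions (not stated in the text): files are prefixed with a
    library identifier, and AnnData files reuse the `.rds` file stem.
    """
    if object_type not in EXTENSIONS:
        raise ValueError(f"unknown object type: {object_type!r}")
    extension = EXTENSIONS[object_type]
    return {stage: f"{library_id}_{stage}{extension}" for stage in STAGES}
```

A user who wants to apply their own filtering would start from the `unfiltered` entry, while a user who wants to skip those steps would start from the `processed` entry.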

All downloads from the Portal include a quality control (QC) report with a summary of processing information (e.g., `alevin-fry` version), library statistics (e.g., the total number of cells), and a collection of diagnostic plots for each library (Figure 2B-G).
The knee plot includes all droplets (i.e., before removing empty droplets) sorted based on the total number of UMIs, and those retained after filtering empty droplets are indicated in the plot (Figure 2B).
For each cell that remains after filtering empty droplets, the total number of UMIs, the number of genes detected, and the number of mitochondrial reads are calculated and summarized in a scatter plot (Figure 2C).
We include plots showing the `miQC` model and which cells are kept and removed after filtering with `miQC` (Figure 2D-E).
A UMAP plot with cells colored by the total number of genes detected and a faceted UMAP plot where cells are colored by the expression of a top highly variable gene are also available (Figure 2F-G).
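
To make the quantities behind the knee plot and the per-cell metrics concrete, the Python sketch below computes them from a toy counts table. It is a minimal illustration under stated assumptions, not the code used in `scpca-nf`: the nested-dictionary input format, the function names, and the use of an `MT-` gene-symbol prefix to identify mitochondrial genes are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy input format: counts[barcode][gene] = UMI count.
Counts = Dict[str, Dict[str, int]]


def knee_ranks(counts: Counts, retained: Set[str]) -> List[Tuple[int, str, int, bool]]:
    """Rank every barcode by total UMIs (descending), as in a knee plot.

    Returns (rank, barcode, total_umis, is_retained) tuples, where
    is_retained marks barcodes kept after empty-droplet filtering.
    """
    totals = {barcode: sum(genes.values()) for barcode, genes in counts.items()}
    # Sort by total UMIs; break ties by barcode so the order is deterministic.
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, barcode, total, barcode in retained)
        for rank, (barcode, total) in enumerate(ordered, start=1)
    ]


def cell_metrics(
    counts: Counts, retained: Set[str], mito_prefix: str = "MT-"
) -> Dict[str, Dict[str, float]]:
    """Per-cell total UMIs, genes detected, and mitochondrial fraction.

    Only barcodes in `retained` are summarized. The `mito_prefix`
    convention is an assumption made for this sketch.
    """
    metrics: Dict[str, Dict[str, float]] = {}
    for barcode in sorted(retained):
        genes = counts[barcode]
        total = sum(genes.values())
        mito = sum(n for gene, n in genes.items() if gene.startswith(mito_prefix))
        metrics[barcode] = {
            "total_umis": total,
            "genes_detected": sum(1 for n in genes.values() if n > 0),
            "mito_fraction": mito / total if total else 0.0,
        }
    return metrics
```

Plotting rank against total UMIs on log-scaled axes, with points colored by `is_retained`, gives a knee plot of the kind described above, and the three per-cell values correspond to the axes and color scale of a metrics scatter plot.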


## Making samples with additional modalities available on the Portal
