diff --git a/content/04.methods.md b/content/04.methods.md
index f1dfee4..1203174 100644
--- a/content/04.methods.md
+++ b/content/04.methods.md
@@ -1,22 +1,44 @@
 ## Materials and Methods
 
-### Data generation
- - how data was generated in different labs using 10X and then sent to the Data Lab
+### Data generation and processing
 
-### Data processing (do we need this section?)
- - Mention that all data was processing using `scpca-nf` either by us or external submitters
+Raw data and metadata were generated and compiled by each lab and institution contributing to the Portal.
+Single-cell or single-nuclei libraries were generated using one of the commercially available kits from 10x Genomics.
+For bulk RNA-seq, RNA was collected and sequenced using either paired-end or single-end sequencing.
+For spatial transcriptomics, cDNA libraries were generated using the Visium kit from 10x Genomics.
+All libraries were processed using our open-source pipeline, `scpca-nf`, to produce summarized gene expression data.
 
 ### Processing single-cell and single-nuclei RNA-seq data with alevin-fry
- - Use of salmon alevin and alevin-fry to process all raw FASTQ files
- - Information on index used
- - Parameter choices for alevin-fry
+
+To quantify RNA-seq gene expression for each cell or nucleus in a library, `scpca-nf` uses `salmon alevin` [@doi:10.1186/s13059-020-02151-8] and `alevin-fry` [@doi:10.1038/s41592-022-01408-3] to generate a gene by cell counts matrix.
+Prior to mapping, we generated an index using transcripts from both spliced cDNA and unspliced cDNA sequences, denoted as the `splici` index [@doi:10.1038/s41592-022-01408-3].
+The index was generated from the human genome, GRCh38, Ensembl version 104.
+`salmon alevin` was run using selective alignment to the `splici` index with the `--rad` option to generate a reduced alignment data (RAD) file required for input to `alevin-fry`.
+
+The RAD file was used as input to the recommended `alevin-fry` workflow, with the following customizations.
+At the `generate-permit-list` step, we used the `unfiltered-pl` option to provide a list of expected barcodes specific to the 10x kit used to generate each library.
+The `quant` step was run using the `cr-like-em` resolution strategy for feature quantification and UMI de-duplication.
 
 ### Post alevin-fry processing of single-cell and single-nuclei RNA-seq data
- - filtering of empty droplets
- - removal of low quality cells
- - normalization
- - HVG selection
- - PCA and UMAP calculation
+
+The output from running `alevin-fry` includes a gene by cell counts matrix, with counts from both spliced and unspliced reads for all potential cell barcodes.
+This output is read into R to create a `SingleCellExperiment` using the `fishpond::load_fry()` function.
+The resulting `SingleCellExperiment` contains a `counts` assay with a gene by cell counts matrix where all spliced and unspliced reads for a given gene are totaled together.
+We also include a `spliced` assay that contains a gene by cell counts matrix with only spliced reads.
+These matrices include all potential cells, including empty droplets, and are provided in the "unfiltered" objects included in downloads from the Portal.
+
+Each droplet was tested for deviation from the ambient RNA profile using `DropletUtils::emptyDropsCellRanger()` and those with an FDR ≤ 0.01 were retained as likely cells.
+If a library did not have a sufficient number of droplets and `DropletUtils::emptyDropsCellRanger()` failed, cells with fewer than 100 UMIs were removed.
+Gene expression data for any cells that remain after filtering are provided in the "filtered" objects.
+
+In addition to removing empty droplets, `scpca-nf` also removes cells from downstream analysis that are likely to be compromised by damage or low-quality sequencing.
+`miQC` was used to calculate the probability of each cell being compromised [@doi:10.1371/journal.pcbi.1009290].
+Any cells with a likelihood of being compromised greater than 0.75 and fewer than 200 genes detected were removed before further processing.
+The gene expression counts from the remaining cells were log-normalized using the deconvolution method from Lun, Bach, and Marioni [@doi:10.1186/s13059-016-0947-7].
+`scran::modelGeneVar()` was used to model gene variance from the log-normalized counts and `scran::getTopHVGs()` was used to select the top 2000 high-variance genes.
+These were used as input to calculate the top 50 principal components using `scater::runPCA()`.
+Finally, UMAP embeddings were calculated from the principal components with `scater::runUMAP()`.
+The raw and log-normalized counts, list of 2000 high-variance genes, principal components, and UMAP embeddings are all stored in the "processed" object.
 
 ### Quantifying gene expression for libraries with CITE-seq or cell hashing
  - How we used alevin-fry to quantify ADT and HTO libraries