diff --git a/.github/workflows/run_cell-type-consensus.yml b/.github/workflows/run_cell-type-consensus.yml index e3d110e04..ddd13b0b9 100644 --- a/.github/workflows/run_cell-type-consensus.yml +++ b/.github/workflows/run_cell-type-consensus.yml @@ -53,3 +53,4 @@ jobs: run: | cd ${MODULE_PATH} # run module script(s) here + ./assign-consensus-celltypes.sh diff --git a/analyses/cell-type-consensus/README.md b/analyses/cell-type-consensus/README.md index 3346815b9..9789083d4 100644 --- a/analyses/cell-type-consensus/README.md +++ b/analyses/cell-type-consensus/README.md @@ -3,7 +3,7 @@ This module explores creating rules that can be used to identify a consensus cell type label. Specifically, the cell type annotations obtained from both `SingleR` and `CellAssign` will be used to create a single cell type label in an ontology aware manner. -## Description +## Creating a reference for consensus cell types The goal of this module is to create a reference that can be used to define an ontology aware consensus cell type label for all cells across all ScPCA samples. This module performs a series of steps to accomplish that goal: @@ -25,23 +25,70 @@ If that is the case all other LCA terms are removed and `hematopoietic precursor - When the LCA is `epithelial cell` and the annotation from `BlueprintEncodeData` is `Epithelial cells`, then `epithelial cell` is used as the consensus label. - If the LCA is `bone cell`, `lining cell`, `blood cell`, `progenitor cell`, or `supporting cell`, no consensus label is defined. +See the [`scripts/README.md`](./scripts/README.md) for instructions on running the individual scripts used to generate the reference. -## Usage +## Assigning consensus cell types for ScPCA samples -See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module. +The `assign-consensus-celltypes.sh` script can be used to assign a consensus cell type for all samples in ScPCA. +This script outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples). +Cell type annotations assigned using `SingleR` with the `BlueprintEncodeData` reference and `CellAssign` using the `PanglaoDB` reference are included along side the assigned consensus cell type annotation and ontology identifier. -## Input files +To run this script use the following command: -TBD +```sh +./assign-consensus-celltypes.sh +``` -## Output files +### Input files -TBD + +The `assign-consensus-celltypes.sh` script requires the processed `SingleCellExperiment` objects (`_processed.rds`) for all ScPCA samples. +These files were obtained using the `download-data.py` script: + +```sh +# download SCE objects +./download-data.py +``` + +This script also requires two reference files, `panglao-cell-type-ontologies.tsv` and `consensus-cell-type-reference.tsv`. +See [Creating a reference for consensus cell types](#creating-a-reference-for-consensus-cell-types) and the [README.md in the references directory](./references/README.md) to learn more about the content of these files. + +### Output files + +Running the `assign-consensus-celltypes.sh` script will generate the following output files in `results`. + +``` +results +├── scpca-consensus-celltype-assignments.tsv +├── original-celltype-assignments + ├── _celltype-assignments.tsv + └── _celltype-assignments.tsv +``` + +The `original-celltyp-assignments` folder contains a single TSV file for each library in ScPCA, except for libraries obtained from cell lines. +These TSV files have the cell type annotations from running `SingleR` and `CellAssign` that can be found in the `colData` of the processed SCE objects. + +The `scpca-consensus-celltype-assignments.tsv` file contains cell type annotations for all cells in all ScPCA samples with the following columns: + +| | | +| --- | --- | +| `project_id` | ScPCA project id | +| `sample_id` | ScPCA sample id | +| `library_id` | ScPCA library id | +| `barcodes` | cell barcode | +| `singler_celltype_ontology` | Cell type ontology term assigned by `SingleR` | +| `singler_celltype_annotation` | Name associated with cell type ontology term assigned by `SingleR`; this term is equivalent to the `label.main` term in the `BlueprintEncodeData` reference | +| `cellassign_celltype_annotation` | Cell type assigned by `CellAssign`; this term is the original term found in the `PanglaoDB` reference file | +| `panglao_ontology` | Cell type ontology term associated with the term found in `cellassign_celltype_annotation` column | +| `panglao_annotation` | Name associated with the cell type ontology term in `panglao_ontology` | +| `blueprint_annotation_fine` | Fine grained cell type annotation (`label.fine`) from `BlueprintEncodeData` associated with the `singler_celltype_ontology` term | +| `consensus_ontology` | Cell type ontology term assigned as the consensus cell type | +| `consensus_annotation` | Name associated with the assigned consensus cell type in `consensus_ontology` | ## Software requirements -TBD +This module uses `renv` to manage software dependencies. ## Computational resources -TBD +This module does not require compute beyond what is generally available on a laptop. diff --git a/analyses/cell-type-consensus/assign-consensus-celltypes.sh b/analyses/cell-type-consensus/assign-consensus-celltypes.sh new file mode 100755 index 000000000..83897d93b --- /dev/null +++ b/analyses/cell-type-consensus/assign-consensus-celltypes.sh @@ -0,0 +1,47 @@ +#!/bin/bash + +# This script is used to create a single table with cell type assignments all cells from all ScPCA samples +# The existing cell type annotations from SingleR and CellAssign are saved to a TSV file for each sample +# Then all TSV files are combined into a single file and consensus cell types are assigned + +# Usage: ./assign-consensus-celltypes.sh + + +set -euo pipefail + +# navigate to where script lives +cd $(dirname "$0") +#module_dir=$(pwd) + +data_dir="../../data/current" +# path to save consensus results +scpca_consensus_assignments_file="results/scpca-consensus-celltype-assignments.tsv.gz" +# directory to store all individual tsv files +celltype_tsv_dir="results/original-celltype-assignments" +mkdir -p ${celltype_tsv_dir} + +# define reference input files +panglao_ref_file="references/panglao-cell-type-ontologies.tsv" +consensus_ref_file="references/consensus-cell-type-reference.tsv" + +# run script to export tsv file on all processed objects +for sce_file in $data_dir/SCPCP*/SCPCS*/*_processed.rds; do + + # define library ID + library_id=$(basename $sce_file | sed 's/_processed.rds$//') + + echo "Grabbing cell types for ${library_id}" + # get celltypes as tsv file + Rscript scripts/03-save-coldata.R \ + --sce_file $sce_file \ + --output_file ${celltype_tsv_dir}/${library_id}_celltype-assignments.tsv + +done + +echo "Combining TSVs and adding consensus labels" +# run script to combine all tsv files and assign consensus cell types +Rscript scripts/04-combine-celltype-tables.R \ + --celltype_tsv_dir $celltype_tsv_dir \ + --panglao_ref_file $panglao_ref_file \ + --consensus_ref_file $consensus_ref_file \ + --output_file $scpca_consensus_assignments_file diff --git a/analyses/cell-type-consensus/scripts/03-save-coldata.R b/analyses/cell-type-consensus/scripts/03-save-coldata.R new file mode 100644 index 000000000..84d431b11 --- /dev/null +++ b/analyses/cell-type-consensus/scripts/03-save-coldata.R @@ -0,0 +1,76 @@ +#!/usr/bin/env Rscript + +# This script is used to grab the colData from a SCE object and save it as a TSV file + +library(optparse) + +option_list <- list( + make_option( + opt_str = c("--sce_file"), + type = "character", + help = "Path to RDS file containing a processed SingleCellExperiment object from scpca-nf" + ), + make_option( + opt_str = c("--output_file"), + type = "character", + help = "Path to file where colData will be saved, must end in `.tsv`" + ) +) + +# Parse options +opt <- parse_args(OptionParser(option_list = option_list)) + +# Set up ----------------------------------------------------------------------- + +# make sure input files exist +stopifnot( + "sce file does not exist" = file.exists(opt$sce_file) +) + +# load SCE +suppressPackageStartupMessages({ + library(SingleCellExperiment) +}) + +# Extract colData -------------------------------------------------------------- + +# read in sce +sce <- readr::read_rds(opt$sce_file) + +# extract ids +library_id <- metadata(sce)$library_id +# account for multiplexed libraries that have multiple samples +# for now just combine sample ids into a single string and don't worry about demultiplexing +sample_id <- metadata(sce)$sample_id |> + paste0(collapse = ";") +project_id <- metadata(sce)$project_id + +# check if cell line since cell lines don't have any cell type assignments +# account for having more than one sample and a list of sample types +# all sample types should be the same theoretically +is_cell_line <- all(metadata(sce)$sample_type == "cell line") + +# only create and write table for non-cell line samples +if(!is_cell_line){ + + # get df with ids, barcodes, and cell type assignments + celltype_df <- colData(sce) |> + as.data.frame() |> + dplyr::mutate( + project_id = project_id, + sample_id = sample_id, + library_id = library_id + ) |> + dplyr::select( + project_id, + sample_id, + library_id, + barcodes, + contains("celltype") # get both singler and cellassign with ontology + ) + + # save tsv + readr::write_tsv(celltype_df, opt$output_file) + +} + diff --git a/analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R b/analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R new file mode 100644 index 000000000..453f2d4a9 --- /dev/null +++ b/analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R @@ -0,0 +1,108 @@ +#!/usr/bin/env Rscript + +# This script is used to combine all TSV files containing cell types into a single TSV file +# The output TSV file will include the following added columns: +# panglao_ontology: CL term assigned to panglao term +# panglao_annotation: human readable value associated with the CL term for panglao term +# blueprint_annotation_fine: Fine-grained annotation from blueprint associated with singler_celltype_ontology +# consensus_annotation: human readable name associated with the consensus label +# consensus_ontology: CL ontology term for the consensus cell type + +project_root <- rprojroot::find_root(rprojroot::has_dir(".github")) + +library(optparse) + +option_list <- list( + make_option( + opt_str = c("--celltype_tsv_dir"), + type = "character", + help = "Path to directory containing TSV files with cell type annotations from single samples. + All TSV files in this directory will be combined into a single file." + ), + make_option( + opt_str = c("--panglao_ref_file"), + default = file.path(project_root, "references", "panglao-cell-type-ontologies.tsv"), + type = "character", + help = "Path to file with panglao assignments and associated cell ontology ids" + ), + make_option( + opt_str = c("--consensus_ref_file"), + default = file.path(project_root, "references", "consensus-cell-type-reference.tsv"), + type = "character", + help = "Path to file containing the reference for assigning consensus cell type labels" + ), + make_option( + opt_str = c("--output_file"), + type = "character", + help = "Path to file where combined TSV file will be saved. + File name must end in either `.tsv` or `.tsv.gz` to save a compressed TSV file" + ) +) + +# Parse options +opt <- parse_args(OptionParser(option_list = option_list)) + +# Prep ref files --------------------------------------------------------------- + +# make sure reference files exist +stopifnot( + "panglao reference file does not exist" = file.exists(opt$panglao_ref_file), + "cell type consensus reference file does not exist" = file.exists(opt$consensus_ref_file), + "output file must end in `.tsv` or `.tsv.gz`" = stringr::str_detect(opt$output_file, ".tsv|.tsv.gz") +) + +# read in ref files +# change names for panglao ref to match what's in the consensus file +panglao_ref_df <- readr::read_tsv(opt$panglao_ref_file) |> + dplyr::rename( + panglao_ontology = ontology_id, + panglao_annotation = human_readable_value, + original_panglao_name = panglao_cell_type + ) + +consensus_ref_df <- readr::read_tsv(opt$consensus_ref_file) |> + # select columns to use for joining and consensus assigmments + dplyr::select( + panglao_ontology, + original_panglao_name, + blueprint_ontology, + consensus_annotation, + consensus_ontology + ) + +# grab singler ref from celldex +blueprint_ref <- celldex::BlueprintEncodeData() + +# get ontologies and human readable name into data frame for blueprint +# in scpca-nf we don't include the fine label so this lets us add it in +blueprint_df <- data.frame( + blueprint_ontology = blueprint_ref$label.ont, + blueprint_annotation_fine = blueprint_ref$label.fine +) |> + unique() |> + tidyr::drop_na() + +# get list of all TSV files +all_files <- list.files(path = opt$celltype_tsv_dir, + pattern = "*.tsv", + full.names = TRUE) + +# read in TSV files and combine into a single df +all_cells_df <- all_files |> + purrr::map(readr::read_tsv) |> + dplyr::bind_rows() |> + # add columns for panglao ontology and consensus + # first add panglao ontology + dplyr::left_join(panglao_ref_df, by = c("cellassign_celltype_annotation" = "original_panglao_name")) |> + # now add in all the blueprint columns + dplyr::left_join(blueprint_df, by = c("singler_celltype_ontology" = "blueprint_ontology")) |> + # then add consensus labels + dplyr::left_join(consensus_ref_df, + by = c("singler_celltype_ontology" = "blueprint_ontology", + "cellassign_celltype_annotation" = "original_panglao_name", + "panglao_ontology")) |> + # use unknown for NA annotation but keep ontology ID as NA + dplyr::mutate(consensus_annotation = dplyr::if_else(is.na(consensus_annotation), "Unknown", consensus_annotation)) + +# export file +readr::write_tsv(all_cells_df, opt$output_file) diff --git a/analyses/cell-type-consensus/scripts/README.md b/analyses/cell-type-consensus/scripts/README.md index e61b7e5cc..c30058410 100644 --- a/analyses/cell-type-consensus/scripts/README.md +++ b/analyses/cell-type-consensus/scripts/README.md @@ -13,3 +13,8 @@ Ontology terms and labels along with the `cell type` label from the reference fi 3. `02-prepare-consensus-reference.R`: This script is used to create a table with all consensus cell types. The output table will contain one row for each combination of cell types in `PanglaoDB` and `BlueprintEncodeData` from `celldex` where a consensus cell type was identified. If the combination is not included in the reference file, then no consensus cell type is assigned and can be set to "Unknown". + +4. `03-save-coldata.R`: This script is used to grab the cell type annotations from the `colData` of an individual processed SCE object and save the output to a TSV file. + +5. `04-combine-celltype-tables.R`: This script is used to combine individual TSV files with cell type annotations (output by `03-save-coldata.R`) into a single TSV file. +The consensus cell type reference is used to assign consensus cell types to all cells in the combined data frame and saved in the output TSV file.