Skip to content

Commit

Permalink
Merge pull request #977 from allyhawkins/allyhawkins/assign-scpca-con…
Browse files Browse the repository at this point in the history
…sensus

Workflow to assign consensus cell types to ScPCA samples
  • Loading branch information
allyhawkins authored Jan 10, 2025
2 parents c46722c + d7b93ed commit a18421e
Show file tree
Hide file tree
Showing 6 changed files with 293 additions and 9 deletions.
1 change: 1 addition & 0 deletions .github/workflows/run_cell-type-consensus.yml
Original file line number Diff line number Diff line change
Expand Up @@ -53,3 +53,4 @@ jobs:
run: |
cd ${MODULE_PATH}
# run module script(s) here
./assign-consensus-celltypes.sh
65 changes: 56 additions & 9 deletions analyses/cell-type-consensus/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
This module explores creating rules that can be used to identify a consensus cell type label.
Specifically, the cell type annotations obtained from both `SingleR` and `CellAssign` will be used to create a single cell type label in an ontology aware manner.

## Description
## Creating a reference for consensus cell types

The goal of this module is to create a reference that can be used to define an ontology aware consensus cell type label for all cells across all ScPCA samples.
This module performs a series of steps to accomplish that goal:
Expand All @@ -25,23 +25,70 @@ If that is the case all other LCA terms are removed and `hematopoietic precursor
- When the LCA is `epithelial cell` and the annotation from `BlueprintEncodeData` is `Epithelial cells`, then `epithelial cell` is used as the consensus label.
- If the LCA is `bone cell`, `lining cell`, `blood cell`, `progenitor cell`, or `supporting cell`, no consensus label is defined.

See the [`scripts/README.md`](./scripts/README.md) for instructions on running the individual scripts used to generate the reference.

## Usage
## Assigning consensus cell types for ScPCA samples

See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module.
The `assign-consensus-celltypes.sh` script can be used to assign a consensus cell type for all samples in ScPCA.
This script outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples).
Cell type annotations assigned using `SingleR` with the `BlueprintEncodeData` reference and `CellAssign` using the `PanglaoDB` reference are included along side the assigned consensus cell type annotation and ontology identifier.

## Input files
To run this script use the following command:

TBD
```sh
./assign-consensus-celltypes.sh
```

## Output files
### Input files

TBD

The `assign-consensus-celltypes.sh` script requires the processed `SingleCellExperiment` objects (`_processed.rds`) for all ScPCA samples.
These files were obtained using the `download-data.py` script:

```sh
# download SCE objects
./download-data.py
```

This script also requires two reference files, `panglao-cell-type-ontologies.tsv` and `consensus-cell-type-reference.tsv`.
See [Creating a reference for consensus cell types](#creating-a-reference-for-consensus-cell-types) and the [README.md in the references directory](./references/README.md) to learn more about the content of these files.

### Output files

Running the `assign-consensus-celltypes.sh` script will generate the following output files in `results`.

```
results
├── scpca-consensus-celltype-assignments.tsv
├── original-celltype-assignments
├── <library_id>_celltype-assignments.tsv
└── <library_id>_celltype-assignments.tsv
```

The `original-celltyp-assignments` folder contains a single TSV file for each library in ScPCA, except for libraries obtained from cell lines.
These TSV files have the cell type annotations from running `SingleR` and `CellAssign` that can be found in the `colData` of the processed SCE objects.

The `scpca-consensus-celltype-assignments.tsv` file contains cell type annotations for all cells in all ScPCA samples with the following columns:

| | |
| --- | --- |
| `project_id` | ScPCA project id |
| `sample_id` | ScPCA sample id |
| `library_id` | ScPCA library id |
| `barcodes` | cell barcode |
| `singler_celltype_ontology` | Cell type ontology term assigned by `SingleR` |
| `singler_celltype_annotation` | Name associated with cell type ontology term assigned by `SingleR`; this term is equivalent to the `label.main` term in the `BlueprintEncodeData` reference |
| `cellassign_celltype_annotation` | Cell type assigned by `CellAssign`; this term is the original term found in the `PanglaoDB` reference file |
| `panglao_ontology` | Cell type ontology term associated with the term found in `cellassign_celltype_annotation` column |
| `panglao_annotation` | Name associated with the cell type ontology term in `panglao_ontology` |
| `blueprint_annotation_fine` | Fine grained cell type annotation (`label.fine`) from `BlueprintEncodeData` associated with the `singler_celltype_ontology` term |
| `consensus_ontology` | Cell type ontology term assigned as the consensus cell type |
| `consensus_annotation` | Name associated with the assigned consensus cell type in `consensus_ontology` |

## Software requirements

TBD
This module uses `renv` to manage software dependencies.

## Computational resources

TBD
This module does not require compute beyond what is generally available on a laptop.
47 changes: 47 additions & 0 deletions analyses/cell-type-consensus/assign-consensus-celltypes.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
#!/bin/bash

# This script is used to create a single table with cell type assignments all cells from all ScPCA samples
# The existing cell type annotations from SingleR and CellAssign are saved to a TSV file for each sample
# Then all TSV files are combined into a single file and consensus cell types are assigned

# Usage: ./assign-consensus-celltypes.sh


set -euo pipefail

# navigate to where script lives
cd $(dirname "$0")
#module_dir=$(pwd)

data_dir="../../data/current"
# path to save consensus results
scpca_consensus_assignments_file="results/scpca-consensus-celltype-assignments.tsv.gz"
# directory to store all individual tsv files
celltype_tsv_dir="results/original-celltype-assignments"
mkdir -p ${celltype_tsv_dir}

# define reference input files
panglao_ref_file="references/panglao-cell-type-ontologies.tsv"
consensus_ref_file="references/consensus-cell-type-reference.tsv"

# run script to export tsv file on all processed objects
for sce_file in $data_dir/SCPCP*/SCPCS*/*_processed.rds; do

# define library ID
library_id=$(basename $sce_file | sed 's/_processed.rds$//')

echo "Grabbing cell types for ${library_id}"
# get celltypes as tsv file
Rscript scripts/03-save-coldata.R \
--sce_file $sce_file \
--output_file ${celltype_tsv_dir}/${library_id}_celltype-assignments.tsv

done

echo "Combining TSVs and adding consensus labels"
# run script to combine all tsv files and assign consensus cell types
Rscript scripts/04-combine-celltype-tables.R \
--celltype_tsv_dir $celltype_tsv_dir \
--panglao_ref_file $panglao_ref_file \
--consensus_ref_file $consensus_ref_file \
--output_file $scpca_consensus_assignments_file
76 changes: 76 additions & 0 deletions analyses/cell-type-consensus/scripts/03-save-coldata.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
#!/usr/bin/env Rscript

# This script is used to grab the colData from a SCE object and save it as a TSV file

library(optparse)

option_list <- list(
make_option(
opt_str = c("--sce_file"),
type = "character",
help = "Path to RDS file containing a processed SingleCellExperiment object from scpca-nf"
),
make_option(
opt_str = c("--output_file"),
type = "character",
help = "Path to file where colData will be saved, must end in `.tsv`"
)
)

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# Set up -----------------------------------------------------------------------

# make sure input files exist
stopifnot(
"sce file does not exist" = file.exists(opt$sce_file)
)

# load SCE
suppressPackageStartupMessages({
library(SingleCellExperiment)
})

# Extract colData --------------------------------------------------------------

# read in sce
sce <- readr::read_rds(opt$sce_file)

# extract ids
library_id <- metadata(sce)$library_id
# account for multiplexed libraries that have multiple samples
# for now just combine sample ids into a single string and don't worry about demultiplexing
sample_id <- metadata(sce)$sample_id |>
paste0(collapse = ";")
project_id <- metadata(sce)$project_id

# check if cell line since cell lines don't have any cell type assignments
# account for having more than one sample and a list of sample types
# all sample types should be the same theoretically
is_cell_line <- all(metadata(sce)$sample_type == "cell line")

# only create and write table for non-cell line samples
if(!is_cell_line){

# get df with ids, barcodes, and cell type assignments
celltype_df <- colData(sce) |>
as.data.frame() |>
dplyr::mutate(
project_id = project_id,
sample_id = sample_id,
library_id = library_id
) |>
dplyr::select(
project_id,
sample_id,
library_id,
barcodes,
contains("celltype") # get both singler and cellassign with ontology
)

# save tsv
readr::write_tsv(celltype_df, opt$output_file)

}

108 changes: 108 additions & 0 deletions analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
#!/usr/bin/env Rscript

# This script is used to combine all TSV files containing cell types into a single TSV file
# The output TSV file will include the following added columns:
# panglao_ontology: CL term assigned to panglao term
# panglao_annotation: human readable value associated with the CL term for panglao term
# blueprint_annotation_fine: Fine-grained annotation from blueprint associated with singler_celltype_ontology
# consensus_annotation: human readable name associated with the consensus label
# consensus_ontology: CL ontology term for the consensus cell type

project_root <- rprojroot::find_root(rprojroot::has_dir(".github"))

library(optparse)

option_list <- list(
make_option(
opt_str = c("--celltype_tsv_dir"),
type = "character",
help = "Path to directory containing TSV files with cell type annotations from single samples.
All TSV files in this directory will be combined into a single file."
),
make_option(
opt_str = c("--panglao_ref_file"),
default = file.path(project_root, "references", "panglao-cell-type-ontologies.tsv"),
type = "character",
help = "Path to file with panglao assignments and associated cell ontology ids"
),
make_option(
opt_str = c("--consensus_ref_file"),
default = file.path(project_root, "references", "consensus-cell-type-reference.tsv"),
type = "character",
help = "Path to file containing the reference for assigning consensus cell type labels"
),
make_option(
opt_str = c("--output_file"),
type = "character",
help = "Path to file where combined TSV file will be saved.
File name must end in either `.tsv` or `.tsv.gz` to save a compressed TSV file"
)
)

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# Prep ref files ---------------------------------------------------------------

# make sure reference files exist
stopifnot(
"panglao reference file does not exist" = file.exists(opt$panglao_ref_file),
"cell type consensus reference file does not exist" = file.exists(opt$consensus_ref_file),
"output file must end in `.tsv` or `.tsv.gz`" = stringr::str_detect(opt$output_file, ".tsv|.tsv.gz")
)

# read in ref files
# change names for panglao ref to match what's in the consensus file
panglao_ref_df <- readr::read_tsv(opt$panglao_ref_file) |>
dplyr::rename(
panglao_ontology = ontology_id,
panglao_annotation = human_readable_value,
original_panglao_name = panglao_cell_type
)

consensus_ref_df <- readr::read_tsv(opt$consensus_ref_file) |>
# select columns to use for joining and consensus assigmments
dplyr::select(
panglao_ontology,
original_panglao_name,
blueprint_ontology,
consensus_annotation,
consensus_ontology
)

# grab singler ref from celldex
blueprint_ref <- celldex::BlueprintEncodeData()

# get ontologies and human readable name into data frame for blueprint
# in scpca-nf we don't include the fine label so this lets us add it in
blueprint_df <- data.frame(
blueprint_ontology = blueprint_ref$label.ont,
blueprint_annotation_fine = blueprint_ref$label.fine
) |>
unique() |>
tidyr::drop_na()

# get list of all TSV files
all_files <- list.files(path = opt$celltype_tsv_dir,
pattern = "*.tsv",
full.names = TRUE)

# read in TSV files and combine into a single df
all_cells_df <- all_files |>
purrr::map(readr::read_tsv) |>
dplyr::bind_rows() |>
# add columns for panglao ontology and consensus
# first add panglao ontology
dplyr::left_join(panglao_ref_df, by = c("cellassign_celltype_annotation" = "original_panglao_name")) |>
# now add in all the blueprint columns
dplyr::left_join(blueprint_df, by = c("singler_celltype_ontology" = "blueprint_ontology")) |>
# then add consensus labels
dplyr::left_join(consensus_ref_df,
by = c("singler_celltype_ontology" = "blueprint_ontology",
"cellassign_celltype_annotation" = "original_panglao_name",
"panglao_ontology")) |>
# use unknown for NA annotation but keep ontology ID as NA
dplyr::mutate(consensus_annotation = dplyr::if_else(is.na(consensus_annotation), "Unknown", consensus_annotation))

# export file
readr::write_tsv(all_cells_df, opt$output_file)
5 changes: 5 additions & 0 deletions analyses/cell-type-consensus/scripts/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,8 @@ Ontology terms and labels along with the `cell type` label from the reference fi
3. `02-prepare-consensus-reference.R`: This script is used to create a table with all consensus cell types.
The output table will contain one row for each combination of cell types in `PanglaoDB` and `BlueprintEncodeData` from `celldex` where a consensus cell type was identified.
If the combination is not included in the reference file, then no consensus cell type is assigned and can be set to "Unknown".

4. `03-save-coldata.R`: This script is used to grab the cell type annotations from the `colData` of an individual processed SCE object and save the output to a TSV file.

5. `04-combine-celltype-tables.R`: This script is used to combine individual TSV files with cell type annotations (output by `03-save-coldata.R`) into a single TSV file.
The consensus cell type reference is used to assign consensus cell types to all cells in the combined data frame and saved in the output TSV file.

0 comments on commit a18421e

Please sign in to comment.