Workflow to assign consensus cell types to ScPCA samples #977

Merged
1 change: 1 addition & 0 deletions .github/workflows/run_cell-type-consensus.yml
@@ -53,3 +53,4 @@ jobs:
        run: |
          cd ${MODULE_PATH}
          # run module script(s) here
          ./assign-consensus-celltypes.sh
65 changes: 56 additions & 9 deletions analyses/cell-type-consensus/README.md
@@ -3,7 +3,7 @@
This module explores creating rules that can be used to identify a consensus cell type label.
Specifically, the cell type annotations obtained from both `SingleR` and `CellAssign` will be used to create a single cell type label in an ontology aware manner.

## Description
## Creating a reference for consensus cell types

The goal of this module is to create a reference that can be used to define an ontology aware consensus cell type label for all cells across all ScPCA samples.
This module performs a series of steps to accomplish that goal:
@@ -25,23 +25,70 @@ If that is the case all other LCA terms are removed and `hematopoietic precursor
- When the LCA is `epithelial cell` and the annotation from `BlueprintEncodeData` is `Epithelial cells`, then `epithelial cell` is used as the consensus label.
- If the LCA is `bone cell`, `lining cell`, `blood cell`, `progenitor cell`, or `supporting cell`, no consensus label is defined.

See the [`scripts/README.md`](./scripts/README.md) for instructions on running the individual scripts used to generate the reference.

## Usage
## Assigning consensus cell types for ScPCA samples

See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module.
The `assign-consensus-celltypes.sh` script can be used to assign a consensus cell type for all samples in ScPCA.
This script outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples).
Cell type annotations assigned using `SingleR` with the `BlueprintEncodeData` reference and `CellAssign` using the `PanglaoDB` reference are included alongside the assigned consensus cell type annotation and ontology identifier.

## Input files
To run this script, use the following command:

TBD
```sh
./assign-consensus-celltypes.sh
```

## Output files
### Input files

TBD

The `assign-consensus-celltypes.sh` script requires the processed `SingleCellExperiment` objects (`_processed.rds`) for all ScPCA samples.
These files were obtained using the `download-data.py` script:

```sh
# download SCE objects
./download-data.py
```

This script also requires two reference files, `panglao-cell-type-ontologies.tsv` and `consensus-cell-type-reference.tsv`.
See [Creating a reference for consensus cell types](#creating-a-reference-for-consensus-cell-types) and the [README.md in the references directory](./references/README.md) to learn more about the content of these files.
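
Before a full run, it can help to confirm that these inputs are in place. The following is a minimal sketch, not part of the module: `check_inputs` is a hypothetical helper, and it assumes the two reference files live in `references/` and the processed objects follow the `SCPCP*/SCPCS*/*_processed.rds` layout that `assign-consensus-celltypes.sh` loops over.

```sh
# Hypothetical helper (not part of this module): verify the inputs exist.
# Arguments: module directory (default: .) and data directory
# (default: ../../data/current, relative to the current working directory).
check_inputs() {
  local module_dir="${1:-.}"
  local data_dir="${2:-../../data/current}"
  local missing=0
  local ref n_sce

  # both reference files are expected in the module's references directory
  for ref in panglao-cell-type-ontologies.tsv consensus-cell-type-reference.tsv; do
    if [ ! -f "${module_dir}/references/${ref}" ]; then
      echo "Missing reference file: references/${ref}" >&2
      missing=1
    fi
  done

  # count processed SCE objects under project/sample directories
  n_sce=$(find "${data_dir}" -type f -path '*/SCPCP*/SCPCS*/*_processed.rds' 2>/dev/null | wc -l | tr -d ' ')
  echo "Found ${n_sce} processed SCE file(s)"
  if [ "${n_sce}" -eq 0 ]; then
    echo "No _processed.rds files found under ${data_dir}" >&2
    missing=1
  fi

  # 0 when everything was found, 1 otherwise
  return "${missing}"
}
```

Called from the module directory as `check_inputs . ../../data/current`, the function returns a non-zero status when a reference file or the processed objects are missing, so it can gate the main script (for example, `check_inputs && ./assign-consensus-celltypes.sh`).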

### Output files

Running the `assign-consensus-celltypes.sh` script will generate the following output files in `results`.

```
results
├── scpca-consensus-celltype-assignments.tsv
└── original-celltype-assignments
    ├── <library_id>_celltype-assignments.tsv
    └── <library_id>_celltype-assignments.tsv
```

The `original-celltype-assignments` folder contains a single TSV file for each library in ScPCA, except for libraries obtained from cell lines.
These TSV files contain the `SingleR` and `CellAssign` cell type annotations stored in the `colData` of the processed SCE objects.

The `scpca-consensus-celltype-assignments.tsv` file contains cell type annotations for all cells in all ScPCA samples with the following columns:

| Column | Description |
| --- | --- |
| `project_id` | ScPCA project id |
| `sample_id` | ScPCA sample id |
| `library_id` | ScPCA library id |
| `barcodes` | cell barcode |
| `singler_celltype_ontology` | Cell type ontology term assigned by `SingleR` |
| `singler_celltype_annotation` | Name associated with cell type ontology term assigned by `SingleR`; this term is equivalent to the `label.main` term in the `BlueprintEncodeData` reference |
| `cellassign_celltype_annotation` | Cell type assigned by `CellAssign`; this term is the original term found in the `PanglaoDB` reference file |
| `panglao_ontology` | Cell type ontology term associated with the term found in `cellassign_celltype_annotation` column |
| `panglao_annotation` | Name associated with the cell type ontology term in `panglao_ontology` |
| `blueprint_annotation_fine` | Fine grained cell type annotation (`label.fine`) from `BlueprintEncodeData` associated with the `singler_celltype_ontology` term |
| `consensus_ontology` | Cell type ontology term assigned as the consensus cell type |
| `consensus_annotation` | Name associated with the assigned consensus cell type in `consensus_ontology` |
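
As a quick sanity check on the combined table, the number of cells per consensus label can be tallied from the command line. This is a minimal sketch under stated assumptions: `count_consensus_labels` is a hypothetical helper that is not part of the module, and it only relies on the `consensus_annotation` column listed above. Because `assign-consensus-celltypes.sh` in this pull request writes the file with a `.tsv.gz` extension, the sketch accepts either a plain or a gzip-compressed TSV.

```sh
# Hypothetical helper (not part of this module): count cells per consensus label.
# Accepts a plain .tsv or a gzip-compressed .tsv.gz file.
count_consensus_labels() {
  local tsv_file="$1"
  local tab
  tab=$(printf '\t')

  # decompress on the fly when the file name ends in .gz
  case "${tsv_file}" in
    *.gz) gzip -cd "${tsv_file}" ;;
    *) cat "${tsv_file}" ;;
  esac | awk -F '\t' '
    NR == 1 {
      # locate the consensus_annotation column from the header row
      for (i = 1; i <= NF; i++) if ($i == "consensus_annotation") col = i
      if (!col) { print "consensus_annotation column not found" > "/dev/stderr"; exit 1 }
      next
    }
    { counts[$col]++ }
    END { for (label in counts) printf "%s\t%d\n", label, counts[label] }
  ' | sort -t "${tab}" -k2,2nr -k1,1
}
```

For example, `count_consensus_labels results/scpca-consensus-celltype-assignments.tsv.gz` prints one line per label with its cell count, largest first. An unexpectedly large `Unknown` count may suggest that many label pairs are absent from the consensus reference.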

## Software requirements

TBD
This module uses `renv` to manage software dependencies.

## Computational resources

TBD
This module does not require compute beyond what is generally available on a laptop.
47 changes: 47 additions & 0 deletions analyses/cell-type-consensus/assign-consensus-celltypes.sh
@@ -0,0 +1,47 @@
#!/bin/bash

# This script is used to create a single table with cell type assignments for all cells from all ScPCA samples
# The existing cell type annotations from SingleR and CellAssign are saved to a TSV file for each sample
# Then all TSV files are combined into a single file and consensus cell types are assigned

# Usage: ./assign-consensus-celltypes.sh


set -euo pipefail

# navigate to where script lives
cd "$(dirname "$0")"
#module_dir=$(pwd)

data_dir="../../data/current"
# path to save consensus results
scpca_consensus_assignments_file="results/scpca-consensus-celltype-assignments.tsv.gz"
# directory to store all individual tsv files
celltype_tsv_dir="results/original-celltype-assignments"
mkdir -p ${celltype_tsv_dir}

# define reference input files
panglao_ref_file="references/panglao-cell-type-ontologies.tsv"
consensus_ref_file="references/consensus-cell-type-reference.tsv"

# run script to export tsv file on all processed objects
for sce_file in "$data_dir"/SCPCP*/SCPCS*/*_processed.rds; do

    # define library ID
    library_id=$(basename "$sce_file" | sed 's/_processed\.rds$//')

    echo "Grabbing cell types for ${library_id}"
    # get celltypes as tsv file
    Rscript scripts/03-save-coldata.R \
      --sce_file "$sce_file" \
      --output_file "${celltype_tsv_dir}/${library_id}_celltype-assignments.tsv"

done

echo "Combining TSVs and adding consensus labels"
# run script to combine all tsv files and assign consensus cell types
Rscript scripts/04-combine-celltype-tables.R \
  --celltype_tsv_dir "$celltype_tsv_dir" \
  --panglao_ref_file "$panglao_ref_file" \
  --consensus_ref_file "$consensus_ref_file" \
  --output_file "$scpca_consensus_assignments_file"
76 changes: 76 additions & 0 deletions analyses/cell-type-consensus/scripts/03-save-coldata.R
@@ -0,0 +1,76 @@
#!/usr/bin/env Rscript

# This script is used to grab the colData from a SCE object and save it as a TSV file

library(optparse)

option_list <- list(
  make_option(
    opt_str = c("--sce_file"),
    type = "character",
    help = "Path to RDS file containing a processed SingleCellExperiment object from scpca-nf"
  ),
  make_option(
    opt_str = c("--output_file"),
    type = "character",
    help = "Path to file where colData will be saved, must end in `.tsv`"
  )
)

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# Set up -----------------------------------------------------------------------

# make sure input files exist
stopifnot(
  "sce file does not exist" = file.exists(opt$sce_file)
)

# load SCE
suppressPackageStartupMessages({
  library(SingleCellExperiment)
})

# Extract colData --------------------------------------------------------------

# read in sce
sce <- readr::read_rds(opt$sce_file)

# extract ids
library_id <- metadata(sce)$library_id
# account for multiplexed libraries that have multiple samples
# for now just combine sample ids into a single string and don't worry about demultiplexing
sample_id <- metadata(sce)$sample_id |>
  paste0(collapse = ";")
Comment on lines +44 to +45
Member

This seems like the correct solution (for now at least).

project_id <- metadata(sce)$project_id

# check if cell line since cell lines don't have any cell type assignments
# account for having more than one sample and a list of sample types
# all sample types should be the same theoretically
is_cell_line <- all(metadata(sce)$sample_type == "cell line")

# only create and write table for non-cell line samples
if (!is_cell_line) {

  # get df with ids, barcodes, and cell type assignments
  celltype_df <- colData(sce) |>
    as.data.frame() |>
    dplyr::mutate(
      project_id = project_id,
      sample_id = sample_id,
      library_id = library_id
    ) |>
    dplyr::select(
      project_id,
      sample_id,
      library_id,
      barcodes,
      contains("celltype") # get both singler and cellassign with ontology
    )

  # save tsv
  readr::write_tsv(celltype_df, opt$output_file)

}

108 changes: 108 additions & 0 deletions analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R
@@ -0,0 +1,108 @@
#!/usr/bin/env Rscript

# This script is used to combine all TSV files containing cell types into a single TSV file
# The output TSV file will include the following added columns:
# panglao_ontology: CL term assigned to panglao term
# panglao_annotation: human readable value associated with the CL term for panglao term
# blueprint_annotation_fine: Fine-grained annotation from blueprint associated with singler_celltype_ontology
# consensus_annotation: human readable name associated with the consensus label
# consensus_ontology: CL ontology term for the consensus cell type

project_root <- rprojroot::find_root(rprojroot::has_dir(".github"))

library(optparse)

option_list <- list(
  make_option(
    opt_str = c("--celltype_tsv_dir"),
    type = "character",
    help = "Path to directory containing TSV files with cell type annotations from single samples.
      All TSV files in this directory will be combined into a single file."
  ),
  make_option(
    opt_str = c("--panglao_ref_file"),
    default = file.path(project_root, "references", "panglao-cell-type-ontologies.tsv"),
    type = "character",
    help = "Path to file with panglao assignments and associated cell ontology ids"
  ),
  make_option(
    opt_str = c("--consensus_ref_file"),
    default = file.path(project_root, "references", "consensus-cell-type-reference.tsv"),
    type = "character",
    help = "Path to file containing the reference for assigning consensus cell type labels"
  ),
  make_option(
    opt_str = c("--output_file"),
    type = "character",
    help = "Path to file where combined TSV file will be saved.
      File name must end in either `.tsv` or `.tsv.gz` to save a compressed TSV file"
  )
)

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# Prep ref files ---------------------------------------------------------------

# make sure reference files exist
stopifnot(
  "panglao reference file does not exist" = file.exists(opt$panglao_ref_file),
  "cell type consensus reference file does not exist" = file.exists(opt$consensus_ref_file),
  # escape the dots and anchor the pattern so only a `.tsv` or `.tsv.gz` suffix passes
  "output file must end in `.tsv` or `.tsv.gz`" = stringr::str_detect(opt$output_file, "\\.tsv(\\.gz)?$")
)

# read in ref files
# change names for panglao ref to match what's in the consensus file
panglao_ref_df <- readr::read_tsv(opt$panglao_ref_file) |>
  dplyr::rename(
    panglao_ontology = ontology_id,
    panglao_annotation = human_readable_value,
    original_panglao_name = panglao_cell_type
  )

consensus_ref_df <- readr::read_tsv(opt$consensus_ref_file) |>
  # select columns to use for joining and consensus assignments
  dplyr::select(
    panglao_ontology,
    original_panglao_name,
    blueprint_ontology,
    consensus_annotation,
    consensus_ontology
  )

# grab singler ref from celldex
blueprint_ref <- celldex::BlueprintEncodeData()

# get ontologies and human readable name into data frame for blueprint
# in scpca-nf we don't include the fine label so this lets us add it in
blueprint_df <- data.frame(
  blueprint_ontology = blueprint_ref$label.ont,
  blueprint_annotation_fine = blueprint_ref$label.fine
Comment on lines +79 to +80
Member

Are there any cases where the ontology maps to more than one fine label? I feel like this is the case, which would result in duplicate rows when joining.

Member Author

No. The ontology and fine label are 1:1. I checked this by making sure the length of the unique values in this data frame are equivalent and equal to the number of rows in the dataframe. Also from the celldex vignette:

> Typically, each reference provides three levels of cell type annotation in its column metadata:
>
> - `label.main`, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
> - `label.fine`, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
> - `label.ont`, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

This tells me that the label.fine are the names associated with the ontology IDs in the cell type ontology. I checked a few of these manually and they match up.

Member

I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.

In the case of Blueprint, it turns out that this is true, but it is not always; the HumanPrimaryCellAtlas data has more fine labels than ontologies.

> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66

So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

Member Author

> So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

Perhaps the better approach here is to directly map the ontology ids to the name associated with them ourselves and remove the `blueprint_annotation_fine` column altogether. I think I'll go ahead and do this, but in the script that creates the consensus reference to begin with and not here. I'll file a new issue and new PR to address this before we actually run this on all ScPCA samples.

) |>
  unique() |>
  tidyr::drop_na()

# get list of all TSV files
# `pattern` is a regular expression, not a glob, so match the `.tsv` suffix explicitly
all_files <- list.files(
  path = opt$celltype_tsv_dir,
  pattern = "\\.tsv$",
  full.names = TRUE
)

# read in TSV files and combine into a single df
all_cells_df <- all_files |>
  purrr::map(readr::read_tsv) |>
  dplyr::bind_rows() |>
  # add columns for panglao ontology and consensus
  # first add panglao ontology
  dplyr::left_join(panglao_ref_df, by = c("cellassign_celltype_annotation" = "original_panglao_name")) |>
  # now add in all the blueprint columns
  dplyr::left_join(blueprint_df, by = c("singler_celltype_ontology" = "blueprint_ontology")) |>
  # then add consensus labels
  dplyr::left_join(
    consensus_ref_df,
    by = c(
      "singler_celltype_ontology" = "blueprint_ontology",
      "cellassign_celltype_annotation" = "original_panglao_name",
      "panglao_ontology"
    )
  ) |>
  # use unknown for NA annotation but keep ontology ID as NA
  dplyr::mutate(consensus_annotation = dplyr::if_else(is.na(consensus_annotation), "Unknown", consensus_annotation))

# export file
readr::write_tsv(all_cells_df, opt$output_file)
5 changes: 5 additions & 0 deletions analyses/cell-type-consensus/scripts/README.md
@@ -13,3 +13,8 @@ Ontology terms and labels along with the `cell type` label from the reference fi
3. `02-prepare-consensus-reference.R`: This script is used to create a table with all consensus cell types.
The output table will contain one row for each combination of cell types in `PanglaoDB` and `BlueprintEncodeData` from `celldex` where a consensus cell type was identified.
If the combination is not included in the reference file, then no consensus cell type is assigned and can be set to "Unknown".

4. `03-save-coldata.R`: This script is used to grab the cell type annotations from the `colData` of an individual processed SCE object and save the output to a TSV file.

5. `04-combine-celltype-tables.R`: This script is used to combine individual TSV files with cell type annotations (output by `03-save-coldata.R`) into a single TSV file.
The consensus cell type reference is used to assign consensus cell types to all cells in the combined data frame, and the result is saved to the output TSV file.
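
The lookup-with-fallback behavior described in items 3 and 5 can be illustrated outside of R. The sketch below is not the module's implementation (that is the `dplyr::left_join()` chain in `04-combine-celltype-tables.R`); it is a simplified `awk` analogue with a hypothetical `assign_consensus` helper. It assumes a reduced reference with only three columns and joins on two keys, whereas the actual script also joins on `panglao_ontology`.

```sh
# Hypothetical helper illustrating the lookup; not the module's implementation.
# ref_file columns (assumed order):
#   blueprint_ontology, original_panglao_name, consensus_annotation
# cells_file columns (assumed order):
#   barcodes, singler_celltype_ontology, cellassign_celltype_annotation
assign_consensus() {
  local ref_file="$1"
  local cells_file="$2"

  awk -F '\t' -v OFS='\t' '
    # first file: store the consensus label for each (ontology, panglao name) pair
    NR == FNR { if (FNR > 1) consensus[$1 SUBSEP $2] = $3; next }
    # second file: pass the header through with the new column name
    FNR == 1 { print $0, "consensus_annotation"; next }
    {
      # pairs that are absent from the reference fall back to "Unknown"
      key = $2 SUBSEP $3
      label = (key in consensus) ? consensus[key] : "Unknown"
      print $0, label
    }
  ' "${ref_file}" "${cells_file}"
}
```

Running `assign_consensus ref.tsv cells.tsv` prints the cell table with one added `consensus_annotation` column. Keeping every row of the cell table and filling in `Unknown` mirrors the left join followed by the `NA` replacement in the R script.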