Workflow to assign consensus cell types to ScPCA samples #977

allyhawkins · 2025-01-08T20:16:29Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #969

What is the goal of this pull request?

Here I'm adding a workflow to assign consensus cell type labels to all cells in ScPCA. It produces a single TSV file that contains every cell from every ScPCA sample (except cell line samples) along with all cell type annotations. The table includes the IDs (project, sample, library), the original annotations from scpca-nf, the ontology ID and name for the Panglao term, and new columns for the consensus labels. Any combination of annotations that is present in consensus-cell-type-reference.tsv is assigned the matching consensus label; all other combinations are assigned "Unknown" as the consensus_annotation and NA as the consensus_ontology.

Briefly describe the general approach you took to achieve this goal.

I wrote two short scripts to do this:

The first script reads in a single SCE object, grabs the cell type annotations from the colData, and saves them to a single TSV file.
The second script reads in a directory with all the individual TSV files from step 1 and combines them into a single data frame. This script also takes in the consensus reference file and adds the consensus labels for any combinations found in the reference. All others are labeled "Unknown".

These two steps are combined into a single workflow that can be run on all samples in ScPCA.
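
The join performed by the second script can be sketched roughly as follows. This is a minimal base R illustration under stated assumptions, not the script's actual code: the toy tables, the `barcode` and `panglao_ontology` column names, and the `ONT:` identifiers are placeholders invented for the example. Only `blueprint_ontology`, `consensus_annotation`, and `consensus_ontology` are names taken from this PR.

```r
# Toy per-cell table standing in for the combined output of the first script.
# Column names other than blueprint_ontology are assumptions for illustration.
cells <- data.frame(
  barcode = c("AAAC", "AAAG", "AAAT"),
  blueprint_ontology = c("ONT:1", "ONT:2", "ONT:3"),
  panglao_ontology = c("ONT:1", "ONT:9", NA),
  stringsAsFactors = FALSE
)

# Toy consensus reference: one row per annotation combination with a consensus label
consensus_ref <- data.frame(
  blueprint_ontology = "ONT:1",
  panglao_ontology = "ONT:1",
  consensus_ontology = "ONT:1",
  consensus_annotation = "example cell type",
  stringsAsFactors = FALSE
)

# Left join: keep every cell and attach consensus columns where the combination matches
combined <- merge(
  cells, consensus_ref,
  by = c("blueprint_ontology", "panglao_ontology"),
  all.x = TRUE, sort = FALSE
)

# Combinations missing from the reference become "Unknown"; the ontology stays NA
combined$consensus_annotation[is.na(combined$consensus_annotation)] <- "Unknown"
```

Using a left join keeps one row per cell as long as each combination appears at most once in the reference.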

A few things to note:

When making the final TSV file, I am also incorporating the blueprint_annotation_fine column, since that's the name that's directly linked to the ontology term in SingleR.
I also added in the columns with the assigned ontology and name for the original Panglao terms.
Since any samples with cell lines won't have columns for cell type annotations, I skip those. Are there any other scenarios I should account for?
I have not yet run this on the entire ScPCA dataset, but I tested it with a small set of samples. I also added the workflow to GHA here so it should run on all of the test data. Once the code is reviewed I will initiate the run to actually generate the TSV file and save it to S3.
Along those same lines, there are some small renv changes that are needed and since the GHA for this module runs with the Docker image, I'll file a separate PR to update that first. Until that gets in, I expect CI to fail.
I also updated the README to describe how to run the workflow.
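
One possible way to guard against libraries that lack annotation columns, sketched here as an assumption rather than what the PR does, is to check for the expected columns before writing anything. The column names and the `has_celltype_columns()` helper below are hypothetical, and the toy data frames stand in for colData converted to data frames.

```r
# Column names are assumptions for illustration; the real names come from scpca-nf
expected_cols <- c("singler_celltype_annotation", "cellassign_celltype_annotation")

# Hypothetical helper: TRUE only when every expected annotation column is present
has_celltype_columns <- function(coldata_df, required = expected_cols) {
  all(required %in% colnames(coldata_df))
}

# Toy stand-ins for colData converted to data frames
annotated <- data.frame(
  barcodes = "AAAC",
  singler_celltype_annotation = "T cell",
  cellassign_celltype_annotation = "T cell"
)
cell_line <- data.frame(barcodes = "AAAG")

for (df in list(annotated, cell_line)) {
  if (!has_celltype_columns(df)) {
    message("No cell type annotations found; skipping this library.")
    next
  }
  # ... write the annotation columns to a TSV here
}
```

A column-based check like this would also skip any other library that is missing annotations, whatever the reason.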

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Next up we will actually look at the results.

Results

What is the name of your results bucket on S3?

This will get saved to s3://researcher-211125375652-us-east-2/cell-type-consensus once I generate the file for all samples.

What types of results does your code produce (e.g., table, figure)?

A TSV file with all cells in ScPCA and all cell type annotations (SingleR, CellAssign, and consensus).

What is your summary of the results?

That is coming up next!

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I was able to run this locally on a subset of samples.

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jashapiro

This looks good overall, with a few caveats:

The first is the use of the here package, which could cause trouble in the future (if you made an R project for this module, which I actually recommend doing).
The other concern I have is about the addition of label.fine; I seem to recall that there were cases where there was not a 1:1 mapping from ontology to the fine label, which may result in duplicate rows (by cell id) in the output table.
A few other minor comments.

jashapiro · 2025-01-08T20:33:34Z

analyses/cell-type-consensus/README.md


-See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module. 
+The `assign-consensus-celltypes.sh` workflow can be used to assign a consensus cell type for all samples in ScPCA. 
+This workflow outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples). 


Just on wording, I don't know if we want to call this a workflow, as I keep thinking that will mean something more like Nextflow. I think you can just call it a script and describe the whole thing as the module output.

analyses/cell-type-consensus/assign-consensus-celltypes.sh

analyses/cell-type-consensus/scripts/03-save-coldata.R

jashapiro · 2025-01-08T21:46:59Z

analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R

+# consensus_annotation: human readable name associated with the consensus label 
+# consensus_ontology: CL ontology term for the consensus cell type 
+
+project_root <- here::here()


I don't particularly like the use of here::here() in this spot: if an R project file gets added to the module, this will start to fail. It is generally better to use rprojroot and look explicitly for the git directory. Alternatively, we could make it look for .github/, to prevent issues if the repo is checked out by the API (which would not include the .git directory).

so either

project_root <- rprojroot::find_root(rprojroot::is_git_root)

or maybe safer:

project_root <- rprojroot::find_root(rprojroot::has_dir(".github"))

Just noting that this fails in CI with the following error:

Error: No root directory found in /__w/OpenScPCA-analysis/OpenScPCA-analysis/analyses/cell-type-consensus or its parent directories. Root criterion: one of
- contains a directory ".git"
- contains a file ".git" with contents matching "^gitdir: "
Execution halted

see https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441

I'm looking into it, but posting it here in case you have any obvious solutions.

Yeah, this was the issue I was referring to with checking out by the API, which is what will happen with docker images that don't have git installed. (See https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441#step:4:26).

You could add a step to install git, but maybe try the rprojroot::find_root(rprojroot::has_dir(".github")) option?
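
The difference between the two criteria can be illustrated with a throwaway directory tree. This is only a sketch: the `example-repo` and `example-module` directory names are made up, and it assumes the R temporary directory is not itself inside a git repository.

```r
library(rprojroot)

# Build a throwaway directory tree that mimics an API checkout:
# a .github/ directory at the top level and no .git directory
repo_root <- file.path(tempdir(), "example-repo")
module_dir <- file.path(repo_root, "analyses", "example-module")
dir.create(file.path(repo_root, ".github"), recursive = TRUE, showWarnings = FALSE)
dir.create(module_dir, recursive = TRUE, showWarnings = FALSE)

# Search upward from the module directory for a directory containing .github/
found_root <- find_root(has_dir(".github"), path = module_dir)

# is_git_root is expected to error in this tree because there is no .git entry
# (assuming tempdir() is not inside a git repository), so catch the error
git_root <- tryCatch(
  find_root(is_git_root, path = module_dir),
  error = function(e) NA_character_
)
```

In this tree `found_root` points at the top-level directory, while the git-based lookup falls through to the error handler.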

jashapiro · 2025-01-09T15:29:41Z

analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R

+  blueprint_ontology = blueprint_ref$label.ont,
+  blueprint_annotation_fine = blueprint_ref$label.fine


Are there any cases where the ontology maps to more than one fine label? I feel like this is the case, which would result in duplicate rows when joining.

No. The ontology and fine label are 1:1. I checked this by making sure that the number of unique values in each column of this data frame is the same and equal to the number of rows in the data frame. Also from the celldex vignette:

Typically, each reference provides three levels of cell type annotation in its column metadata:

label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

This tells me that the label.fine values are the names associated with the ontology IDs in the Cell Ontology. I checked a few of these manually and they match up.

I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.

In the case of Blueprint, it turns out that the mapping is 1:1, but that is not always the case; the HumanPrimaryCellAtlas data has more fine labels than ontology terms.

> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66

So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

> So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

Perhaps the better approach here is to map the ontology IDs directly to their associated names ourselves and remove the blueprint_annotation_fine column altogether. I think I'll go ahead and do this, but in the script that creates the consensus reference to begin with and not here. I'll file a new issue and a new PR to address this before we actually run this on all ScPCA samples.
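
For anyone who wants to repeat the 1:1 check discussed in this thread on another reference, a minimal base R sketch might look like the following. The table is a toy stand-in with placeholder labels and IDs, not real celldex values; only the `label.ont` and `label.fine` column names come from the discussion.

```r
# Toy reference table; labels and IDs are placeholders, not real celldex values
ref_df <- data.frame(
  label.ont = c("ONT:1", "ONT:2", "ONT:2"),
  label.fine = c("type A", "type B", "type B, activated"),
  stringsAsFactors = FALSE
)

# Reduce to the distinct ontology/fine-label pairs
pairs <- unique(ref_df[, c("label.ont", "label.fine")])

# The mapping is 1:1 only if neither column repeats among the distinct pairs
is_one_to_one <- !anyDuplicated(pairs$label.ont) && !anyDuplicated(pairs$label.fine)

# Ontology IDs that map to more than one fine label would duplicate rows in a join
multi_mapped <- unique(pairs$label.ont[duplicated(pairs$label.ont)])
```

Listing the multi-mapped IDs, rather than only comparing counts, shows exactly which terms would produce duplicate rows.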

jashapiro · 2025-01-09T15:35:33Z

analyses/cell-type-consensus/scripts/04-combine-celltype-tables.R

+  make_option(
+    opt_str = c("--output_file"),
+    type = "character",
+    help = "Path to file where combined TSV file will be saved, must end in `.tsv`"
+  )


I assume the output file here is large: do we want to allow compressed file output? You don't have code to enforce the .tsv ending, and write_tsv supports compression just by specifying it in the file name (.tsv.gz, etc.), so just updating the help text would be sufficient here.

allyhawkins · 2025-01-09T16:30:22Z

@jashapiro thanks for the feedback here. I made the following changes based on your comments:

I updated the wording in the README to use "script" instead of "workflow".
I removed any here::here() and use rprojroot. We also do have an R project for this module already.
The module_dir variable has been removed from the shell script and I checked that everything still runs as expected (at least locally).
I modified the script that saves the combined files to check the file extension and allowed for either .tsv or .tsv.gz. I also updated the path to this file in the shell script to save a compressed version of the file.
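
A file extension check along these lines could look like the sketch below. It is an illustration only, not the code added in this PR; the `is_valid_tsv_path()` helper and the example path are hypothetical.

```r
# Hypothetical helper: TRUE when the path ends in .tsv or .tsv.gz
is_valid_tsv_path <- function(path) {
  grepl("\\.tsv(\\.gz)?$", path)
}

# Hypothetical output path used only for this example
output_file <- "results/example-consensus-celltypes.tsv.gz"

# Fail early with a clear message, before any tables are read
stopifnot(
  "output file must end in .tsv or .tsv.gz" = is_valid_tsv_path(output_file)
)
```

Anchoring the pattern with `$` rejects other extensions such as `.tsv.bz2`, which keeps the accepted set to the two endings mentioned above.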

The other change that I had to make was to modify some of the code in the 03-save-coldata.R script to accommodate multiplexed libraries. There I am combining the sample IDs rather than trying to choose a sample ID from the demultiplexing results. I probably don't even need the sample IDs in the final TSV file, but they might come in handy in the future, so I want to keep that column. The other thing I had to do was account for metadata(sample_type) being a list.

This should be ready for another round of review.

jashapiro

Looks good, with a few little comments.

analyses/cell-type-consensus/assign-consensus-celltypes.sh

jashapiro · 2025-01-09T16:56:53Z

analyses/cell-type-consensus/scripts/03-save-coldata.R

+sample_id <- metadata(sce)$sample_id |> 
+  paste0(collapse = ";")


This seems like the correct solution (for now at least).
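
To make the behavior of the snippet above concrete, here is a small sketch using a plain list in place of `metadata(sce)`. The library ID, sample IDs, and sample type values are made up for the example, and treating a library as a cell line only when every sample is a cell line is an assumption, not necessarily what the script does.

```r
# Toy stand-in for metadata(sce) from a multiplexed library;
# the IDs and sample types are made up for this example
sce_metadata <- list(
  library_id = "LIB01",
  sample_id = c("SAMPLE01", "SAMPLE02"),
  sample_type = list(SAMPLE01 = "patient tissue", SAMPLE02 = "cell line")
)

# Collapse all sample IDs into one semicolon-separated string
sample_id <- sce_metadata$sample_id |>
  paste0(collapse = ";")

# unlist() flattens sample_type whether it arrives as a list or a character vector
sample_types <- unlist(sce_metadata$sample_type)

# Assumption: skip the library only when every sample in it is a cell line
is_cell_line <- all(sample_types == "cell line")
```

For a library with a single sample, `paste0(collapse = ";")` returns the ID unchanged, so the same code path covers both cases.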

allyhawkins added 9 commits January 7, 2025 15:37

script to save coldata

79a3e51

keep original panglao name when creating reference

84094b5

shell script to assign cell types for all scpca projects

73d9498

account for cell line samples when saving tsv files

2a2f514

script to combine tsvs and add consensus

82632b7

update lock file

f3b6f78

add documentation

7965691

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/as…

e36f4c7

…sign-scpca-consensus

add to ci

3f9183d

allyhawkins requested a review from jaclyn-taroni as a code owner January 8, 2025 20:16

allyhawkins requested review from jashapiro and removed request for jaclyn-taroni January 8, 2025 20:16

allyhawkins changed the title ~~Allyhawkins/assign scpca consensus~~ Workflow to assign consensus cell types to ScPCA samples Jan 8, 2025

allyhawkins mentioned this pull request Jan 8, 2025

Add SingleCellExperiment to renv for consensus cell typing #978

Merged

jashapiro reviewed Jan 9, 2025

View reviewed changes

allyhawkins added 6 commits January 9, 2025 09:53

Merge branch 'main' into allyhawkins/assign-scpca-consensus

39b162d

change workflow -> script

2bbacc3

remove here::here

6f4f1a0

account for multiplexed libraries

6b7c342

remove module_dir variable

939103f

compress output

bcb16a5

allyhawkins requested a review from jashapiro January 9, 2025 16:30

jashapiro approved these changes Jan 9, 2025

View reviewed changes

This was referenced Jan 9, 2025

Use names from cell type ontology (CL) instead of blueprint names in consensus reference #979

Closed

Run the script to generate consensus cell types in ScPCA samples and save the output #980

Open

allyhawkins and others added 2 commits January 9, 2025 12:27

Make file search more specific

965e445

Co-authored-by: Joshua Shapiro <[email protected]>

try different rprojroot

d7b93ed

allyhawkins merged commit a18421e into AlexsLemonade:main Jan 10, 2025
3 checks passed

allyhawkins deleted the allyhawkins/assign-scpca-consensus branch January 10, 2025 16:58
