Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow to assign consensus cell types to ScPCA samples #977

Merged

Conversation

allyhawkins
Copy link
Member

@allyhawkins allyhawkins commented Jan 8, 2025

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #969

What is the goal of this pull request?

Here I'm adding a workflow to assign the consensus cell type labels for all cells in ScPCA. This produces a single TSV file that contains all cells in all samples from ScPCA (except cell line samples) and all cell type annotations. Included in the table are the ids (project, sample, library), the original annotations from scpca-nf, the ontology id and name for the panglao term, and new columns for the consensus labels. Any combination that is present in consensus-cell-type-reference.tsv is assigned and then all other combinations get assigned to "Unknown" as the consensus_annotation and NA for the consensus_ontology.

Briefly describe the general approach you took to achieve this goal.

I wrote two short scripts to do this:

  • The first script reads in a single SCE object, grabs the cell type annotations from the colData, and saves it to a single TSV file.
  • The second script reads in a directory with all the individual TSV files from step 1 and combines those into a single data frame. This script also takes in the consensus reference file and adds in the consensus labels for any combinations in the reference. All other are "Unknown"

These two steps are combined into a single workflow that can be run on all samples in ScPCA.

A few things to note:

  • When making the final TSV file, I am also incorporating the blueprint_annotation_fine column, since that's the name that's directly linked to the ontology term in SingleR.
  • I also added in the columns with the assigned ontology and name for the original Panglao terms.
  • Since any samples with cell lines won't have columns for cell type annotations, I skip those. Are there any other scenarios I should account for?
  • I have not yet run this on the entire ScPCA dataset, but I tested it with a small set of samples. I also added the workflow to GHA here so it should run on all of the test data. Once the code is reviewed I will initiate the run to actually generate the TSV file and save it to S3.
  • Along those same lines, there are some small renv changes that are needed and since the GHA for this module runs with the Docker image, I'll file a separate PR to update that first. Until that gets in, I expect CI to fail.
  • I also updated the README to describe how to run the workflow.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Next up we will actually look at the results.

Results

What is the name of your results bucket on S3?

This will get saved to s3://researcher-211125375652-us-east-2/cell-type-consensus once I generate the file for all samples.

What types of results does your code produce (e.g., table, figure)?

A TSV file with all cells in ScPCA and all cell type annotations (SingleR, CellAssign, and consensus).

What is your summary of the results?

That is coming up next!

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I was able to run this locally on a subset of samples.

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins allyhawkins requested review from jashapiro and removed request for jaclyn-taroni January 8, 2025 20:16
@allyhawkins allyhawkins changed the title Allyhawkins/assign scpca consensus Workflow to assign consensus cell types to ScPCA samples Jan 8, 2025
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall, with a few caveats:

  • The first is the use of here, which could cause trouble in the future (if you made an R project for this module, which I actually recommend doing)
  • The other concern I have is about the addition of label.fine; I seem to recall that there were cases where there was not a 1:1 mapping from ontology to the fine label, which may result in duplicate rows (by cell id) in the output table.
  • A few other minor comments.


See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module.
The `assign-consensus-celltypes.sh` workflow can be used to assign a consensus cell type for all samples in ScPCA.
This workflow outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just on wording, I don't know if we want to call this a workflow, as I keep thinking that will mean something more like Nextflow. I think you can just call it a script and describe the whole thing as the module output.

analyses/cell-type-consensus/assign-consensus-celltypes.sh Outdated Show resolved Hide resolved
analyses/cell-type-consensus/scripts/03-save-coldata.R Outdated Show resolved Hide resolved
# consensus_annotation: human readable name associated with the consensus label
# consensus_ontology: CL ontology term for the consensus cell type

project_root <- here::here()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like this here, as if an Rproject file gets added to the module, this will start to fail. It is generally better to use rprojroot and look explicitly for the git directory, or we could make it look for .github/, to prevent issues if the repo is checked out by the api (which would not include the .git directory).

so either

project_root <- rprojroot::find_root(rprojroot::is_git_root)

or maybe safer:

project_root <- rprojroot::find_root(rprojroot::has_dir(".github"))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that this fails in CI with the following error:

Error: No root directory found in /__w/OpenScPCA-analysis/OpenScPCA-analysis/analyses/cell-type-consensus or its parent directories. Root criterion: one of
- contains a directory ".git"
- contains a file ".git" with contents matching "^gitdir: "
Execution halted

see https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441

I'm looking into it, but posting it here in case you have any obvious solutions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was the issue I was referring to with checking out by the API, which is what will happen with docker images that don't have git installed. (See https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441#step:4:26).

You could add a step to install git, but maybe try the rprojroot::find_root(rprojroot::has_dir(".github")) option?

Comment on lines +77 to +78
blueprint_ontology = blueprint_ref$label.ont,
blueprint_annotation_fine = blueprint_ref$label.fine
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any cases where the ontology maps to more than one fine label? I feel like this is the case, which would result in duplicate rows when joining.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The ontology and fine label are 1:1. I checked this by making sure the length of the unique values in this data frame are equivalent and equal to the number of rows in the dataframe. Also from the celldex vignette:

Typically, each reference provides three levels of cell type annotation in its column metadata:

label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

This tells me that the label.fine are the names associated with the ontology IDs in the cell type ontology. I checked a few of these manually and they match up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.

In the case of Blueprint, it turns out that this is true, but it is not always; the HumanPrimaryCellAtlas data has more fine labels than ontologies.

> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66

So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

Perhaps the better approach here is to directly map the ontology ids to the name associated with them ourselves and remove the blueprint_annotation_fine column all together. I think I'll go ahead and do this but in the script that creates the consensus reference to begin with and not here. I'll file a new issue and new PR to address this before we actually run this on all ScPCA samples.

Comment on lines 34 to 38
make_option(
opt_str = c("--output_file"),
type = "character",
help = "Path to file where combined TSV file will be saved, must end in `.tsv`"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the output file here is large: do we want to allow compressed file output? You don't have code to enforce the .tsv ending, and write_tsv supports this just by specifying the compression in file name (.tsv.gz, etc.) so just updating the help text would be sufficient here.

@allyhawkins
Copy link
Member Author

@jashapiro thanks for the feedback here. I made the following changes based on your comments:

  • I updated the wording in the README to use "script" instead of "workflow".
  • I removed any here::here() and use rprojroot. We also do have an R project for this module already.
  • The module_dir variable has been removed from the shell script and I checked that everything still runs as expected (at least locally).
  • I modified the script that saves the combined files to check the file extension and allowed for either .tsv or .tsv.gz. I also updated the path to this file in the shell script to save a compressed version of the file.

The other change that I had to make was to modify some of the code for the 03-save-coldata.R script to accommodate multiplexed libraries. There I am combining the sample ids rather than worrying about choosing a sample id from the demultiplexing results. I probably don't even need the sample ids in the final TSV file, but I felt like it might come in handy in the future so I want to keep that column. The other thing I had to do was account for metadata(sample_type) to be a list.

This should be ready for another round of review.

@allyhawkins allyhawkins requested a review from jashapiro January 9, 2025 16:30
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, with a few little comments.

analyses/cell-type-consensus/assign-consensus-celltypes.sh Outdated Show resolved Hide resolved
Comment on lines +44 to +45
sample_id <- metadata(sce)$sample_id |>
paste0(collapse = ";")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the correct solution (for now at least).

Comment on lines +77 to +78
blueprint_ontology = blueprint_ref$label.ont,
blueprint_annotation_fine = blueprint_ref$label.fine
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.

In the case of Blueprint, it turns out that this is true, but it is not always; the HumanPrimaryCellAtlas data has more fine labels than ontologies.

> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66

So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?

@allyhawkins allyhawkins merged commit a18421e into AlexsLemonade:main Jan 10, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/assign-scpca-consensus branch January 10, 2025 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a table with consensus cell types across all ScPCA samples
2 participants