-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflow to assign consensus cell types to ScPCA samples #977
Workflow to assign consensus cell types to ScPCA samples #977
Conversation
…sign-scpca-consensus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall, with a few caveats:
- The first is the use of
here
, which could cause trouble in the future (if you made an R project for this module, which I actually recommend doing) - The other concern I have is about the addition of
label.fine
; I seem to recall that there were cases where there was not a 1:1 mapping from ontology to the fine label, which may result in duplicate rows (by cell id) in the output table. - A few other minor comments.
|
||
See the [`scripts/README.md`](./scripts/README.md) for instructions on running the scripts in this module. | ||
The `assign-consensus-celltypes.sh` workflow can be used to assign a consensus cell type for all samples in ScPCA. | ||
This workflow outputs a single TSV file with cell type annotations for all cells in ScPCA (excluding cell line samples). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just on wording, I don't know if we want to call this a workflow, as I keep thinking that will mean something more like Nextflow. I think you can just call it a script and describe the whole thing as the module output.
# consensus_annotation: human readable name associated with the consensus label | ||
# consensus_ontology: CL ontology term for the consensus cell type | ||
|
||
project_root <- here::here() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't particularly like this here, as if an Rproject file gets added to the module, this will start to fail. It is generally better to use rprojroot
and look explicitly for the git directory, or we could make it look for .github/
, to prevent issues if the repo is checked out by the api (which would not include the .git
directory).
so either
project_root <- rprojroot::find_root(rprojroot::is_git_root)
or maybe safer:
project_root <- rprojroot::find_root(rprojroot::has_dir(".github"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting that this fails in CI with the following error:
Error: No root directory found in /__w/OpenScPCA-analysis/OpenScPCA-analysis/analyses/cell-type-consensus or its parent directories. Root criterion: one of
- contains a directory ".git"
- contains a file ".git" with contents matching "^gitdir: "
Execution halted
see https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441
I'm looking into it, but posting it here in case you have any obvious solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was the issue I was referring to with checking out by the API, which is what will happen with docker images that don't have git
installed. (See https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12694163418/job/35383339441#step:4:26).
You could add a step to install git
, but maybe try the rprojroot::find_root(rprojroot::has_dir(".github"))
option?
blueprint_ontology = blueprint_ref$label.ont, | ||
blueprint_annotation_fine = blueprint_ref$label.fine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any cases where the ontology maps to more than one fine label? I feel like this is the case, which would result in duplicate rows when joining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The ontology and fine label are 1:1. I checked this by making sure the length of the unique values in this data frame are equivalent and equal to the number of rows in the dataframe. Also from the celldex vignette:
Typically, each reference provides three levels of cell type annotation in its column metadata:
label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.
This tells me that the label.fine
are the names associated with the ontology IDs in the cell type ontology. I checked a few of these manually and they match up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.
In the case of Blueprint, it turns out that this is true, but it is not always; the HumanPrimaryCellAtlas data has more fine labels than ontologies.
> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66
So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?
Perhaps the better approach here is to directly map the ontology ids to the name associated with them ourselves and remove the blueprint_annotation_fine
column all together. I think I'll go ahead and do this but in the script that creates the consensus reference to begin with and not here. I'll file a new issue and new PR to address this before we actually run this on all ScPCA samples.
make_option( | ||
opt_str = c("--output_file"), | ||
type = "character", | ||
help = "Path to file where combined TSV file will be saved, must end in `.tsv`" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the output file here is large: do we want to allow compressed file output? You don't have code to enforce the .tsv
ending, and write_tsv
supports this just by specifying the compression in file name (.tsv.gz
, etc.) so just updating the help text would be sufficient here.
@jashapiro thanks for the feedback here. I made the following changes based on your comments:
The other change that I had to make was to modify some of the code for the This should be ready for another round of review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with a few little comments.
sample_id <- metadata(sce)$sample_id |> | ||
paste0(collapse = ";") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the correct solution (for now at least).
blueprint_ontology = blueprint_ref$label.ont, | ||
blueprint_annotation_fine = blueprint_ref$label.fine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the last sentence a bit differently: they map to Cell Ontology but that does not mean that every fine label has a distinct ontology label.
In the case of Blueprint, it turns out that this is true, but it is not always; the HumanPrimaryCellAtlas data has more fine labels than ontologies.
> hpc_ref <- celldex::HumanPrimaryCellAtlasData()
> length(unique(hpc_ref$label.fine))
[1] 157
> length(unique(hpc_ref$label.ont))
[1] 66
So I guess this is fine, but I guess I'm not sure what the value of it is? If we are just getting an alternative to the ontology label with a different naming convention, do we need it at all?
Co-authored-by: Joshua Shapiro <[email protected]>
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #969
What is the goal of this pull request?
Here I'm adding a workflow to assign the consensus cell type labels for all cells in ScPCA. This produces a single TSV file that contains all cells in all samples from ScPCA (except cell line samples) and all cell type annotations. Included in the table are the ids (project, sample, library), the original annotations from
scpca-nf
, the ontology id and name for the panglao term, and new columns for the consensus labels. Any combination that is present inconsensus-cell-type-reference.tsv
is assigned and then all other combinations get assigned to "Unknown" as theconsensus_annotation
andNA
for theconsensus_ontology
.Briefly describe the general approach you took to achieve this goal.
I wrote two short scripts to do this:
colData
, and saves it to a single TSV file.These two steps are combined into a single workflow that can be run on all samples in ScPCA.
A few things to note:
blueprint_annotation_fine
column, since that's the name that's directly linked to the ontology term inSingleR
.renv
changes that are needed and since the GHA for this module runs with the Docker image, I'll file a separate PR to update that first. Until that gets in, I expect CI to fail.If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes! Next up we will actually look at the results.
Results
What is the name of your results bucket on S3?
This will get saved to
s3://researcher-211125375652-us-east-2/cell-type-consensus
once I generate the file for all samples.What types of results does your code produce (e.g., table, figure)?
A TSV file with all cells in ScPCA and all cell type annotations (SingleR, CellAssign, and consensus).
What is your summary of the results?
That is coming up next!
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
I was able to run this locally on a subset of samples.
Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.