Build reference for consensus cell type labels #973

allyhawkins
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #951

What is the goal of this pull request?

Here I'm adding a table to the references folder that contains all possible consensus cell type labels. Ultimately this table can be used to assign a consensus label for all cells in ScPCA samples based on the combination of labels from SingleR/Blueprint and CellAssign/PanglaoDB. In this table, each row is a unique combination of cell types from Panglao and Blueprint, and there is a column for the consensus label that corresponds to the LCA for that set of labels.

Note that I am only including the combinations that result in a consensus label that is NOT unknown. There are more than 7000 unique combinations, and only 301 of those result in a label based on the rules we have put in place. I originally made a table with all combinations and set everything that wasn't assigned to "Unknown", but that table can't be stored in this repo because of the pre-commit file size limits. Let me know if we do want a table with every possible combination, even the unknowns. If that's the case, we will have to figure out where to store it (probably on S3 in the results bucket).
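
As a rough sketch of how a reference that stores only the assigned combinations could still cover every cell, a left join followed by filling the gaps with "Unknown" would handle the combinations that were left out. The column names and the `cell_labels` data frame are assumptions for illustration; the three reference rows are taken from the table added in this PR.

```r
# Toy slice of the consensus reference
# (column names are assumed, not necessarily the file's actual headers)
consensus_ref <- data.frame(
  panglao_ontology = c("CL:0002038", "CL:0000893", "CL:0000798"),
  blueprint_ontology = c("CL:0000624", "CL:0000624", "CL:0000624"),
  consensus_annotation = c("CD4-positive, alpha-beta T cell", "T cell", "T cell")
)

# Hypothetical per-cell labels; the last pair is not in the reference
cell_labels <- data.frame(
  barcode = c("cell_1", "cell_2", "cell_3"),
  panglao_ontology = c("CL:0002038", "CL:0000798", "CL:0000540"),
  blueprint_ontology = c("CL:0000624", "CL:0000624", "CL:0000624")
)

# Left join keeps every cell; combinations missing from the reference become NA
assigned <- merge(
  cell_labels, consensus_ref,
  by = c("panglao_ontology", "blueprint_ontology"),
  all.x = TRUE
)

# Treat any combination without a stored consensus label as "Unknown"
assigned$consensus_annotation[is.na(assigned$consensus_annotation)] <- "Unknown"

# merge() reorders rows by the join keys, so restore the barcode order
assigned <- assigned[order(assigned$barcode), ]
```

With this approach the stored table stays small, and the "Unknown" rows never need to be written to disk.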

Briefly describe the general approach you took to achieve this goal.

  • I wrote a script that programmatically assigns the consensus labels based on the rules we set in place in Create reference for consensus cell type labels #951.

    • If more than 1 LCA is found, no consensus label is assigned with the exception of the hematopoietic precursor cell.
    • If the LCA has > 170 descendants, no consensus label is assigned with the exception of neuron and epithelial cell when Blueprint is Epithelial cells.
    • bone cell, lining cell, blood cell, progenitor cell, and supporting cell are all removed as possible consensus labels.
  • The script then saves a table with all combinations for which a consensus label was identified. This table includes columns for the panglao ontology/annotation, blueprint ontology/annotation, and consensus ontology/annotation. Again, I did not include every "Unknown" combination.

  • I updated documentation throughout. I mostly did this to document the rules we are implementing in defining the consensus labels and the process that we used to actually create this reference. The main README is still not fully complete, but I imagine that will get filled in as we work on actually assigning the consensus labels.
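
The rules in the list above could be sketched roughly as the function below. This is not the script from this PR: `pick_consensus()` and its arguments are hypothetical, the descendant counts would come from the ontology in practice, and two readings are assumed here: that hematopoietic precursor cell is used whenever it is among multiple LCAs, and that the Blueprint condition applies to epithelial cell only.

```r
# Labels that are never used as a consensus label (from the rules above)
excluded_labels <- c(
  "bone cell", "lining cell", "blood cell", "progenitor cell", "supporting cell"
)

# lca_names: names of all LCA terms found for one Panglao/Blueprint pair
# n_descendants: number of descendants of each LCA, in the same order
# blueprint_annotation: the Blueprint label for the pair
pick_consensus <- function(lca_names, n_descendants, blueprint_annotation,
                           max_descendants = 170) {
  if (length(lca_names) == 0) {
    return("Unknown")
  }

  # More than one LCA: only hematopoietic precursor cell is allowed (assumed reading)
  if (length(lca_names) > 1) {
    if ("hematopoietic precursor cell" %in% lca_names) {
      return("hematopoietic precursor cell")
    }
    return("Unknown")
  }

  # Exactly one LCA from here on
  if (lca_names %in% excluded_labels) {
    return("Unknown")
  }

  # Very broad terms are dropped, apart from the two named exceptions
  if (n_descendants > max_descendants) {
    allowed_broad <- lca_names == "neuron" ||
      (lca_names == "epithelial cell" && blueprint_annotation == "Epithelial cells")
    if (!allowed_broad) {
      return("Unknown")
    }
  }

  lca_names
}

# Usage with made-up descendant counts
pick_consensus("T cell", 50, "CD4+ T-cells")
pick_consensus("epithelial cell", 500, "Keratinocytes")
```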

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Time to assign labels next.

Provide directions for reviewers

Is there anything that you want to discuss further?

Here's the final list of cell type annotations that are used for consensus labels for reference:

> unique(consensus_labels_df$consensus_annotation)
 [1] "myeloid leukocyte"                   "granulocyte"                         "neutrophil"                         
 [4] "mononuclear phagocyte"               "monocyte"                            "hematopoietic precursor cell"       
 [7] "T cell"                              "CD4-positive, alpha-beta T cell"     "mature alpha-beta T cell"           
[10] "mature T cell"                       "regulatory T cell"                   "memory T cell"                      
[13] "natural killer cell"                 "innate lymphoid cell"                "B cell"                             
[16] "lymphocyte of B lineage"             "mature B cell"                       "naive B cell"                       
[19] "memory B cell"                       "somatic stem cell"                   "stem cell"                          
[22] "hematopoietic stem cell"             "macrophage"                          "erythroid lineage cell"             
[25] "megakaryocyte"                       "endothelial cell"                    "dendritic cell"                     
[28] "eosinophil"                          "plasma cell"                         "chondrocyte"                        
[31] "stromal cell"                        "extracellular matrix secreting cell" "fibroblast"                         
[34] "smooth muscle cell"                  "muscle cell"                         "epithelial cell"                    
[37] "melanocyte"                          "cell of skeletal muscle"             "ecto-epithelial cell"               
[40] "keratinocyte"                        "squamous epithelial cell"            "epidermal cell"                     
[43] "blood vessel endothelial cell"       "microvascular endothelial cell"      "adipocyte"                          
[46] "pericyte"                            "astrocyte"                           "glial cell"                         
[49] "macroglial cell"                     "neuron associated cell"              "mesangial cell"     

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

CL:0002038  T follicular helper cell  CL:0000624  CD4+ T-cells  CD4+ T-cells  CL:0000624  CD4-positive, alpha-beta T cell
CL:0000893  thymocyte                 CL:0000624  CD4+ T-cells  CD4+ T-cells  CL:0000084  T cell
CL:0000798  gamma-delta T cell        CL:0000624  CD4+ T-cells  CD4+ T-cells  CL:0000084  T cell
CL:0000814  NK lymphocyte             CL:0000624  CD4+ T-cells  CD4+ T-cells  CL:0000791  mature alpha-beta T cell
Member

I have a second here between meetings, so I will return this because it impacts more than one line.

I almost certainly missed this in an earlier review, but this seems like the wrong term for "NK lymphocyte." This seems like a better fit to me: http://purl.obolibrary.org/obo/CL_0000623

Member Author

@allyhawkins allyhawkins Jan 6, 2025

Ah it looks like the ID is correct, but the name for the ID is wrong in the original file where we assigned IDs. The term from Panglao is "Natural killer T cells" so that should be assigned to CL:0000814, but the name should be "mature NK T cell", not "NK lymphocytes". I'll fix that here.
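
One way to catch this kind of mismatch is to compare each recorded name against the name the ontology gives for its ID. The sketch below uses a small hand-written lookup and a hypothetical slice of the ID assignments; in practice the lookup could come from the `name` element of the object returned by `ontologyIndex::get_ontology()`. The column names are assumptions.

```r
# ID-to-name pairs taken from this thread and the table above
cl_names <- c(
  "CL:0000814" = "mature NK T cell",
  "CL:0000084" = "T cell",
  "CL:0000624" = "CD4-positive, alpha-beta T cell"
)

# Hypothetical slice of the hand-curated Panglao ID assignments
panglao_ids <- data.frame(
  panglao_annotation = c("Natural killer T cells", "T cells"),
  ontology_id = c("CL:0000814", "CL:0000084"),
  recorded_name = c("NK lymphocyte", "T cell")
)

# Look up the expected name for each ID and keep rows where the two disagree
expected_name <- unname(cl_names[panglao_ids$ontology_id])
mismatched <- panglao_ids[panglao_ids$recorded_name != expected_name, ]
```

Here `mismatched` holds only the `CL:0000814` row, where the recorded name differs from the ontology's name.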

Member

@jaclyn-taroni jaclyn-taroni left a comment

This looks good! I went through the resulting consensus labels, and they seem sound. We should find out if there's a way to get versioned OBO files before we wrap this up entirely.

# Prep references --------------------------------------------------

# grab obo file
cl_ont <- ontologyIndex::get_ontology("http://purl.obolibrary.org/obo/cl-basic.obo")
Member

Is there a more versioned option here? I can imagine the results would change over time if we don't lock it down more.

Member Author

Yes, I updated this to use a specific release.
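
For illustration only, pinning to a tagged release might look something like the sketch below. The URL pattern (release assets on the Cell Ontology GitHub repository), the helper `cl_release_url()`, and the example tag are all assumptions, not necessarily what was committed in this PR.

```r
# Build a download URL for a tagged Cell Ontology release.
# The URL pattern and the default file name are assumptions for illustration.
cl_release_url <- function(release_tag, file = "cl-basic.obo") {
  # Expect date-style tags such as "v2024-09-26"
  stopifnot(grepl("^v[0-9]{4}-[0-9]{2}-[0-9]{2}$", release_tag))
  sprintf(
    "https://github.com/obophenotype/cell-ontology/releases/download/%s/%s",
    release_tag, file
  )
}

# Example tag, chosen only to show the shape of the URL
obo_url <- cl_release_url("v2024-09-26")

# Loading the file requires the ontologyIndex package and network access:
# cl_ont <- ontologyIndex::get_ontology(obo_url)
```

Keeping the tag in one place makes it easy to record which release the reference table was built from.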


# get ontologies and human readable name into data frame
blueprint_df <- data.frame(
  blueprint_ontology = blueprint_ref$label.ont,
Member

Mostly a question for my understanding – when we use label.ont is that specific for/tied to label.main?

Member Author

No, it's tied to label.fine. From the celldex vignette:

Typically, each reference provides three levels of cell type annotation in its column metadata:

  • label.main, broad annotation that defines the major cell types. This has few unique levels that allows for fast annotation but at low resolution.
  • label.fine, fine-grained annotation that defines subtypes or states. This has more unique levels that results in slower annotation but at much higher resolution.
  • label.ont, fine-grained annotation mapped to the standard vocabulary in the Cell Ontology. This enables synchronization of labels across references as well as dynamic adjustment of the resolution.

I manually checked a few terms to be sure that the names match up with label.fine.
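
A check along these lines could complement the manual spot check. The data frame below is a hypothetical stand-in for the column metadata of the reference, with illustrative values that were not verified against celldex; the idea is simply to count how many distinct ontology IDs each label maps to.

```r
# Hypothetical stand-in for the reference's column metadata
ref_labels <- data.frame(
  label.main = c("CD4+ T-cells", "CD4+ T-cells", "B-cells", "B-cells"),
  label.fine = c("CD4+ T-cells", "CD4+ Tcm", "naive B-cells", "Memory B-cells"),
  label.ont = c("CL:0000624", "CL:0000904", "CL:0000788", "CL:0000787")
)

# Number of distinct ontology IDs seen for each value of a label column
n_ont_per_label <- function(labels, ont) {
  tapply(ont, labels, function(x) length(unique(x)))
}

# If label.ont follows label.fine, every fine label maps to exactly one ID,
# while a main label can span several IDs
fine_is_specific <- all(n_ont_per_label(ref_labels$label.fine, ref_labels$label.ont) == 1)
main_is_specific <- all(n_ont_per_label(ref_labels$label.main, ref_labels$label.ont) == 1)
```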

Member

@jaclyn-taroni jaclyn-taroni left a comment

👍🏻

@allyhawkins allyhawkins merged commit f82012b into AlexsLemonade:main Jan 7, 2025
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/build-consensus-reference branch January 7, 2025 18:26