Ally Hawkins

This notebook aims to identify a set of consensus labels between cell
types in the PanglaoDB and Blueprint Encode references.

Below I will calculate the total number of ancestors and the total
number of descendants for each term in the full cell type ontology and
then show the distributions for those statistics. This will give us an
idea of the range of values we expect to see when looking at the
PanglaoDB and Blueprint Encode references.

The vertical lines in the plot below indicate the values for cell
types of varying granularity.

Generally, it looks like as the cell types get more specific we see a
greater number of ancestors. However, the range of values is small, and
some cell types have the same value but probably not the same
level of granularity.

Below we will look at the total number of descendants.

It looks like most cell types have very few descendants, so let’s
zoom into the area below 500 to get a better look.

Here we see a much larger range of values, and cell types become
more general as the number of descendants goes up. However, this
distribution alone is probably not helpful in determining a cutoff. In the
next section we will look at this distribution specifically for cell
types present in our references, PanglaoDB and Blueprint Encode.

This section will look at identifying the latest common ancestor
(LCA) between all possible combinations of terms from PanglaoDB and
Blueprint Encode.

Note that it is possible to have more than one LCA for a set of
terms. To start, I will keep all LCA terms found.

For each LCA, I will again look at the total number of ancestors and
descendants and see if I can identify an appropriate cutoff. Ultimately,
I would like to see if we can use that cutoff to decide if we should
keep the LCA term as the consensus label or use “Unknown”.

Let’s zoom into the area below 1000, since we already know we would
want to exclude anything above that based on this plot.

We can use the vertical lines for cells of interest to help us define
a potential cutoff based on the granularity we would like to see in our
consensus label. We want to be able to label things like T cell, but we
don’t want to label anything as lymphocyte as that’s probably not
helpful. I don’t see any obvious cutoffs in the
total number of ancestors, but the number of descendants is likely to be
informative. I think it might be a good idea to start by drawing a line
at the local maximum between the T cell and lymphocyte lines on the
number of descendants graph.

First we will find the value for the first peak shown in the
distribution. This is likely to be a good cutoff for deciding which LCA
labels to keep.

Below is the list of all consensus cell type labels that we will be
keeping if we were to just use this cutoff.

We can also look at all the cell types we are keeping and their total
number of descendants to see if there are any that we may not want
to include because the term is too broad.

There are a few terms that I think might be more broad than we want,
and one could also argue for removing a few others.

Below are tables that look specifically at the combinations of cell
type annotations that resulted in some of the terms that I might
consider removing.

I think I’m in favor of not having a “blood cell” label, since I’m
not sure that it’s helpful. Also, if two different methods label
something a platelet and a neutrophil, then perhaps that label is
inaccurate and it’s really a tumor cell.

I think I would also remove bone cell, since hematopoietic stem cells
and osteoclasts seem pretty different to me.

I’m torn on myeloid leukocyte, because I do think it’s helpful to know if
something is of the myeloid lineage, but if we aren’t keeping lymphocyte
then I would argue we shouldn’t keep myeloid leukocyte.

We can also look at what cell type labels we are excluding when using
this cutoff to see if there are any terms we might actually want to
keep instead.

The only term in this list that I would be concerned about losing is
“neuron”. Let’s look at those combinations.

It looks like there are a lot of types of neurons in the PanglaoDB
reference and only “neuron” as a term in Blueprint. Even though neuron
has ~500 descendants, I think we should keep these labels.

One thing I noticed when looking at the labels that fall below
the cutoff is that most of them are from scenarios where we have
multiple LCAs. Maybe in the case where we have multiple LCAs we are
already too broad and we should just eliminate those matches from the
beginning. Here I’m looking at the total number of descendants for all
terms that show up because a pair of terms has multiple LCAs.

It looks like most of these terms are pretty broad and are either
much higher than the cutoff or right around the cutoff, with a few
exceptions. Things like “bone cell” and “supporting cell” have few
descendants, but I would still argue these are very broad terms and not
useful.

I’m going to filter out any matches that show more than one LCA term first and
then use the cutoff to define labels we would keep. I’ll also look to
see what cell types we lose when we add this extra filtering step to be
sure they are ones that we want to lose.

It looks like I am losing a few terms I already said were not
specific and then a few other terms, like “hematopoietic precursor cell”
and “perivascular cell”. I’ll look at both of those to confirm we would
not want them.

It looks like here we should be keeping these matches because both
references have these labels as hematopoietic stem and progenitor cells.
I think in the context of pediatric cancer having this label would be
helpful, so maybe we shouldn’t remove all terms that have 2 LCAs.

Let’s look at what the other LCA is for an example set.

An alternative approach would be to calculate the similarity
index between each set of terms and define a cutoff for which sets of
terms are similar. This is a value on a 0-1 scale where 0 indicates no
similarity and 1 indicates the terms are equal.

Although this could provide a metric that we could use to define
similar cell types, we would still have to identify the label to use,
which would most likely be the LCA. Even if the similarity index is
close to 1, if the LCA term is not informative then I don’t know that we
would want to use that.

However, we could use this to finalize the actual pairs of terms that
we trust.

Below I’ll calculate the similarity index for each set of terms and
plot the distribution. Then we will look at the values for pairs that
have an LCA that passes the total descendants threshold we set to see if
those pairs have a higher similarity index.

This looks as I expected, with most of the pairs that pass the total
descendants cutoff having a higher similarity index than those that do
not pass. There is still some overlap though, so perhaps even if a set of
terms shares an LCA that passes the threshold, the actual terms being
compared may be further apart than we would like.

Now let’s look at the similarity index for various LCA terms. Here
each LCA term is its own plot and the vertical lines are the similarity
index for each pair of terms that results in that LCA.

It looks like terms that are more granular, like T and B cell, have
higher similarity index values than terms that are less granular, which
is what we would expect. However, within terms like myeloid leukocyte
and even T cell we do see a range of values. We could dig deeper into
which pairs are resulting in which similarity index values if we wanted
to, but I think that might be a future direction if we feel like the
similarity index is something that could be useful.

Based on these findings, I think it might be best to create a
reference that has all possible pairs of labels between PanglaoDB and
Blueprint Encode and the resulting consensus label for those pairs. To
do this we could come up with a whitelist of LCA terms that we would be
comfortable including and all other cell types would be unknowns. I
would use the following criteria to come up with my whitelist:

Alternatively, rather than eliminate terms that are too broad we
could look at the similarity index for individual matches and decide on
a case by case basis if those should be allowed. Although I still think
having a term that is too broad, even if it’s a good match, is not super
+# load required packages
+library(ggplot2)
+# Set default ggplot theme
+theme_set(theme_bw())
+# The base path for the OpenScPCA repository, found by its (hidden) .git directory
+repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+# The path to this module
+ref_dir <- file.path(repository_base, "analyses", "cell-type-consensus", "references")
+# path to ref file for panglao
+panglao_file <- file.path(ref_dir, "panglao-cell-type-ontologies.tsv")
+# grab obo file
+cl_ont <- ontologyIndex::get_ontology("")
+# read in panglao file
+panglao_df <- readr::read_tsv(panglao_file) |>
+ # rename columns to have panglao in them for easy joining later
+ dplyr::select(
+ panglao_ontology = "ontology_id",
+ panglao_annotation = "human_readable_value"
+ )
+# grab singler ref from celldex
+blueprint_ref <- celldex::BlueprintEncodeData()
+# get ontologies and human readable name into data frame
+blueprint_df <- data.frame(
+ blueprint_ontology = blueprint_ref$label.ont,
+ blueprint_annotation_main = blueprint_ref$label.main,
+ blueprint_annotation_fine = blueprint_ref$label.fine
+) |>
+ unique()
Full cell ontology
+# turn cl_ont into data frame with one row per term
+cl_df <- data.frame(
+ cl_ontology = cl_ont$id,
+ cl_annotation = cl_ont$name
+) |>
+ dplyr::rowwise() |>
+ dplyr::mutate(
+ # list all ancestors and descendants of each term, then count them
+ ancestors = list(ontologyIndex::get_ancestors(cl_ont, cl_ontology)),
+ total_ancestors = length(ancestors),
+ descendants = list(ontologyIndex::get_descendants(cl_ont, cl_ontology)),
+ total_descendants = length(descendants)
+ )
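To make the two statistics concrete, here is a minimal base R sketch on a made-up six-term hierarchy. The `toy_parents` list and the `toy_ancestors()` and `toy_descendants()` helpers are hypothetical names used only for illustration; they are not part of ontologyIndex, and the toy hierarchy is a simplification rather than the real Cell Ontology. Both helpers count the term itself, which may or may not match the convention of the functions used above, so treat the absolute numbers as illustrative.

```r
# Toy hierarchy (hypothetical): each term maps to its direct parent(s)
toy_parents <- list(
  "cell" = character(0),
  "leukocyte" = "cell",
  "lymphocyte" = "leukocyte",
  "T cell" = "lymphocyte",
  "memory T cell" = "T cell",
  "B cell" = "lymphocyte"
)

# All ancestors of a term, including the term itself
toy_ancestors <- function(term, parents) {
  found <- term
  queue <- parents[[term]]
  while (length(queue) > 0) {
    found <- union(found, queue)
    queue <- setdiff(unlist(parents[queue], use.names = FALSE), found)
  }
  found
}

# All descendants of a term, including the term itself
toy_descendants <- function(term, parents) {
  all_terms <- names(parents)
  is_descendant <- vapply(
    all_terms,
    function(x) term %in% toy_ancestors(x, parents),
    logical(1)
  )
  all_terms[is_descendant]
}

# One row per term with both totals, mirroring the shape of cl_df
toy_df <- data.frame(term = names(toy_parents), stringsAsFactors = FALSE)
toy_df$total_ancestors <- vapply(
  toy_df$term,
  function(x) length(toy_ancestors(x, toy_parents)),
  integer(1),
  USE.NAMES = FALSE
)
toy_df$total_descendants <- vapply(
  toy_df$term,
  function(x) length(toy_descendants(x, toy_parents)),
  integer(1),
  USE.NAMES = FALSE
)
toy_df
```

In this toy hierarchy the number of ancestors grows by one per level, while the number of descendants drops from 6 for cell to 1 for the leaf terms, which is the same qualitative pattern described for the full ontology.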
+celltypes_of_interest <- c("eukaryotic cell", "lymphocyte", "leukocyte", "hematopoietic cell", "T cell", "endothelial cell", "smooth muscle cell", "memory T cell")
+line_df <- cl_df |>
+ dplyr::filter(cl_annotation %in% celltypes_of_interest) |>
+ dplyr::select(cl_annotation, total_descendants, total_ancestors) |>
+ unique()
+# group any labels that have the same number of ancestors
+ancestor_labels_df <- line_df |>
+ dplyr::group_by(total_ancestors) |>
+ dplyr::summarise(cl_annotation = paste(cl_annotation, collapse = ","))
+# make density plots showing distribution of ancestors and descendants
+ggplot(cl_df, aes(x = total_ancestors)) +
+ geom_density(fill = "#00274C", alpha = 0.5) +
+ geom_vline(data = ancestor_labels_df,
+ mapping = aes(xintercept = total_ancestors),
+ lty = 2) +
+ geom_text(
+ data = ancestor_labels_df,
+ mapping = aes(x = total_ancestors, y = 0.04, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ labs(
+ x = "Number of ancestors",
+ y = "Density"
+ )
+ggplot(cl_df, aes(x = total_descendants)) +
+ geom_density(fill = "#FFCB05", alpha = 0.5) +
+ geom_vline(data = line_df,
+ mapping = aes(xintercept = total_descendants),
+ lty = 2) +
+ geom_text(
+ data = line_df,
+ mapping = aes(x = total_descendants, y = 0.6, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ labs(
+ x = "Number of descendants",
+ y = "Density"
+ )
+ggplot(cl_df, aes(x = total_descendants)) +
+ geom_density(fill = "#FFCB05", alpha = 0.5) +
+ geom_vline(data = line_df,
+ mapping = aes(xintercept = total_descendants),
+ lty = 2) +
+ geom_text(
+ data = line_df,
+ mapping = aes(x = total_descendants, y = 0.6, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ labs(
+ x = "Number of descendants",
+ y = "Density"
+ ) +
+ xlim(c(0,500))
+## Warning: Removed 14 rows containing non-finite outside the scale range (`stat_density()`).
+## Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_vline()`).
+## Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_text()`).
Latest common ancestor (LCA) between PanglaoDB and Blueprint
The Blueprint Encode reference from celldex is used for assigning cell types with SingleR. The LCA
refers to the latest term in the cell ontology hierarchy that is common
between two terms. I will use the ontoProc::findCommonAncestors()
function to get the LCA for each combination.
+# first set up the graph from cl ont
+parent_terms <- cl_ont$parents
+cl_graph <- igraph::make_graph(rbind(unlist(parent_terms), rep(names(parent_terms), lengths(parent_terms))))
+# get a data frame with all combinations of panglao and blueprint terms
+# one row for each combination
+all_ref_df <- expand.grid(panglao_df$panglao_ontology,
+ blueprint_df$blueprint_ontology) |>
+ dplyr::rename(
+ panglao_ontology = "Var1",
+ blueprint_ontology = "Var2"
+ ) |>
+ # add in the human readable values for each ontology term
+ dplyr::left_join(blueprint_df, by = "blueprint_ontology") |>
+ dplyr::left_join(panglao_df, by = "panglao_ontology") |>
+ tidyr::drop_na() |>
+ dplyr::rowwise() |>
+ dplyr::mutate(
+ # latest common ancestor(s) shared by each pair of terms
+ lca = list(rownames(ontoProc::findCommonAncestors(blueprint_ontology, panglao_ontology, g = cl_graph)))
+ )
+lca_df <- all_ref_df |>
+ dplyr::mutate(
+ total_lca = length(lca), # max is three terms
+ lca = paste0(lca, collapse = ",") # make it easier to split the df
+ ) |>
+ # split each lca term into its own column
+ tidyr::separate(lca, into = c("lca_1", "lca_2", "lca_3"), sep = ",") |>
+ tidyr::pivot_longer(
+ cols = dplyr::starts_with("lca"),
+ names_to = "lca_number",
+ values_to = "lca"
+ ) |>
+ tidyr::drop_na() |>
+ dplyr::select(-lca_number) |>
+ # account for any cases where the ontology IDs are exact matches
+ # r complains about doing this earlier since the lca column holds lists until now
+ dplyr::mutate(lca = dplyr::if_else(blueprint_ontology == panglao_ontology, blueprint_ontology, lca)) |>
+ # join in information for each of the lca terms including name, number of ancestors and descendants
+ dplyr::left_join(cl_df, by = c("lca" = "cl_ontology"))
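The idea of an LCA, and why a pair of terms can have more than one, can be sketched in base R on a small made-up graph where a term has two parents. Everything here (`toy_parents`, `toy_ancestors()`, `toy_lca()`, and the placeholder terms "term A", "term B", "term C") is hypothetical and is not how `ontoProc::findCommonAncestors()` is implemented; in this sketch a term counts as its own ancestor, which is an assumption made so that identical terms return themselves.

```r
# Toy graph (hypothetical): "term A" and "term B" each have two parents
toy_parents <- list(
  "cell" = character(0),
  "precursor cell" = "cell",
  "progenitor cell" = "precursor cell",
  "hematopoietic precursor cell" = "precursor cell",
  "term A" = c("progenitor cell", "hematopoietic precursor cell"),
  "term B" = c("progenitor cell", "hematopoietic precursor cell"),
  "term C" = "progenitor cell"
)

# All ancestors of a term, including the term itself
toy_ancestors <- function(term, parents) {
  found <- term
  queue <- parents[[term]]
  while (length(queue) > 0) {
    found <- union(found, queue)
    queue <- setdiff(unlist(parents[queue], use.names = FALSE), found)
  }
  found
}

# Latest common ancestor(s): common ancestors that are not an ancestor
# of any other common ancestor
toy_lca <- function(term_a, term_b, parents) {
  common <- intersect(
    toy_ancestors(term_a, parents),
    toy_ancestors(term_b, parents)
  )
  is_latest <- vapply(common, function(candidate) {
    others <- setdiff(common, candidate)
    !any(vapply(
      others,
      function(x) candidate %in% toy_ancestors(x, parents),
      logical(1)
    ))
  }, logical(1))
  common[is_latest]
}

toy_lca("term A", "term B", toy_parents) # two LCAs
toy_lca("term A", "term C", toy_parents) # a single LCA
toy_lca("term A", "term A", toy_parents) # identical terms
```

Because "term A" and "term B" share both parents and neither parent descends from the other, both parents are returned, which is the situation that produces `total_lca > 1` in the real data.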
Distribution of ancestors and descendants
+ggplot(lca_df, aes(x = total_ancestors)) +
+ geom_density() +
+ geom_vline(data = ancestor_labels_df,
+ mapping = aes(xintercept = total_ancestors),
+ lty = 2) +
+ geom_text(
+ data = ancestor_labels_df,
+ mapping = aes(x = total_ancestors, y = 0.6, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ labs(
+ x = "Total number of ancestors",
+ y = "Density"
+ )
+ggplot(lca_df, aes(x = total_descendants)) +
+ geom_density() +
+ geom_vline(data = line_df,
+ mapping = aes(xintercept = total_descendants),
+ lty = 2) +
+ geom_text(
+ data = line_df,
+ mapping = aes(x = total_descendants, y = 0.002, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ labs(
+ x = "Total number of descendants",
+ y = "Density"
+ )
+ggplot(lca_df, aes(x = total_descendants)) +
+ geom_density() +
+ geom_vline(data = line_df,
+ mapping = aes(xintercept = total_descendants),
+ lty = 2) +
+ geom_text(
+ data = line_df,
+ mapping = aes(x = total_descendants, y = 0.002, label = cl_annotation),
+ angle = 90,
+ vjust = -0.5
+ ) +
+ xlim(c(0, 1000)) +
+ labs(
+ x = "Total number of descendants",
+ y = "Density"
+ )
Defining a cutoff for number of descendants
+peak_idx <- splus2R::peaks(lca_df$total_descendants)
+cutoff <- lca_df$total_descendants[peak_idx] |>
+ min() # find the smallest peak and use that as the cutoff for number of descendants
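As an illustration of the "first peak in the distribution" idea, here is a hedged base R sketch that finds local maxima of a kernel density estimate and takes the left-most one as a cutoff. Note the difference from the chunk above: `splus2R::peaks()` is applied there to the descendant counts directly, whereas this sketch smooths the counts first, so the two need not give the same value. The `local_maxima()` helper, the `toy_counts` data, and the bandwidth of 20 are all made up for the example.

```r
# Indices of strict local maxima in a numeric vector
local_maxima <- function(y) {
  which(diff(sign(diff(y))) == -2) + 1
}

# Hand-checkable example: maxima at positions 3 and 7
local_maxima(c(0, 1, 3, 2, 1, 4, 6, 5, 0))

# Made-up descendant counts with one cluster near 50 and one near 300
toy_counts <- c(rep(40:60, each = 5), rep(seq(280, 320, by = 2), each = 3))

# Smooth the counts, then take the left-most density peak as the cutoff
toy_density <- density(toy_counts, bw = 20)
peak_positions <- toy_density$x[local_maxima(toy_density$y)]
toy_cutoff <- min(peak_positions)
toy_cutoff
```

The result depends on the bandwidth, so with real data it is worth plotting the density and the chosen cutoff together before relying on it.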
+celltypes_to_keep <- lca_df |>
+ dplyr::filter(total_descendants <= cutoff) |>
+ dplyr::pull(cl_annotation) |>
+ unique()
+## [1] "myeloid leukocyte" "granulocyte" "neutrophil"
+## [4] "blood cell" "mononuclear phagocyte" "progenitor cell"
+## [7] "monocyte" "hematopoietic precursor cell" "T cell"
+## [10] "CD4-positive, alpha-beta T cell" "mature alpha-beta T cell" "mature T cell"
+## [13] "regulatory T cell" "memory T cell" "natural killer cell"
+## [16] "innate lymphoid cell" "B cell" "lymphocyte of B lineage"
+## [19] "mature B cell" "naive B cell" "memory B cell"
+## [22] "somatic stem cell" "stem cell" "hematopoietic stem cell"
+## [25] "bone cell" "macrophage" "erythroid lineage cell"
+## [28] "megakaryocyte" "endothelial cell" "lining cell"
+## [31] "dendritic cell" "eosinophil" "plasma cell"
+## [34] "chondrocyte" "stromal cell" "extracellular matrix secreting cell"
+## [37] "fibroblast" "smooth muscle cell" "muscle cell"
+## [40] "melanocyte" "cell of skeletal muscle" "ecto-epithelial cell"
+## [43] "keratinocyte" "squamous epithelial cell" "epidermal cell"
+## [46] "blood vessel endothelial cell" "microvascular endothelial cell" "adipocyte"
+## [49] "pericyte" "perivascular cell" "supporting cell"
+## [52] "astrocyte" "glial cell" "macroglial cell"
+## [55] "neuron associated cell" "mesangial cell"
+# pull out the cell types and total descendants for cell types to keep
+plot_celltype_df <- lca_df |>
+ dplyr::filter(cl_annotation %in% celltypes_to_keep) |>
+ dplyr::select(cl_annotation, total_descendants) |>
+ unique()
+# bar chart showing total number of descendants for each cell type
+ggplot(plot_celltype_df, aes(x = reorder(cl_annotation, total_descendants), y = total_descendants)) +
+ geom_bar(stat = "identity") +
+ theme(
+ axis.text.x = element_text(angle = 90)
+ ) +
+ labs(
+ x = "cell type",
+ y = "Total descendants"
+ )
Terms that might be more broad than we want: blood cell, bone cell, supporting cell, and lining cell. I’m on the
fence about keeping myeloid leukocyte and progenitor cell. I think if we wanted to remove those terms
we could move our cutoff to be the same number of descendants as
T cell, since we do want to keep that. One could also argue to remove stromal cell and
extracellular matrix secreting cell.
Blood cell
+print_df <- lca_df |>
+ dplyr::select(blueprint_ontology, blueprint_annotation_main, blueprint_annotation_fine, panglao_ontology, panglao_annotation, total_lca, lca, cl_annotation)
+# blood cell
+print_df |>
+ dplyr::filter(cl_annotation == "blood cell")
+blood cell
+blood cell
+blood cell
+blood cell
+blood cell
+blood cell
Bone cell
+# bone cell
+print_df |>
+ dplyr::filter(cl_annotation == "bone cell")
+bone cell
+bone cell
Myeloid leukocyte
+# myeloid leukocyte cell
+print_df |>
+ dplyr::filter(cl_annotation == "myeloid leukocyte")
+alveolar macrophage
+myeloid leukocyte
+myeloid leukocyte
+mast cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+Kupffer cell
+myeloid leukocyte
+Langerhans cell
+myeloid leukocyte
+microglial cell
+myeloid leukocyte
+myeloid suppressor cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+splenic red pulp macrophage
+myeloid leukocyte
+alveolar macrophage
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+mast cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+Kupffer cell
+myeloid leukocyte
+Langerhans cell
+myeloid leukocyte
+microglial cell
+myeloid leukocyte
+myeloid suppressor cell
+myeloid leukocyte
+splenic red pulp macrophage
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+mast cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+Langerhans cell
+myeloid leukocyte
+myeloid suppressor cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+mast cell
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+Langerhans cell
+myeloid leukocyte
+Macrophages M1
+myeloid suppressor cell
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M1
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+mast cell
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+Langerhans cell
+myeloid leukocyte
+Macrophages M2
+myeloid suppressor cell
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+Macrophages M2
+myeloid leukocyte
+alveolar macrophage
+myeloid leukocyte
+myeloid leukocyte
+mast cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+Kupffer cell
+myeloid leukocyte
+Langerhans cell
+myeloid leukocyte
+microglial cell
+myeloid leukocyte
+myeloid suppressor cell
+myeloid leukocyte
+myeloid leukocyte
+myeloid leukocyte
+splenic red pulp macrophage
+myeloid leukocyte
Progenitor cell
+# progenitor cell
+print_df |>
+ dplyr::filter(cl_annotation == "progenitor cell") |>
+ head(n=15) # there's a lot of these so let's only print out some
+progenitor cell
+hematopoietic stem cell
+progenitor cell
+progenitor cell
+club cell
+progenitor cell
+erythroid progenitor cell
+progenitor cell
+neuronal-restricted precursor
+progenitor cell
+oligodendrocyte precursor cell
+progenitor cell
+progenitor cell of endocrine pancreas
+progenitor cell
+progenitor cell
+hematopoietic stem cell
+progenitor cell
+progenitor cell
+progenitor cell
+progenitor cell
+club cell
+progenitor cell
+erythroid progenitor cell
+progenitor cell
+progenitor cell
Same with progenitor cell: I do think it could be
helpful to know that something may be a progenitor cell, but when you
have a cell with the label for HSC and the label for cells like
monocytes or osteoblasts, then maybe we are talking about a tumor cell
instead. Along those same lines, I think lining cell and supporting cell are too broad.
Lining cell
+# lining cell
+print_df |>
+ dplyr::filter(cl_annotation == "lining cell")
+Endothelial cells
+Endothelial cells
+mesothelial cell
+lining cell
+Endothelial cells
+Endothelial cells
+peritubular myoid cell
+lining cell
+Endothelial cells
+Endothelial cells
+Sertoli cell
+lining cell
+Endothelial cells
+mv Endothelial cells
+mesothelial cell
+lining cell
+Endothelial cells
+mv Endothelial cells
+peritubular myoid cell
+lining cell
+Endothelial cells
+mv Endothelial cells
+Sertoli cell
+lining cell
Supporting cell
+# supporting cell
+print_df |>
+ dplyr::filter(cl_annotation == "supporting cell")
+Sertoli cell
+supporting cell
+Mesangial cells
+Mesangial cells
+Sertoli cell
+supporting cell
Discarded cell types
+lca_df |>
+ dplyr::filter(total_descendants > cutoff) |>
+ dplyr::pull(cl_annotation) |>
+ unique()
+## [1] "leukocyte" "eukaryotic cell" "myeloid cell"
+## [4] "cell" "hematopoietic cell" "mononuclear cell"
+## [7] "stuff accumulating cell" "precursor cell" "phagocyte (sensu Vertebrata)"
+## [10] "defensive cell" "lymphocyte" "professional antigen presenting cell"
+## [13] "secretory cell" "connective tissue cell" "electrically responsive cell"
+## [16] "contractile cell" "epithelial cell" "neuron"
+## [19] "neural cell"
+# neuron
+print_df |>
+ dplyr::filter(cl_annotation == "neuron")
+adrenergic neuron
+cholinergic neuron
+chromaffin cell
+dopaminergic neuron
+enteric neuron
+glycinergic neuron
+motor neuron
+neuroendocrine cell
+noradrenergic neuron
+photoreceptor cell
+retinal ganglion cell
+serotonergic neuron
+trigeminal neuron
+Cajal-Retzius cell
+GABAergic neuron
+glutamatergic neuron
+Purkinje cell
+pyramidal neuron
Removing anything with more than 1 LCA
+lca_df |>
+ dplyr::filter(total_lca > 1) |>
+ dplyr::select(cl_annotation, total_descendants) |>
+ unique() |>
+ dplyr::arrange(total_descendants)
+bone cell
+blood cell
+perivascular cell
+stromal cell
+supporting cell
+hematopoietic precursor cell
+lining cell
+myeloid leukocyte
+progenitor cell
+mononuclear phagocyte
+phagocyte (sensu Vertebrata)
+contractile cell
+defensive cell
+professional antigen presenting cell
+connective tissue cell
+myeloid cell
+stuff accumulating cell
+precursor cell
+secretory cell
+mononuclear cell
+electrically responsive cell
+hematopoietic cell
+eukaryotic cell
+# remove any combinations with more than one lca
+filtered_lca_df <- lca_df |>
+ dplyr::filter(total_lca < 2)
+# get a list of cell types to keep based on cutoff
+updated_celltypes <- filtered_lca_df |>
+ dplyr::filter(total_descendants <= cutoff) |>
+ dplyr::pull(cl_annotation) |>
+ unique()
+# which cell types are now missing from the list to keep
+setdiff(celltypes_to_keep, updated_celltypes)
+## [1] "blood cell" "hematopoietic precursor cell" "lining cell"
+## [4] "perivascular cell" "supporting cell"
Hematopoietic precursor cell
+print_df |>
+ dplyr::filter(cl_annotation == "hematopoietic precursor cell")
+hematopoietic stem cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+hematopoietic stem cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+hematopoietic stem cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+hematopoietic stem cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+hematopoietic stem cell
+hematopoietic precursor cell
+erythroid progenitor cell
+hematopoietic precursor cell
+lca_df |>
+ dplyr::filter(panglao_ontology == "CL:0000037" & blueprint_ontology == "CL:0000050") |>
+ dplyr::select(blueprint_annotation_main, blueprint_annotation_fine, panglao_annotation, cl_annotation)
+hematopoietic stem cell
+hematopoietic precursor cell
+hematopoietic stem cell
+progenitor cell
It looks like these terms have both hematopoietic precursor cell and progenitor cell as LCAs. Personally, I would keep the term
for hematopoietic precursor cell because I think it’s more
Perivascular cell
+print_df |>
+ dplyr::filter(cl_annotation == "perivascular cell")
+vascular associated smooth muscle cell
+perivascular cell
+vascular associated smooth muscle cell
+perivascular cell
+vascular associated smooth muscle cell
+perivascular cell
+vascular associated smooth muscle cell
+perivascular cell
+Mesangial cells
+Mesangial cells
+vascular associated smooth muscle cell
+perivascular cell
+Mesangial cells
+Mesangial cells
+vascular associated smooth muscle cell
+perivascular cell
+Mesangial cells
+Mesangial cells
+vascular associated smooth muscle cell
+perivascular cell
+Mesangial cells
+Mesangial cells
+vascular associated smooth muscle cell
+perivascular cell
I would remove perivascular cell, since the cell type
labels from PanglaoDB and Blueprint are pretty different from each other.
Similarity index
For example, if the LCA for a pair is T cell, we
can look at the similarity index to confirm that specific pair of terms
has high similarity.
+information_content <- ontologySimilarity::descendants_IC(cl_ont)
+# get similarity index for each set of terms
+si_df <- lca_df |>
+ dplyr::rowwise() |>
+ dplyr::mutate(
+ similarity_index = ontologySimilarity::get_sim_grid(ontology = cl_ont,
+ term_sets = list(panglao_ontology, blueprint_ontology)) |>
+ ontologySimilarity::get_sim()
+ )
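For intuition about what a 0-1 similarity index between two ontology terms can look like, here is a base R sketch of a Lin-style similarity on a toy hierarchy, where information content is derived from the number of descendants. I believe this resembles the default behavior of ontologySimilarity, but that is an assumption to check against its documentation. The names `toy_parents`, `toy_ancestors()`, `toy_ic`, and `toy_similarity()` are hypothetical, and the hierarchy is a simplification rather than the real Cell Ontology.

```r
# Toy hierarchy (hypothetical): each term maps to its direct parent(s)
toy_parents <- list(
  "cell" = character(0),
  "leukocyte" = "cell",
  "lymphocyte" = "leukocyte",
  "T cell" = "lymphocyte",
  "memory T cell" = "T cell",
  "B cell" = "lymphocyte"
)

# All ancestors of a term, including the term itself
toy_ancestors <- function(term, parents) {
  found <- term
  queue <- parents[[term]]
  while (length(queue) > 0) {
    found <- union(found, queue)
    queue <- setdiff(unlist(parents[queue], use.names = FALSE), found)
  }
  found
}

# Number of descendants (including the term itself) for every term
n_descendants <- vapply(
  names(toy_parents),
  function(term) {
    sum(vapply(
      names(toy_parents),
      function(x) term %in% toy_ancestors(x, toy_parents),
      logical(1)
    ))
  },
  integer(1)
)

# Information content: terms with fewer descendants are more specific
toy_ic <- -log(n_descendants / length(toy_parents))

# Lin-style similarity: information of the most informative common
# ancestor relative to the information of the two terms themselves
toy_similarity <- function(term_a, term_b) {
  common <- intersect(
    toy_ancestors(term_a, toy_parents),
    toy_ancestors(term_b, toy_parents)
  )
  2 * max(toy_ic[common]) / (toy_ic[[term_a]] + toy_ic[[term_b]])
}

toy_similarity("B cell", "B cell")        # identical terms
toy_similarity("T cell", "memory T cell") # parent and child
toy_similarity("T cell", "B cell")        # siblings under lymphocyte
```

In this toy example identical terms score 1, and the parent and child pair scores higher than the sibling pair, which is the kind of ordering the similarity index is meant to capture. The root term is left out of the examples because its information content is 0 here, which would make the ratio undefined when comparing the root with itself.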
+si_df <- si_df |>
+ dplyr::mutate(
+ lca_threshold = dplyr::if_else(total_descendants <= cutoff, "PASS", "FAIL")
+ )
+ggplot(si_df, aes(x = similarity_index, fill = lca_threshold)) +
+ geom_density(bw = 0.05, alpha = 0.5) +
+ labs(
+ x = "Similarity index",
+ y = "Density"
+ )
+celltypes_to_plot <- c("myeloid leukocyte", "T cell", "cell", "supporting cell", "B cell")
+celltypes_to_plot |>
+ purrr::map(\(celltype){
+ line_df <- si_df |>
+ dplyr::filter(cl_annotation == celltype) |>
+ dplyr::select(cl_annotation, similarity_index) |>
+ unique()
+ ggplot(si_df, aes(x = similarity_index)) +
+ geom_density() +
+ geom_vline(data = line_df,
+ mapping = aes(xintercept = similarity_index),
+ lty = 2) +
+ labs(
+ x = "Similarity index",
+ y = "Density",
+ title = celltype
+ )
+ })
+## [[1]]
+## [[2]]
+## [[3]]
+## [[4]]
+## [[5]]
Keep neuron even though it has 500 descendants.
Exclude terms that are too broad: supporting cell, blood cell, bone cell, and lining cell.
Session info
+# epithelial cell
+print_df |>
+ dplyr::filter(cl_annotation == "epithelial cell")
+Epithelial cells
+Epithelial cells
+acinar cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+chromaffin cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+enteroendocrine cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+ependymal cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+goblet cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+granulosa cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+mesothelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+myoepithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+neuroendocrine cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+olfactory epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+paneth cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+parietal cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+peritubular myoid cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell
+Epithelial cells
+Epithelial cells
+taste receptor cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+urothelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+respiratory epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+respiratory goblet cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+pancreatic A cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+type B pancreatic cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+choroid plexus epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+club cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+intestinal crypt stem cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+pancreatic D cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell of distal tubule
+epithelial cell
+Epithelial cells
+Epithelial cells
+pancreatic ductal cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+enterochromaffin-like cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+pancreatic epsilon cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+thyroid follicular cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+foveolar cell of stomach
+epithelial cell
+Epithelial cells
+Epithelial cells
+PP cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+peptic cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+type I cell of carotid body
+epithelial cell
+Epithelial cells
+Epithelial cells
+renal intercalated cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+kidney loop of Henle epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+luminal epithelial cell of mammary gland
+epithelial cell
+Epithelial cells
+Epithelial cells
+mammary gland epithelial cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+Merkel cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+M cell of gut
+epithelial cell
+Epithelial cells
+Epithelial cells
+oxyphil cell of parathyroid gland
+epithelial cell
+Epithelial cells
+Epithelial cells
+chief cell of parathyroid gland
+epithelial cell
+Epithelial cells
+Epithelial cells
+renal principal cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+epithelial cell of proximal tubule
+epithelial cell
+Epithelial cells
+Epithelial cells
+pulmonary alveolar type 1 cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+pulmonary alveolar type 2 cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+salivary gland glandular cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+acinar cell of sebaceous gland
+epithelial cell
+Epithelial cells
+Epithelial cells
+Sertoli cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+hair germinal matrix cell
+epithelial cell
+Epithelial cells
+Epithelial cells
+brush cell
+epithelial cell
+acinar cell
+epithelial cell
+epithelial cell
+epithelial cell
+enteroendocrine cell
+epithelial cell
+epithelial cell
+epithelial cell
+goblet cell
+epithelial cell
+granulosa cell
+epithelial cell
+epithelial cell
+epithelial cell
+myoepithelial cell
+epithelial cell
+paneth cell
+epithelial cell
+parietal cell
+epithelial cell
+epithelial cell
+taste receptor cell
+epithelial cell
+urothelial cell
+epithelial cell
+respiratory epithelial cell
+epithelial cell
+respiratory goblet cell
+epithelial cell
+pancreatic A cell
+epithelial cell
+type B pancreatic cell
+epithelial cell
+club cell
+epithelial cell
+intestinal crypt stem cell
+epithelial cell
+pancreatic D cell
+epithelial cell
+epithelial cell of distal tubule
+epithelial cell
+pancreatic ductal cell
+epithelial cell
+enterochromaffin-like cell
+epithelial cell
+pancreatic epsilon cell
+epithelial cell
+thyroid follicular cell
+epithelial cell
+foveolar cell of stomach
+epithelial cell
+PP cell
+epithelial cell
+peptic cell
+epithelial cell
+renal intercalated cell
+epithelial cell
+kidney loop of Henle epithelial cell
+epithelial cell
+luminal epithelial cell of mammary gland
+epithelial cell
+mammary gland epithelial cell
+epithelial cell
+M cell of gut
+epithelial cell
+oxyphil cell of parathyroid gland
+epithelial cell
+chief cell of parathyroid gland
+epithelial cell
+renal principal cell
+epithelial cell
+epithelial cell of proximal tubule
+epithelial cell
+salivary gland glandular cell
+epithelial cell
+brush cell
+epithelial cell
+The PanglaoDB cell types seem to be more specific than the ones
+present in Blueprint Encode, similar to the observation with neurons. We
+should keep epithelial cell.
