Exploration notebook for creating a reference of consensus labels #936

allyhawkins · 2024-12-12T19:00:37Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #889

What is the goal of this pull request?

Here I'm adding a notebook that looks at defining consensus cell types between all possible cell type labels in the PanglaoDB and Blueprint Encode references. For each possible pair in the references, I computed the latest common ancestor (LCA), which is the latest ancestor that is shared between the two labels. Ultimately, I think we could use the LCA as the consensus label for any pairs that meet the criteria that we set (e.g., the LCA has a total number of descendants less than a set threshold). Any pairs that do not meet the criteria would get an "Unknown" label.
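For readers who want the LCA idea spelled out, here is a minimal sketch in Python on a toy hierarchy. It is not the notebook's code (the notebook is written in R with `ontologyIndex`). The `parents` mapping is a heavily simplified, made-up slice of an ontology, and the function names are hypothetical. The second example shows how a pair can end up with more than one LCA when terms have multiple parents.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def latest_common_ancestors(a, b, parents):
    """Return the shared ancestors of a and b that have no shared term below them."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        term
        for term in common
        # drop a shared term if it is a proper ancestor of another shared term
        if not any(term != other and term in ancestors(other, parents) for other in common)
    }


# Toy child -> parents mapping (assumption: simplified, not the real Cell Ontology)
parents = {
    "cell": [],
    "leukocyte": ["cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "CD8 T cell": ["T cell"],
    # generic terms with two parents each, to show a pair with two LCAs
    "group A": ["cell"],
    "group B": ["cell"],
    "term X": ["group A", "group B"],
    "term Y": ["group A", "group B"],
}

single_lca = latest_common_ancestors("CD4 T cell", "CD8 T cell", parents)
broader_lca = latest_common_ancestors("CD4 T cell", "B cell", parents)
multiple_lca = latest_common_ancestors("term X", "term Y", parents)
```

In this toy example the first pair resolves to `T cell`, the second to the broader `lymphocyte`, and the third pair has two LCAs (`group A` and `group B`).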

Briefly describe the general approach you took to achieve this goal.

The first section just looks at the total number of ancestors and descendants in the whole cell ontology. This gives us an idea of what the possible values are. I also labeled some cell types of varying granularity to see where their value lies in comparison to all other possibilities.
Then I computed all possible pairs of labels between PanglaoDB and Blueprint Encode and found the LCA for each pair. Note that in some cases there are 2-3 LCAs. For each LCA, I looked at the total number of ancestors and descendants. Again, I plotted some cell types with different granularity, which helped me define a cutoff for the number of descendants. I wanted to keep T cell but lose lymphocyte, so I chose the cutoff to be the peak between those two terms.
Based on that cutoff I printed out all the terms that would get saved as consensus labels. I think most terms are helpful to keep, but there were other terms that I don't think we want to keep as a consensus label since they would not be very informative (like lining cell, supporting cell).
I looked at specific pairs that resulted in an LCA with a low number of descendants, but still seemed too broad to include as a consensus label. A lot of these cases were pairs that had more than one LCA, so I looked at adding that as a filtering step. If we remove terms that have more than one possible LCA and then use the threshold we lose some of the broader types like lining cell and supporting cell.
However, with that filtering we also lose hematopoietic precursor cell. Looking at those pairs, they all fall in the category of hematopoietic stem and progenitor cells, which I think is a useful label to have.
I also looked at the LCA cell types that we would lose with the threshold that we set, and the only one that I think is important to keep is neuron. This term has 500+ descendants, but I think knowing if a cell is a neuron is pretty useful information.
The last thing I did was look at the similarity index for all possible pairs. If we wanted to use a metric to define the pairs to label rather than a threshold for number of descendants this could be a possibility. However, I think we still would need to use the LCA as the consensus label, and if that label is too broad, even if the similarity index is close to 1, I don't think we would want to use it. I think if we want to be specific about each of the pairs that we label with a consensus label, then the similarity index would be helpful; otherwise, I don't know that we need this, so I didn't go too deep here (also, this was getting too long).

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

All the results are in the rendered copy of the notebook. I'm including it here for easy review:
01-reference-exploration.html.zip

What is your summary of the results?

Generally, I think we can use this to guide the creation of our own consensus label reference. I envision this being a table with all possible cell type label combinations between PanglaoDB and Blueprint Encode, plus a column with the LCA or consensus label. Alongside this, we would have a whitelist of acceptable LCA terms that either pass the threshold that we set for number of descendants or are granular enough that we would want to keep them. I'm wary of relying on the threshold alone, since terms like neuron would get lost while terms like lining cell would be kept.
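To make the envisioned table concrete, here is a rough Python sketch of its shape. Everything in it is a placeholder: the label names, the `lca_lookup` dictionary, and the `accepted_lcas` set are invented for illustration and do not come from either reference. The LCA for each pair is assumed to have been computed already.

```python
from itertools import product

# Hypothetical labels from each reference (placeholders, not real annotations)
panglao_labels = ["label P1", "label P2"]
blueprint_labels = ["label B1", "label B2"]

# Hypothetical precomputed LCA for every pair
lca_lookup = {
    ("label P1", "label B1"): "T cell",
    ("label P1", "label B2"): "lining cell",
    ("label P2", "label B1"): "lymphocyte",
    ("label P2", "label B2"): "neuron",
}

# Hypothetical list of LCA terms we would accept as consensus labels
accepted_lcas = {"T cell", "neuron"}


def build_reference(panglao, blueprint, lookup, accepted):
    """Build one row per label pair, with the LCA and the resulting consensus label."""
    rows = []
    for panglao_label, blueprint_label in product(panglao, blueprint):
        lca = lookup.get((panglao_label, blueprint_label))
        rows.append(
            {
                "panglao": panglao_label,
                "blueprint": blueprint_label,
                "lca": lca,
                # pairs whose LCA is not on the accepted list get "Unknown"
                "consensus": lca if lca in accepted else "Unknown",
            }
        )
    return rows


reference = build_reference(panglao_labels, blueprint_labels, lca_lookup, accepted_lcas)
```

Keeping the accepted terms in a separate list, rather than encoding them only as a descendant threshold, is what lets a broad but useful term like neuron stay while a narrow but uninformative term like lining cell is dropped.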

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I ran this locally using the renv environment.

Are there particular areas you'd like reviewers to have a close look at?

Are there other metrics or items that you think we should look at?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

allyhawkins · 2024-12-12T19:15:50Z

@jaclyn-taroni I'm going to request you for review since we talked about a lot of this beforehand, but please let me know if I should send it to someone else.

jaclyn-taroni

I will leave a review on the content of the notebook soon, but I am returning these comments about the environment now. I pushed the relevant CI workflow changes to this branch since @allyhawkins is out today, and it makes sense to me that we want to find out if we have a problem sooner rather than later!

jaclyn-taroni · 2024-12-16T13:25:03Z

analyses/cell-type-consensus/renv.lock

+      "Package": "AnnotationDbi",
+      "Version": "1.68.0",
+      "Source": "Bioconductor",
+      "Repository": "Bioconductor 3.20",


This is a different version of Bioconductor than the one specified at line 32. I can't be 100% sure this is responsible for my problems with renv::restore() locally, but it seems like a good candidate:

Warning: failed to find binary for 'BiocVersion 3.20.0' in package repositories
Warning: failed to find source for 'BiocVersion 3.20.0' in package repositories
Warning: error downloading 'https://bioconductor.org/packages/3.19/bioc/src/contrib/Archive/BiocVersion/BiocVersion_3.20.0.tar.gz' [error code 22]

Looks like the problem isn't just my local setup: https://github.com/AlexsLemonade/OpenScPCA-analysis/actions/runs/12354014198/job/34474381650?pr=936

Yes, the fact that there is a mix of Bioc versions in the renv seems likely to be a problem. I'm quite surprised (but not totally shocked) that renv allowed that, but I am quite confident it is the problem.

I think the move is most likely to revert (or delete) the renv.lock file, make sure the proper versions are set in renv/settings.json (they seem to be), then run renv::hydrate() to get a consistent set of packages.

Just sticking an observation here: session info reports R version 4.4.2

> Just sticking an observation here: session info reports R version 4.4.2

That's because I'm using R 4.4.2. That was the only version of R 4.4 I could find when I was setting up my new laptop that's available for arm (https://cran.r-project.org/bin/macosx/). I looked through the old versions and couldn't find a version for arm, only Intel, so maybe I'm looking in the wrong place?

Are we okay with using R 4.4.2 and BioC 3.20 for modules? The versions were mismatched because I had run the create-analysis.py script with the --renv flag, which initialized the lock file. But when I actually created the project and started installing packages, it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version, or should I try to change to R 4.4 and BioC 3.19?

The error I'm seeing now looks like a dependency error for igraph and is unrelated to this.

> I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version, or should I try to change to R 4.4 and BioC 3.19?

This seems fine.

> Are we okay with using R 4.4.2 and BioC 3.20 for modules? The versions were mismatched because I had run the create-analysis.py script with the --renv flag, which initialized the lock file. But when I actually created the project and started installing packages, it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version, or should I try to change to R 4.4 and BioC 3.19?

This seems like a bug we should fix... If our setup isn't forcing us to stick with the defined Bioconductor version, we could end up with a lot of trouble down the road.

(For the record, I am okay with R 4.4.2, but the difference between Bioc3.19 and 3.20 is potentially more significant)

> This seems like a bug we should fix... If our setup isn't forcing us to stick with the defined Bioconductor version, we could end up with a lot of trouble down the road.

This could have been a me problem. I probably didn't use renv::restore() when I first started writing code in this module, so I installed Bioconductor myself and it used the most recent version. So then I'm assuming when I installed any other packages it used the version of Bioconductor I had installed. If we want to use 3.19 then I can go back and delete everything and start again...

jaclyn-taroni · 2024-12-16T13:27:00Z

analyses/cell-type-consensus/renv.lock

Related to my comment about local trouble with renv, I think it's time to uncomment the parts of the workflow files that set up renv, etc. to check if things build/work in that context.

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

sjspielman · 2024-12-16T13:52:58Z

@allyhawkins, you might want to take care of relevant typos in #944 in this PR :)

jaclyn-taroni · 2024-12-16T13:56:03Z

If I had looked at the Dockerfile, I would have seen that that is not using renv yet. Oops.

jaclyn-taroni

I mostly agree with your conclusions. I am concerned about eliminating epithelial cell for having too many descendants.

jaclyn-taroni · 2024-12-16T15:52:32Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+    # list all ancestors and descendants calculate total
+    ancestors = list(ontologyIndex::get_ancestors(cl_ont, cl_ontology)),
+    total_ancestors = length(ancestors),
+    descendants = list(ontologyIndex::get_descendants(cl_ont, cl_ontology)),


I think this is unlikely to matter in practice, but I wonder if we want to be setting exclude_roots = TRUE here?
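To illustrate what the question is getting at: as far as I understand `ontologyIndex::get_descendants()`, the result includes the queried term itself unless `exclude_roots = TRUE`, so the totals would be one higher than the number of true descendants. The Python sketch below mimics that behavior on a made-up hierarchy. The `descendants` function and its `exclude_root` argument are hypothetical stand-ins, not the R API.

```python
def descendants(root, parents, exclude_root=False):
    """Collect every term below root; include root itself unless exclude_root is True."""
    # invert the child -> parents mapping into parent -> children
    children = {term: [] for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(child)

    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])

    if exclude_root:
        seen.discard(root)
    return seen


# Toy child -> parents mapping (assumption: simplified, not the real Cell Ontology)
parents = {
    "lymphocyte": [],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "CD8 T cell": ["T cell"],
}

total_with_root = len(descendants("T cell", parents))
total_without_root = len(descendants("T cell", parents, exclude_root=True))
```

Because the offset is the same for every term, it shifts any descendant cutoff by one but should not change the relative ordering of terms.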

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

jaclyn-taroni · 2024-12-16T17:49:34Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+  unique()
+```
+
+The only term in this list that I would be concerned about losing is "neuron". 


What about epithelial cell?

jaclyn-taroni · 2024-12-16T17:50:42Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+  dplyr::filter(cl_annotation == "blood cell") 
+```
+
+I think I'm in favor of not having a "blood cell" label, since I'm not sure that it's helpful. 


jaclyn-taroni · 2024-12-16T17:50:51Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+  dplyr::filter(cl_annotation == "bone cell")
+```
+
+I think I would also remove bone cell, since hematopoietic stem cells and osteoclasts seem pretty different to me. 


jaclyn-taroni · 2024-12-16T17:51:59Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+  dplyr::filter(cl_annotation == "myeloid leukocyte")
+```
+
+I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte. 


T and B cells are typically easier to tell apart based on gene expression alone than different cell types in the myeloid lineage (at least in bulk settings, in my experience), so I don't think it's quite the same thing as labeling something with lymphocyte.

jaclyn-taroni · 2024-12-16T17:53:04Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+
+I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte. 
+
+#### Progenitor cell


I would not include this one.

jaclyn-taroni · 2024-12-16T17:53:36Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+
+Same with `progenitor cell`, I do think it could be helpful to know that something may be a progenitor cell, but when you have a cell with the label for HSC and the label for cells like monocytes or osteoblasts, then maybe we are talking about a tumor cell instead. 
+
+Along those same lines, I think the below terms, `lining cell` and `supporting cell`, are too broad even though they have few descendants. 


Works for me.

jaclyn-taroni · 2024-12-16T18:02:17Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+
+I would remove `perivascular cell`, since the cell type labels from PanglaoDB and Blueprint are pretty different from each other. 
+
+## Similarity index 


I'm not convinced this is as useful as LCA + some simple rules (e.g., total descendants threshold with exceptions).

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

Co-authored-by: Jaclyn Taroni <[email protected]>

allyhawkins · 2024-12-17T19:27:10Z

In 0e89309, I added some notes about keeping myeloid leukocyte and a section to look at the epithelial cell matches. I think the only thing remaining here is figuring out CI, which I hope will be solved by updating the Dockerfile in #950 and running the module GHA on that image.

To summarize the rules we are putting in place right now:

- Remove all LCA terms that are a result of having > 1 LCA, with the exception of hematopoietic precursor cell.
- Keep all LCA terms with equal to or less than 170 descendants, with the exception of: bone cell, lining cell, blood cell, progenitor cell, and supporting cell.
- Keep neuron and epithelial cell.
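As a sanity check on how these rules compose, here is one possible way to encode them, sketched in Python rather than in the module's R. The function name, the order in which the rules are checked, and the descendant counts in `example_counts` are assumptions for illustration. Only the 170 cutoff and the listed term names come from the summary above. For a pair with more than one LCA, this sketch assigns hematopoietic precursor cell when it is among the LCAs, which is my reading of the exception.

```python
UNKNOWN = "Unknown"
MAX_DESCENDANTS = 170
MULTI_LCA_EXCEPTION = "hematopoietic precursor cell"
EXCLUDED_TERMS = {"bone cell", "lining cell", "blood cell", "progenitor cell", "supporting cell"}
ALWAYS_KEEP = {"neuron", "epithelial cell"}


def consensus_label(lcas, n_descendants):
    """Pick a consensus label from the set of LCAs for one pair of labels."""
    if not lcas:
        return UNKNOWN
    if len(lcas) > 1:
        # pairs with more than one LCA are dropped, except for the named exception
        return MULTI_LCA_EXCEPTION if MULTI_LCA_EXCEPTION in lcas else UNKNOWN
    lca = next(iter(lcas))
    if lca in ALWAYS_KEEP:
        # kept even though they exceed the descendant threshold
        return lca
    if lca in EXCLUDED_TERMS:
        # too broad to be informative even with few descendants
        return UNKNOWN
    return lca if n_descendants[lca] <= MAX_DESCENDANTS else UNKNOWN


# Made-up descendant counts, only to exercise each rule
example_counts = {"T cell": 60, "lymphocyte": 200, "neuron": 500, "lining cell": 40}

kept = consensus_label({"T cell"}, example_counts)
too_broad = consensus_label({"lymphocyte"}, example_counts)
exception = consensus_label({"hematopoietic precursor cell", "progenitor cell"}, example_counts)
```

With these made-up counts, `T cell` is kept, `lymphocyte` becomes "Unknown", and the multi-LCA pair resolves to hematopoietic precursor cell.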

allyhawkins · 2024-12-17T21:18:20Z

This is now passing checks so should be good to go.

jaclyn-taroni

I have one comment about when epithelial cell is an appropriate label that I'd like to see make it to the conclusions. Otherwise, LGTM.

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

jaclyn-taroni · 2024-12-17T22:08:18Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+```
+
+The PanglaoDB cell types seem to be more specific than the ones present in Blueprint Encode, similar to the observation with neurons.
+We should keep epithelial cell. 


When we codify this, I would say epithelial cell is only acceptable when the Blueprint annotation is Epithelial cells because it's likely just a matter of PanglaoDB being much more specific. When the Blueprint annotation is Keratinocytes, things seem to get a bit weird.

jaclyn-taroni · 2024-12-17T22:10:01Z

analyses/cell-type-consensus/exploratory-notebooks/01-reference-exploration.Rmd

+
+- Pairs should not have more than 1 LCA, with the exception of the matches that have the label hematopoietic precursor cell. 
+- The LCA should have equal to or less than 170 total descendants. 
+- We whould include the term for `neuron` and `epithelial cell` even though they do not pass the threshold for number of descendants. 


See my comment above about when I think epithelial cell is acceptable vs. unacceptable.

Co-authored-by: Jaclyn Taroni <[email protected]>

allyhawkins · 2024-12-18T15:10:13Z

> I have one comment about when epithelial cell is an appropriate label that I'd like to see make it to the conclusions. Otherwise, LGTM.

I added a note underneath the table showing the combinations for epithelial cell and to the conclusion in 3676e8d. And I also noted it on #951 for when we go to create the reference. Going to go ahead and merge this in once checks pass.

allyhawkins added 6 commits December 6, 2024 10:18

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/on…

d58c028

…tology-exploration

add LCA to dictionary

9f10320

consensus labels from references notebook

770c1c8

add some headers for tables

6aa406b

rendered notebook

6e9a611

update lockfile

c011170

allyhawkins requested a review from jaclyn-taroni as a code owner December 12, 2024 19:00

jaclyn-taroni added 2 commits December 16, 2024 08:48

Start running cell-type-consensus module CI workflows

27dc591

Add cell-type-consensus to workflows pertaining to all modules

6357a44

jaclyn-taroni reviewed Dec 16, 2024

allyhawkins and others added 3 commits December 17, 2024 10:17

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/on…

e14f6a3

…tology-exploration

test updating lock file

7f65e64

Apply suggestions from code review

66cfcf5

Co-authored-by: Jaclyn Taroni <[email protected]>

allyhawkins mentioned this pull request Dec 17, 2024

Update dockerfile in consensus module to use renv #950

Merged

keep myeloid and epithelial

0e89309

allyhawkins added 2 commits December 17, 2024 14:11

run GHA on docker image

80264bd

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/on…

991be1d

…tology-exploration

allyhawkins mentioned this pull request Dec 17, 2024

Add aws CLI to docker image for consensus cell typing #954

Closed

allyhawkins added 3 commits December 17, 2024 15:12

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/on…

2fab7a5

…tology-exploration

install awscli

98eb9df

formatting

f1d165c

allyhawkins requested a review from jaclyn-taroni December 17, 2024 21:17

jaclyn-taroni approved these changes Dec 17, 2024

spell part

e3ad414

Co-authored-by: Jaclyn Taroni <[email protected]>

allyhawkins mentioned this pull request Dec 18, 2024

Create reference for consensus cell type labels #951

Closed

add note about keratinocytes

3676e8d

allyhawkins merged commit 695f37c into AlexsLemonade:main Dec 18, 2024
4 checks passed

allyhawkins deleted the allyhawkins/ontology-exploration branch December 18, 2024 15:17
