Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploration notebook for creating a reference of consensus labels #936

Merged

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #889

What is the goal of this pull request?

Here I'm adding a notebook that looks at defining consensus cell types between all possible cell type labels in the PanglaoDB and Blueprint Encode references. For each possible pair in the references, I computed the latest common ancestor (LCA), which is the latest ancestor that is shared between the two labels. Ultimately, I think we could use the LCA as the consensus label for any pairs that meet the criteria that we set (e.g., the LCA has a total number of descendants less than a set threshold). Any pairs that do not meet the criteria would get an "Unknown" label.

Briefly describe the general approach you took to achieve this goal.

  • The first section just looks at the total number of ancestors and descendants in the whole cell ontology. This gives us an idea of what the possible values are. I also labeled some cell types of varying granularity to see where their value lies in comparison to all other possibilities.
  • Then I computed all possible pairs of labels between PanglaoDB and Blueprint Encode and found the LCA for each pair. Note that in some cases there are 2-3 LCA. For each LCA, I looked at the total number of ancestors and descendants. Again, I plotted some cell types with different granularity which helped me define a cutoff for the number of descendants. I wanted to keep T cell but lose lymphocyte so I chose the cutoff to be the peak between those two terms.
  • Based on that cutoff I printed out all the terms that would get saved as consensus labels. I think most terms are helpful to keep, but there were other terms that I don't think we want to keep as a consensus label since they would not be very informative (like lining cell, supporting cell).
  • I looked at specific pairs that resulted in an LCA with a low number of descendants, but still seemed too broad to include as a consensus label. A lot of these cases were pairs that had more than one LCA, so I looked at adding that as a filtering step. If we remove terms that have more than one possible LCA and then use the threshold we lose some of the broader types like lining cell and supporting cell.
  • However, with that filtering we also lose hematopoietic precursor cell and looking at those pairs they do all fall in the category of hematopoietic stem and progenitor cells, which I think is a useful label to have.
  • I also looked at LCA cell types that we would lose with the threshold that we set and the only one that I think is important to keep is neuron. This term has 500 + descendants, but I think knowing if a cell is a neuron is pretty useful information.
  • The last thing I did was look at the similarity index for all possible pairs. If we wanted to use a metric to define the pairs to label rather than a threshold for number of descendants this could be a possibility. However, I think we still would need to use the LCA as the consensus label, and if that label is too broad, even if the similarity index is close to 1, I don't think we would want to use it. I think if we want to be specific about each of the pairs that we label with a consensus label, then the similarity index would be helpful; otherwise, I don't know that we need this, so I didn't go too deep here (also, this was getting too long).

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

All the results are in the rendered copy of the notebook. I'm including it here for easy review:
01-reference-exploration.html.zip

What is your summary of the results?

Generally I think we can use this to guide creation of our own consensus label reference. I invision this being a table with all possible cell type label combinations between PanglaoDB and Blueprint Encode and then a column with the LCA or consensus label. Along side this, we would have a whitelist of acceptable LCA terms that either pass the threshold that we set for number of descendants or are granular enough that we would want to keep. I'm wary of just relying on the threshold since terms like neuron would get lost while terms like lining cell would be kept.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I ran this locally using the renv environment.

Are there particularly areas you'd like reviewers to have a close look at?

Are there other metrics or items that you think we should look at?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins
Copy link
Member Author

@jaclyn-taroni I'm going to request you for review since we talked about a lot of this beforehand, but please let me know if I should send it to someone else.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave a review on the content of the notebook soon, but I am returning these comments about the environment now. I pushed the relevant CI workflow changes to this branch since @allyhawkins is out today, and it makes sense to me that we want to find out if we have a problem sooner rather than later!

"Package": "AnnotationDbi",
"Version": "1.68.0",
"Source": "Bioconductor",
"Repository": "Bioconductor 3.20",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different version of Bioconductor than the one specified at line 32. I can't be 100% sure this is responsible for my problems with renv::restore() locally, but it seems like a good candidate:

Warning: failed to find binary for 'BiocVersion 3.20.0' in package repositories
Warning: failed to find source for 'BiocVersion 3.20.0' in package repositories
Warning: error downloading 'https://bioconductor.org/packages/3.19/bioc/src/contrib/Archive/BiocVersion/BiocVersion_3.20.0.tar.gz' [error code 22]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the fact that there is a mix of Bioc versions in the renv seems likely to be a problem. I'm quite surprised (but not totally shocked) that renv allowed that, but I am quite confident it is the problem.

I think the move is most likely to revert (or delete) the renv.lock file, make sure the proper versions are set in renv/settings.json (they seem to be), then run renv::hydrate() to get a consistent set of packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sticking an observation here: session info reports R version 4.4.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sticking an observation here: session info reports R version 4.4.2

That's because I'm using R 4.4.2. That was the only version of R 4.4 I could find when I was setting up my new laptop that's available for arm (https://cran.r-project.org/bin/macosx/). I looked through the old versions and couldn't find a version for arm, only Intel, so maybe I'm looking in the wrong place?

Are we okay with using R 4.4.2 and BioC 3.20 for modules? The mismatching versions was because I had run the create-analysis.py script with the --renv flag which initiated the lock file. But when I actually created the project and started installing packages it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?

The error I'm seeing now looks like a dependency error for igraph and is unrelated to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?

This seems fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we okay with using R 4.4.2 and BioC 3.20 for modules? The mismatching versions was because I had run the create-analysis.py script with the --renv flag which initiated the lock file. But when I actually created the project and started installing packages it used 3.20. I was able to fix the errors you were seeing by adjusting the lock file to have R 4.4.2 and Bioconductor 3.20. I just want to be sure we are okay with keeping this version or should I try and change to R 4.4 and BioC 3.19?

This like a bug we should fix... If our setup isn't forcing sticking with the defined Bioconductor version, we could end up with a lot of troubled down the road.

(For the record, I am okay with R 4.4.2, but the difference between Bioc3.19 and 3.20 is potentially more significant)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This like a bug we should fix... If our setup isn't forcing sticking with the defined Bioconductor version, we could end up with a lot of troubled down the road.

This could have been a me problem. I probably didn't use renv::restore() when I first started writing code in this module, so I installed Bioconductor myself and it used the most recent version. So then I'm assuming when I installed any other packages it used the version of Bioconductor I had installed. If we want to use 3.19 then I can go back and delete everything and start again...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my comment about local trouble with renv, I think it's time uncomment the parts of the workflow files that set up renv, etc. to check if things build/work in that context.

@sjspielman
Copy link
Member

@allyhawkins, you might want to take care of relevant typos in #944 in this PR :)

@jaclyn-taroni
Copy link
Member

If I had looked at the Dockerfile, I would have seen that that is not using renv yet. Oops.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree with your conclusions. I am concerned about eliminating epithelial cell for having too many descendants.

# list all ancestors and descendants calculate total
ancestors = list(ontologyIndex::get_ancestors(cl_ont, cl_ontology)),
total_ancestors = length(ancestors),
descendants = list(ontologyIndex::get_descendants(cl_ont, cl_ontology)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unlikely to matter in practice, but I wonder if we want to be setting exclude_roots = TRUE here?

unique()
```

The only term in this list that I would be concerned about losing is "neuron".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about epithelial cell?

dplyr::filter(cl_annotation == "blood cell")
```

I think I'm in favor of not having a "blood cell" label, since I'm not sure that it's helpful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

dplyr::filter(cl_annotation == "bone cell")
```

I think I would also remove bone cell, since hematopoietic stem cells and osteoclasts seem pretty different to me.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

dplyr::filter(cl_annotation == "myeloid leukocyte")
```

I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T and B cells are typically easier to tell apart based on gene expression alone than different cell types in the myeloid lineage (at least in bulk settings, in my experience), so I don't think it's quite the same thing as labeling something with lymphocyte.


I'm torn on this one, because I do think it's helpful to know if something is of the myeloid lineage, but if we aren't keeping lymphocyte then I would argue we shouldn't keep myeloid leukocyte.

#### Progenitor cell
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not include this one.


Same with `progenitor cell`, I do think it could be helpful to know that something may be a progenitor cell, but when you have a cell with the label for HSC and the label for cells like monocytes or osteoblasts, then maybe we are talking about a tumor cell instead.

Along those same lines, I think the below terms, `lining cell` and `supporting cell`, are too broad even though they have few descendants.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me.


I would remove `perivascular cell`, since the cell type labels from PanglaoDB and Blueprint are pretty different from each other.

## Similarity index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is as useful as LCA + some simple rules (e.g., total descendants threshold with exceptions).

@allyhawkins
Copy link
Member Author

In 0e89309, I added some notes about keeping myeloid leukocyte and a section to look at the epithelial cell matches. I think the only thing remaining here is figuring out CI, which I hope will be solved by updating the Dockerfile in #950 and running the module GHA on that image.

To summarize the rules we are putting in place right now:

  • Remove all LCA terms that are a result of having > 1 LCA, with the exception of hematopoietic precursor cell.
  • Keep all LCA terms with equal to or less than 170 descendants with the exception of: bone cell, lining cell, blood cell, progenitor cell, and supporting cell
  • Keep neuron and epithelial cell

@allyhawkins
Copy link
Member Author

This is now passing checks so should be good to go.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one comment about when epithelial cell is an appropriate label that I'd like to see make it to the conclusions. Otherwise, LGTM.

```

The PanglaoDB cell types seem to be more specific than the ones present in Blueprint Encode, similar to the observation with neurons.
We should keep epithelial cell.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we codify this, I would say epithelial cell is only acceptable when the Blueprint annotation is Epithelial cells because it's likely just a matter of PanglaoDB being much more specific. When the Blueprint annotation is Keratinocytes, things seem to get a bit weird.


- Pairs should not have more than 1 LCA, with the exception of the matches that have the label hematopoietic precursor cell.
- The LCA should have equal to or less than 170 total descendants.
- We whould include the term for `neuron` and `epithelial cell` even though they do not pass the threshold for number of descendants.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above about when I think epithelial cell is acceptable vs. unacceptable.

Co-authored-by: Jaclyn Taroni <[email protected]>
@allyhawkins
Copy link
Member Author

I have one comment about when epithelial cell is an appropriate label that I'd like to see make it to the conclusions. Otherwise, LGTM.

I added a note underneath the table showing the combinations for epithelial cell and to the conclusion in 3676e8d. And I also noted it on #951 for when we go to create the reference. Going to go ahead and merge this in once checks pass.

@allyhawkins allyhawkins merged commit 695f37c into AlexsLemonade:main Dec 18, 2024
4 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/ontology-exploration branch December 18, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exploratory notebook for defining consensus cell types
4 participants