Notebook for exploring consensus cell type labels across ScPCA samples #999

allyhawkins · 2025-01-22T19:20:42Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #970

What is the goal of this pull request?

Here I am adding a notebook that looks at the output from running cell-type-consensus in OpenScPCA-nf. The output from this module is a TSV file for each project with all cells in all libraries and the assigned cell type annotations. The notebook added here reads in all the TSV files and creates a few summary plots to look at the assigned cell types across all samples.

The main things we would like to know are:

- Is everything just getting labeled as Unknown? If not, what percentage of cells are we able to assign a label to?
- What are the top cell types identified?

Briefly describe the general approach you took to achieve this goal.

I copied over the outputs from the cell-type-consensus module to results. Note that these results are currently in staging and TSVs are on a project level. We do want to modify this workflow to have a single TSV for each sample, so when we do that we might also want to change things here (see Modify consensus module to output results at a sample level OpenScPCA-nf#114).
I wrote a function to read in each of the TSVs and summarize the numbers of cells assigned to each cell type and associated statistics that could be used for plotting. Then I created a dataframe with all summary stats across all projects that is used to create all of the individual plots.
I made a few overall summary plots and in fact we do identify cells that have a consensus between the two methods! Some projects are better than others, which makes sense to me, but overall we are getting some labels that I think are meaningful.
I also made a stacked bar chart showing the distribution of the top cell types identified in each project. Here a "top" cell type is based on how many libraries in a project have at least one cell identified with that cell type. I show the top 9 cell types, including "Unknown", and then group everything else as "All remaining cell types". One note for these plots: because the actual cell types differ in each project, the colors will be slightly different across projects, with the exception of "Unknown" and "All remaining cell types".
I added a table with the project ID and a summary term for the diagnoses found in that project. I did this so that I could have some more informative labels for these plots other than just the project ID. These terms are just taken from the project titles.
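
The summarizing and "top" cell type steps described above can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the column names (`library_id`, `consensus_annotation`), the toy data, and the choice to keep "Unknown" separate from the ranking are all assumptions, and the real notebook reads the per-project TSVs rather than building a table by hand.

```r
library(dplyr)

# Toy stand-in for the combined per-cell results (column names are assumed)
cell_df <- tibble::tribble(
  ~library_id, ~consensus_annotation,
  "lib1", "T cell",
  "lib1", "T cell",
  "lib1", "Unknown",
  "lib2", "T cell",
  "lib2", "fibroblast",
  "lib2", "Unknown",
  "lib3", "B cell",
  "lib3", "fibroblast",
  "lib3", "Unknown"
)

# Number and percentage of cells assigned to each cell type within each library
summary_df <- cell_df |>
  count(library_id, consensus_annotation, name = "cells_per_annotation") |>
  group_by(library_id) |>
  mutate(percent_cells = cells_per_annotation / sum(cells_per_annotation) * 100) |>
  ungroup()

# Rank cell types by the number of libraries with at least one cell of that type;
# each library/cell type pair is one row in summary_df, so counting rows works
n_top <- 2 # hypothetical cutoff for this toy example
top_celltypes <- summary_df |>
  filter(consensus_annotation != "Unknown") |>
  count(consensus_annotation, name = "n_libraries", sort = TRUE) |>
  slice_head(n = n_top) |>
  pull(consensus_annotation)

# Keep the top cell types and "Unknown"; lump everything else together
summary_df <- summary_df |>
  mutate(
    top_celltype = if_else(
      consensus_annotation %in% c(top_celltypes, "Unknown"),
      consensus_annotation,
      "All remaining cell types"
    )
  )
```

Ranking by number of libraries rather than total cells keeps one very large library from dominating which cell types are shown.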

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

Here's a copy of the rendered report for easy reviewing:
02-explore-consensus-results.nb.html.zip

What is your summary of the results?

I think CellAssign might have been useful!
Generally I see a higher proportion of cells assigned a cell type in leukemia and brain samples than in solid tumors, which I would expect. We also see that leukemias tend to have immune cell types, while solid tumors have more fibroblasts and muscle cells.

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jaclyn-taroni · 2025-01-23T13:19:08Z

@allyhawkins – I took an initial look at this, and I think we'd benefit from developing a strategy for a unified color palette. The same cell type gets different colors across projects of the same cancer type, which is a little bit of an obstacle for interpretation.

allyhawkins · 2025-01-23T16:08:38Z

> @allyhawkins – I took an initial look at this, and I think we'd benefit from developing a strategy for a unified color palette. The same cell type gets different colors across projects of the same cancer type, which is a little bit of an obstacle for interpretation.

If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?
Note that to get below 20 cell types we would have to decrease to 5 cell types.

Here's an updated report:
02-explore-consensus-results.nb.html.zip
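
For readers following along, one way a unified palette like this could be set up is sketched below. The cell type names and the two grey values are placeholders, not the values used in the notebook; `"Alphabet"` is one of the built-in palettes available through `palette.colors()` in R 4.0 and later, and it provides 26 colors.

```r
# Hypothetical set of cell types appearing across all projects
celltypes <- c("T cell", "B cell", "fibroblast", "muscle cell", "neuron")

# The Alphabet palette has 26 colors, so check that every cell type can get its own
stopifnot(length(celltypes) <= 26)

# Assign one color per cell type; drop the palette's own color names first
celltype_colors <- palette.colors(n = length(celltypes), palette = "Alphabet") |>
  unname() |>
  setNames(celltypes)

# Greys for the two catch-all categories go at the end (grey values are assumptions)
all_colors <- c(
  celltype_colors,
  "Unknown" = "grey60",
  "All remaining cell types" = "grey95"
)
```

Because the vector is named, passing it as `values` to `ggplot2::scale_fill_manual()` matches colors to cell types by name, so a given cell type keeps the same color in every project's plot regardless of which other cell types are present.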

jashapiro

As this is an exploratory notebook, I think this does a good job of illustrating the main points you wanted to convey. There are a few places where I think that you might actually have a few extra plots that are not needed, and one plot that I would modify to aid interpretability (the second sina plot I think should be histograms, perhaps?).

As @jaclyn-taroni noted, I do think unifying the plot colors would be useful. I guess the question there is exactly how many colors you need to deal with; it seems that many of the sets are similar, which might be a good place to start. Maybe if there is a good set of 10-15 cell types you could use across all plots and then use the "Other" set more liberally? I don't think that the precise cell types are particularly important for the bar plots, so having more classified as other is probably not going to be a big deal? (Update: I see that you changed this and added a new color palette, so I am going to have to have another look soon.)

Other than that, I had a number of little style/code suggestions. Most of those could be considered optional.

jashapiro · 2025-01-23T14:05:53Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+
+# stacked bar chart showing the distribution of the top 9 cell types for each project, including Unknown
+celltype_distribution_plot <- function(label,
+                                       results_df = all_results_df){


It is a bit strange to have a default value here which has not been defined. I would probably not include a default here.

Suggested change

results_df = all_results_df){

results_df){

Another option here, because this function is really only used in one place, is to define the function inline. I know we don't usually do that, but in this case, where it is closely related to the only place the function is used (and you are likely to be modifying it and looking at the results in that location), I think it may actually be worthwhile.

jashapiro · 2025-01-23T14:06:43Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+
+
+  # get color assignments 
+  # put grey for unknoqn and all remaining cells at the end of list since they will show up last 


Suggested change

# put grey for unknoqn and all remaining cells at the end of list since they will show up last

# put grey for unknown and all remaining cells at the end of list since they will show up last

jashapiro · 2025-01-23T14:08:43Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+
+  # filter to get only results for the specified project 
+  plot_df <- results_df |> 
+    dplyr::filter(project_label == {{label}}) |>


Is {{}} needed here? The input seems like it is a string here, so I would not expect it.

Suggested change

dplyr::filter(project_label == {{label}}) |>

dplyr::filter(project_label == label) |>
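
As background on the question above, here is a small sketch of when `{{ }}` is and is not needed in `dplyr::filter()`. The data frame and function names are made up for illustration; only the `project_label` column name comes from the diff.

```r
library(dplyr)

# Toy data with the same column name used in the notebook
results_df <- tibble::tibble(
  project_label = c("Leukemia", "Leukemia", "Brain"),
  n_cells = c(10, 20, 30)
)

# `label` is an ordinary string argument, so it can be compared directly
filter_project <- function(df, label) {
  df |> filter(project_label == label)
}

# If the data frame could also contain a column named `label`,
# the .env pronoun makes it explicit that the function argument is meant
filter_project_env <- function(df, label) {
  df |> filter(project_label == .env$label)
}

# {{ }} is for forwarding an unquoted column name supplied by the caller
filter_by_column <- function(df, column, value) {
  df |> filter({{ column }} == value)
}

filter_project(results_df, "Leukemia")
filter_by_column(results_df, project_label, "Brain")
```

With a string argument, embracing it in `{{ }}` adds nothing; the operator matters only when the argument stands for a column of the data frame.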

jashapiro · 2025-01-23T14:22:41Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+  colors <- c(
+    palette.colors(palette = "Set1")[1:num_assigned_celltypes], # first get colors for cell types that are not unknown or all remaining
+    "grey95" # level for unknown
+  ) 


Agreeing with @jaclyn-taroni here: we probably want to set up some default colors so that the cell types are consistent. Though part of me wonders whether we really care about the specific cell types in these figures vs. just the proportion with consensus? I don't think the information is valueless, but I don't know whether these figures are the best way to represent it.

jashapiro · 2025-01-23T14:26:57Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+```{r packages}
+suppressPackageStartupMessages({
+  # load required packages
+  library(ggplot2)
+})
+
+# Set default ggplot theme
+theme_set(
+  theme_classic()
+)
+```


Since you are using ggplot2 in your functions, I would load the package first. Just how my brain wants it.

jashapiro · 2025-01-23T14:53:55Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+```{r, fig.height=5}
+# pivot for plotting so above and below count is in the same column 
+high_tumor_df <- high_tumor_df |> 
+  tidyr::pivot_longer(
+    cols = c("all_unknown", "classified_cells"),
+    names_to = "category",
+    values_to = "number_of_samples"
+  )
+
+ggplot(high_tumor_df, aes(x = project_label, y = number_of_samples, fill = category)) +
+  geom_bar(position = "stack", stat= "identity") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = rel(0.9))) +
+  labs(
+    x = "", 
+    y = "Number of samples",
+    fill = ""
+  ) +
+  scale_fill_manual(values = c( "#FFCB05", "#00274C"))
+```


I don't think this plot is particularly needed. The table seems sufficient, especially with the previous sina plot.

jashapiro · 2025-01-23T15:07:05Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+project_labels |> 
+  purrr::map(celltype_distribution_plot)


Because I find the list text between plots annoying, perhaps:

Suggested change

project_labels |>
  purrr::map(celltype_distribution_plot)

project_labels |>
  purrr::map(celltype_distribution_plot) |>
  patchwork::wrap_plots()
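
To show what this suggestion does in isolation, here is a minimal sketch with toy data. The data frame, its columns, and the layout options are assumptions; the notebook's own `celltype_distribution_plot()` is replaced by an anonymous function so that the example is self-contained.

```r
library(ggplot2)

# Toy summary table standing in for the notebook's results
toy_df <- data.frame(
  project_label = rep(c("Project A", "Project B"), each = 2),
  celltype = rep(c("T cell", "Unknown"), times = 2),
  percent_cells = c(70, 30, 40, 60)
)

# One stacked bar per project, returned as a list of ggplot objects
plot_list <- unique(toy_df$project_label) |>
  purrr::map(\(label) {
    toy_df |>
      dplyr::filter(project_label == label) |>
      ggplot(aes(x = project_label, y = percent_cells, fill = celltype)) +
      geom_col() +
      labs(title = label, x = "", y = "Percent of cells")
  })

# wrap_plots() accepts the list directly and returns a single combined plot,
# so the notebook prints one figure instead of a list with index text between plots
combined_plot <- patchwork::wrap_plots(plot_list, ncol = 1, guides = "collect")
combined_plot
```

With many projects, a single combined figure may need a larger `fig.height` in the chunk options to stay readable.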

jashapiro · 2025-01-23T16:18:57Z

> If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?
> Note that to get below 20 cell types we would have to decrease to 5 cell types.

This looks pretty good to me as far as the number of cell types you are showing and the exploratory nature of the notebook.

jaclyn-taroni · 2025-01-23T16:49:28Z

> > If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?
> >
> > Note that to get below 20 cell types we would have to decrease to 5 cell types.
>
> This looks pretty good to me as far as the number of cell types you are showing and the exploratory nature of the notebook.

Agreed

allyhawkins · 2025-01-23T18:49:56Z

@jashapiro and @jaclyn-taroni I made some minor changes based on Josh's review, including removing the maize and blue stacked plot (😢), using histograms for the number of cells, and making the distribution plots a little prettier when they print. This should be ready for another review.

02-explore-consensus-results.nb.html.zip

jashapiro

Looks good! I will just note that the html file you shared was not a complete render (the final table was missing, at least), so you might want to rerun the notebook and commit any changes before merging. I didn't check the version that was actually part of this commit, though.

jashapiro · 2025-01-23T20:12:12Z

analyses/cell-type-consensus/exploratory-notebooks/02-explore-consensus-results.Rmd

+  purrr::map(\(label){
+
+    project_df <- plot_df |> 
+      dplyr::filter(project_label == {{label}}) |> 


I don't think you answered whether the {{ was needed, but I don't think it is?

Suggested change

dplyr::filter(project_label == {{label}}) |>

dplyr::filter(project_label == label) |>

allyhawkins

Sorry, I had this here in the function because it was yelling at me, but it's not doing that anymore, so I'll remove it.

jaclyn-taroni

LGTM for an exploratory notebook! I expect we might want to discuss polishing these figures eventually.

allyhawkins added 6 commits January 22, 2025 12:46

notebook for exploring consensus results

121e735

add table with diagnosis for each project

c6fb8e9

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/explore-consensus-labels

2640f12

add forcats and ggforce to lock file

d37fc18

add notebook readme

896cacc

add sample info directory

faa7c0e

allyhawkins requested a review from jaclyn-taroni as a code owner January 22, 2025 19:20

allyhawkins requested a review from jashapiro January 22, 2025 19:20

use alphabet palette

5c476b8

jashapiro reviewed Jan 23, 2025

allyhawkins and others added 2 commits January 23, 2025 12:23

Apply suggestions from code review

32dda67

Co-authored-by: Joshua Shapiro <[email protected]>

adjust plots based on review

76b76fb

allyhawkins requested a review from jashapiro January 23, 2025 18:50

jashapiro approved these changes Jan 23, 2025

make sure last table is included in render and remove brackets

e9a8340

jaclyn-taroni approved these changes Jan 24, 2025

allyhawkins merged commit b4986b1 into AlexsLemonade:main Jan 24, 2025
5 checks passed

allyhawkins deleted the allyhawkins/explore-consensus-labels branch January 24, 2025 15:16
