Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notebook for exploring consensus cell type labels across ScPCA samples #999

Merged

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #970

What is the goal of this pull request?

Here I am adding a notebook that looks at the output from running cell-type-consensus in OpenScPCA-nf. The output from this module is a TSV file for each project with all cells in all libraries and the assigned cell type annotations. The notebook added here reads in all the TSV files and creates a few summary plots to look at the assigned cell types across all samples.

The main things we would like to know are:

  • Is everything just getting labeled as Unknown? If not, what percentage of cells are we able to assign a label to?
  • What are the top cell types identified?

Briefly describe the general approach you took to achieve this goal.

  • I copied over the outputs from the cell-type-consensus module to results. Note that these results are currently in staging and TSVs are on a project level. We do want to modify this workflow to have a single TSV for each sample, so when we do that we might also want to change things here (see Modify consensus module to output results at a sample level OpenScPCA-nf#114).
  • I wrote a function to read in each of the TSVs and summarize the numbers of cells assigned to each cell type and associated statistics that could be used for plotting. Then I created a dataframe with all summary stats across all projects that is used to create all of the individual plots.
  • I made a few overall summary plots and in fact we do identify cells that have a consensus between the two methods! Some projects are better than others, which makes sense to me, but overall we are getting some labels that I think are meaningful.
  • I also made a stacked bar chart showing the distribution of the top cell types identified in each project. Here a "top" cell type is based on how many libraries in a project have at least one cell identified with that cell type. I show the top 9 cell types, including "Unknown", and then group everything else as "All remaining cell types". Just a note for these plots is that because the actual cell types will be different in each project, the colors are going to be slightly different across projects, with the exception of "Unknown" and "All remaining cell types".
  • I added a table with the project ID and a summary term for the diagnoses found in that project. I did this so that I could have some more informative labels for these plots other than just the project ID. These terms are just taken from the project titles.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes

Results

What types of results does your code produce (e.g., table, figure)?

Here's a copy of the rendered report for easy reviewing:
02-explore-consensus-results.nb.html.zip

What is your summary of the results?

I think CellAssign might have been useful!
Generally I see a higher proportion of cell types assigned to leukemia and brain samples than solid tumors, which I would expect. We also see that leukemias tend to be immune cell types and solid tumors have more fibroblasts and muscle cells.

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni
Copy link
Member

@allyhawkins – I took an initial look at this, and I think we'd benefit from developing a strategy for a unified color palette. The same cell type gets different colors across projects of the same cancer type, which is a little bit of an obstacle for interpretation.

@allyhawkins
Copy link
Member Author

allyhawkins commented Jan 23, 2025

@allyhawkins – I took an initial look at this, and I think we'd benefit from developing a strategy for a unified color palette. The same cell type gets different colors across projects of the same cancer type, which is a little bit of an obstacle for interpretation.

If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?
Note that to get below 20 cell types we would have to decrease to 5 cell types.

Here's an updated report:
02-explore-consensus-results.nb.html.zip

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is an exploratory notebook, I think this does a good job of illustrating the main points you wanted to convey. There are a few places where I think that you might actually have a few extra plots that are not needed, and one plot that I would modify to aid intrepretability (the second sina plot I think should be histograms, perhaps?).

As @jaclyn-taroni noted, I do think unifying the plot colors would be useful. I guess the question there is exactly how many colors you need to deal with; it seems that many of the sets are similar, which might be a good place to start. Maybe if there is a good set of 10-15 cell types you could use across all plots and then use the "Other" set more liberally? I don't think that the precise cell types are particularly important for the bar plots, so having more classified as other is probably not going to be a big deal? (Update: I see that you changed this and added a new color palette, so I am going to have to have another look soon.)

Other than that, I had a number of little style/code suggestions. Most of those could be considered optional.


# stacked bar chart showing the distribution of the top 9 cell types for each project, including Unknown
celltype_distribution_plot <- function(label,
results_df = all_results_df){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit strange to have a default value here which has not been defined. I would probably not include a default here.

Suggested change
results_df = all_results_df){
results_df = all_results_df){

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option here, because this function is really only used in one place, is to define the function inline. I know we don't usually do that, but in this case, where it is closely related to the only place the function is used (and you are likely to be modifying it and looking at the results in that location), I think it may actually be worthwhile.



# get color assignments
# put grey for unknoqn and all remaining cells at the end of list since they will show up last
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# put grey for unknoqn and all remaining cells at the end of list since they will show up last
# put grey for unknown and all remaining cells at the end of list since they will show up last


# filter to get only results for the specified project
plot_df <- results_df |>
dplyr::filter(project_label == {{label}}) |>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is {{}} needed here? The input seems like it is a string here, so I would not expect it.

Suggested change
dplyr::filter(project_label == {{label}}) |>
dplyr::filter(project_label == {{label}}) |>

Comment on lines 73 to 76
colors <- c(
palette.colors(palette = "Set1")[1:num_assigned_celltypes], # first get colors for cell types that are not unknown or all remaining
"grey95" # level for unknown
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeing with @jaclyn-taroni here: we probably want to set up some default colors so that the cell types are consistent. Though part of me wonders whether we really care about the specific cell types in these figures vs. just the proportion with consensus? I don't think the information is valueless, but I don't know whether these figures are the best way to represent it.

Comment on lines 99 to 109
```{r packages}
suppressPackageStartupMessages({
# load required packages
library(ggplot2)
})

# Set default ggplot theme
theme_set(
theme_classic()
)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are using ggplot2 in your functions, I would load the package first. Just how my brain wants it.

Comment on lines 208 to 226
```{r, fig.height=5}
# pivot for plotting so above and below count is in the same column
high_tumor_df <- high_tumor_df |>
tidyr::pivot_longer(
cols = c("all_unknown", "classified_cells"),
names_to = "category",
values_to = "number_of_samples"
)

ggplot(high_tumor_df, aes(x = project_label, y = number_of_samples, fill = category)) +
geom_bar(position = "stack", stat= "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = rel(0.9))) +
labs(
x = "",
y = "Number of samples",
fill = ""
) +
scale_fill_manual(values = c( "#FFCB05", "#00274C"))
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this plot is particularly needed. The table seems sufficient, especially with the previous sina plot.

Comment on lines 265 to 266
project_labels |>
purrr::map(celltype_distribution_plot)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I find the list text between plots annoying, perhaps:

Suggested change
project_labels |>
purrr::map(celltype_distribution_plot)
project_labels |>
purrr::map(celltype_distribution_plot) |>
patchwork::wrap_plots()

@jashapiro
Copy link
Member

If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?
Note that to get below 20 cell types we would have to decrease to 5 cell types.

This looks pretty good to me as far as the number of cell types you are showing and the exploratory nature of the notebook.

@jaclyn-taroni
Copy link
Member

If we want to plot the top 8 cell types + Unknown + all remaining cell types for each project then we have 27 unique cell types. To assign unique colors, I'm using the alphabet palette. We could decrease the number of cell types that we show per project, but I think for exploratory purposes this is okay?

Note that to get below 20 cell types we would have to decrease to 5 cell types.

This looks pretty good to me as far as the number of cell types you are showing and the exploratory nature of the notebook.

Agreed

@allyhawkins
Copy link
Member Author

@jashapiro and @jaclyn-taroni I made some minor changes based on Josh's review, including removing the maize and blue stacked plot (😢), using histograms for the number of cells, and making the distribution plots a little prettier when they print. This should be ready for another review.

02-explore-consensus-results.nb.html.zip

@allyhawkins allyhawkins requested a review from jashapiro January 23, 2025 18:50
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I will just note that the html file you shared was not a complete render (the final table was missing, at least), so you might want to rerun the notebook and commit any changes before merging. I didn't check the version that was actually part of this commit, though.

purrr::map(\(label){

project_df <- plot_df |>
dplyr::filter(project_label == {{label}}) |>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you answered whether the {{ was needed, but I don't think it is?

Suggested change
dplyr::filter(project_label == {{label}}) |>
dplyr::filter(project_label == label) |>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I had this here in the function because it was yelling at me, but it's not doing that anymore, so I'll remove it.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for an exploratory notebook! I expect we might want to discuss polishing these figures eventually.

@allyhawkins allyhawkins merged commit b4986b1 into AlexsLemonade:main Jan 24, 2025
5 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/explore-consensus-labels branch January 24, 2025 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Summarize findings from consensus labels
3 participants