Script to run AUCell on all samples using EWS-FLI1 high/low gene signatures #998

allyhawkins · 2025-01-17T22:20:30Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #985

What is the goal of this pull request?

Here I am adding two new scripts, one to run AUCell on a single SCE object using a set of custom gene signatures for defining tumor cell states (mainly EWS-FLI1 high and EWS-FLI1 low) and a second script to run the first script on all samples in the Ewing project. Ultimately, we want to use the results from AUCell to help label cells based on the specific cell state they are in.

I am using the two custom gene sets that we have stored in references/gene_signatures that are marker gene lists for EWS-FLI1 high and low along with a set of MSigDB gene sets that we have identified from the literature as being potentially useful. It's pretty quick to run AUCell so I figured it wouldn't hurt to just use all the gene sets in that list.

I did not include genes in our two marker gene lists, visser-all-marker-genes.tsv and tumor-cell-state-markers.tsv since the gene lists for each cell type there are quite small and would probably skew the AUC results.

Briefly describe the general approach you took to achieve this goal.

Although we previously had a script for AUCell, it was for a pretty specific use case where we are defining tumor cells. I didn't want to have to make huge edits in that workflow and the code was different enough that I chose to make a new script for this.
This script takes in a single SCE object, the custom gene sets, and a percentage to use for determining the aucMaxRank, which I set to the default of 0.01.
The output of this script is a TSV file with AUC values for each cell and each gene set. I also chose to include the AUC threshold value that is reported by AUCell. I don't know that we will use it, but I think it could be helpful when we go to plot this data.
I then wrote a wrapper script that runs AUCell using these gene signatures on every sample in the project and saves each TSV to the results directory.
The last thing I did was update the documentation where necessary. I know this technically isn't a new "workflow", since we're just running one script, but I still followed a similar format in how I documented it and included it in the main README. Once we create our notebooks that go sample by sample, I would like to reorganize some of the things in this module and move things that aren't used anymore to a sub folder so that things are less crowded and easier to understand. But for now, I just added it to existing documentation.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes see #993

Results

What is the name of your results bucket on S3?

s3://researcher-211125375652-us-east-2/cell-type-ewings/results/aucell-ews-signatures

What types of results does your code produce (e.g., table, figure)?

TSV files with the AUC values

What is your summary of the results?

Coming next!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

…cell-script-tumor-cell-states

sjspielman

This all looks good to me, and I was able to run the script locally. Left some comments but nothing major, let me know if I can clarify!

sjspielman · 2025-01-21T15:49:01Z

analyses/cell-type-ewings/README.md

+## Using `AUCell` to calculate gene set signatures 
+
+The script, `run-aucell-ews-gene-signatures.sh`, is used to run `AUCell` with a set of custom gene signatures on all samples in SCPCP000015. 
+By default, AUC values for calculated for all gene signatures in [references/gene_signatures](../references/gene_signatures/) and a set of `MSigDB` signatures associated with high and low EWS-FLI1 expression. 


Suggested change

By default, AUC values for calculated for all gene signatures in [references/gene_signatures](../references/gene_signatures/) and a set of `MSigDB` signatures associated with high and low EWS-FLI1 expression.

By default, AUC values for calculated for all gene signatures in [references/gene_signatures](references/gene_signatures/) and a set of `MSigDB` signatures associated with high and low EWS-FLI1 expression.

analyses/cell-type-ewings/README.md

analyses/cell-type-ewings/exploratory_analysis/07-identify-max-rank-aucell.Rmd

analyses/cell-type-ewings/run-aucell-ews-signatures.sh

sjspielman · 2025-01-21T17:12:31Z

analyses/cell-type-ewings/scripts/aucell-ews-signatures/01-aucell.R

+  aucMaxRank = max_rank,
+  #BPPARAM = bp_param


Is this meant to be commented out?

sjspielman · 2025-01-21T17:23:11Z

analyses/cell-type-ewings/scripts/aucell-ews-signatures/01-aucell.R

+# grab msigdb gene sets 
+ews_gene_sets <- c(
+  "staege" = "STAEGE_EWING_FAMILY_TUMOR",
+  "miyagawa_up" = "MIYAGAWA_TARGETS_OF_EWSR1_ETS_FUSIONS_UP",
+  "miyagawa_down" = "MIYAGAWA_TARGETS_OF_EWSR1_ETS_FUSIONS_DN",
+  "zhang"= "ZHANG_TARGETS_OF_EWSR1_FLI1_FUSION", 
+  "riggi_up" = "RIGGI_EWING_SARCOMA_PROGENITOR_UP",
+  "riggi_down" = "RIGGI_EWING_SARCOMA_PROGENITOR_DN",
+  "kinsey_up" = "KINSEY_TARGETS_OF_EWSR1_FLII_FUSION_UP",
+  "kinsey_down"= "KINSEY_TARGETS_OF_EWSR1_FLII_FUSION_DN"
+)


I'm actually wondering now whether we want to make this a stand-alone TSV file to read in here, and we can format it into a named vector in the script since I think that may be more maintainable should anything change here.

I'm also wondering about the Wrenn msigdb sets as seen in references/README.md.. Do we want to use those here?

- [`HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION`](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION.html) - [`REGULATION_OF_EXTRACELLULAR_MATRIX_ORGANIZATION`](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOBP_REGULATION_OF_EXTRACELLULAR_MATRIX_ORGANIZATION.html)

analyses/cell-type-ewings/README.md

Co-authored-by: Stephanie Spielman <[email protected]>

allyhawkins · 2025-01-21T18:31:18Z

I'm actually wondering now whether we want to make this a stand-alone TSV file to read in here, and we can format it into a named vector in the script since I think that may be more maintainable should anything change here.

I'm also wondering about the Wrenn msigdb sets as seen in references/README.md.. Do we want to use those here?
- [`HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION`](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION.html)
- [`REGULATION_OF_EXTRACELLULAR_MATRIX_ORGANIZATION`](https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOBP_REGULATION_OF_EXTRACELLULAR_MATRIX_ORGANIZATION.html)

I updated this to include a TSV file msigdb-gene-sets.tsv that gets read in and used to grab the gene sets from MSigDB. I also added those two extra gene sets. To be honest, I just didn't want to deal with getting gene sets from different collections, but that's silly and we should include them which is easier to do with the TSV file! This involved some code updates to account for the fact that we have multiple collections, but nothing too crazy.

I re-ran the workflow and updated the results on S3 to reflect these changes. This should be ready for another look!

sjspielman

I caught a few bugs which need to be fixed and the code re-run accordingly, but otherwise I think this is good to go! I don't need to see this again unless you'd like to tag me back.

analyses/cell-type-ewings/run-aucell-ews-signatures.sh

analyses/cell-type-ewings/scripts/aucell-ews-signatures/01-aucell.R

sjspielman · 2025-01-21T19:28:58Z

analyses/cell-type-ewings/references/msigdb-gene-sets.tsv

+riggi_down	RIGGI_EWING_SARCOMA_PROGENITOR_DN	C2	CGP
+kinsey_up	KINSEY_TARGETS_OF_EWSR1_FLII_FUSION_UP	C2	CGP
+kinsey_down	KINSEY_TARGETS_OF_EWSR1_FLII_FUSION_DN	C2	CGP
+hallmark_EMT	HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION	H	


Can we actually add NA for the subcategory here instead of leaving empty? It does get read as NA within R, but seems safer to actually write it out here.

Yes, I'll change this but then will still keep the change to use NULL in the script.

analyses/cell-type-ewings/references/msigdb-gene-sets.tsv

analyses/cell-type-ewings/run-aucell-ews-signatures.sh

Co-authored-by: Stephanie Spielman <[email protected]>

allyhawkins · 2025-01-21T21:18:20Z

Noting that I made the updates requested, re-ran the script, and then re-uploaded results. I'll merge this in once checks pass!

allyhawkins added 9 commits January 17, 2025 15:51

fix typo

9a75299

script to run aucell with ews signatures

01568b1

wrapper script for running aucell on all samples

20839b9

document aucell script

5dee7c4

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/au…

4ca7adb

…cell-script-tumor-cell-states

relocate gene_sets column

131047b

some more docs updating

bb71ee5

doctoc update

bd12c6b

add to cI

4e39be1

allyhawkins requested a review from jaclyn-taroni as a code owner January 17, 2025 22:20

allyhawkins requested review from sjspielman and removed request for jaclyn-taroni January 17, 2025 22:20

sjspielman reviewed Jan 21, 2025

View reviewed changes

allyhawkins commented Jan 21, 2025

View reviewed changes

analyses/cell-type-ewings/README.md Outdated Show resolved Hide resolved

allyhawkins and others added 6 commits January 21, 2025 11:45

Apply suggestions from code review

c9ce74a

Co-authored-by: Stephanie Spielman <[email protected]>

re-render max rank

d086e03

fix misnamed script

cb3bc0d

add tsv for gene sets

55bea6e

use geneset tsv

118496a

add new file to shell script

55d743a

allyhawkins requested a review from sjspielman January 21, 2025 18:31

sjspielman approved these changes Jan 21, 2025

View reviewed changes

allyhawkins and others added 4 commits January 21, 2025 15:07

Apply suggestions from code review

f5a980e

Co-authored-by: Stephanie Spielman <[email protected]>

use NA in tsv

0bc7e40

check for null file

abedccc

fix docs

9bf02cb

allyhawkins merged commit 51032a5 into AlexsLemonade:main Jan 22, 2025
3 checks passed

allyhawkins deleted the allyhawkins/aucell-script-tumor-cell-states branch January 22, 2025 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Script to run AUCell on all samples using EWS-FLI1 high/low gene signatures #998

Script to run AUCell on all samples using EWS-FLI1 high/low gene signatures #998

allyhawkins commented Jan 17, 2025 •

edited

Loading

sjspielman left a comment

sjspielman Jan 21, 2025

sjspielman Jan 21, 2025

sjspielman Jan 21, 2025

allyhawkins commented Jan 21, 2025

sjspielman left a comment

sjspielman Jan 21, 2025

allyhawkins Jan 21, 2025

allyhawkins commented Jan 21, 2025 •

edited

Loading

	By default, AUC values for calculated for all gene signatures in [references/gene_signatures](../references/gene_signatures/) and a set of `MSigDB` signatures associated with high and low EWS-FLI1 expression.
	By default, AUC values for calculated for all gene signatures in [references/gene_signatures](references/gene_signatures/) and a set of `MSigDB` signatures associated with high and low EWS-FLI1 expression.

Script to run AUCell on all samples using EWS-FLI1 high/low gene signatures #998

Script to run AUCell on all samples using EWS-FLI1 high/low gene signatures #998

Conversation

allyhawkins commented Jan 17, 2025 • edited Loading

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Author checklists

Analysis module and review

Reproducibility checklist

sjspielman left a comment

Choose a reason for hiding this comment

sjspielman Jan 21, 2025

Choose a reason for hiding this comment

sjspielman Jan 21, 2025

Choose a reason for hiding this comment

sjspielman Jan 21, 2025

Choose a reason for hiding this comment

allyhawkins commented Jan 21, 2025

sjspielman left a comment

Choose a reason for hiding this comment

sjspielman Jan 21, 2025

Choose a reason for hiding this comment

allyhawkins Jan 21, 2025

Choose a reason for hiding this comment

allyhawkins commented Jan 21, 2025 • edited Loading

allyhawkins commented Jan 17, 2025 •

edited

Loading

allyhawkins commented Jan 21, 2025 •

edited

Loading