Check our gene names against EPIC's reference profile genes #125

sjspielman · 2025-01-22T22:40:11Z

This issue came out of discussion in #122.

In the epic-signature-genes.Rmd notebook (https://github.com/AlexsLemonade/scpca-paper-figures/blob/9e86f977e52a75cd139ea043fe70782880e15c9c/analysis/bulk-deconvolution/exploratory-notebooks/epic-signature-genes.Rmd), we check gene names in the EPIC signature against our gene names to ensure we have matching symbols.

However, we did not check the reference profile gene names, which contain the cell-type-specific information. There are 20-30K genes in those references, each with different variability such that more variable genes are more highly weighted. We should update this notebook to check how many of those gene names are mismatched with ours, and do what we can to match as many as possible. If there are lots of mismatches, it would be best to re-run EPIC with those changes and update associated exploration notebooks (forthcoming issues, as needed).

sjspielman · 2025-01-23T12:55:32Z

After some initial investigation, it seems we indeed have this issue, and quite seriously with BRef!

For TRef, 12% (2891/23686) of their gene symbols are not in our data.
For BRef, 39% (19582/49902) of their gene symbols are not in our data.
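As a rough illustration of how percentages like these can be computed, here is a minimal sketch in Python (the notebook itself is R). The function name and the toy symbols are hypothetical and are not taken from the EPIC references or from our data.

```python
def missing_symbols(reference_symbols, our_symbols):
    """Return the reference symbols absent from our data and the fraction missing."""
    reference = set(reference_symbols)
    missing = reference - set(our_symbols)
    # Guard against an empty reference to avoid dividing by zero
    fraction = len(missing) / len(reference) if reference else 0.0
    return missing, fraction


# Toy inputs for illustration only (hypothetical placeholder names)
toy_reference = ["GENE_A", "GENE_B", "GENE_C", "CONTIG_GENE.1"]
toy_ours = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

missing, fraction = missing_symbols(toy_reference, toy_ours)
print(f"{fraction:.0%} ({len(missing)}/{len(set(toy_reference))}) of reference symbols are not in our data")
```

The same set difference, applied to the real TRef and BRef symbol vectors, is what the counts above summarize.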

I think rather than trying to force our ensembls into symbols, maybe we want to take the opposite approach: Since you can provide your own reference for EPIC, it might be best to just live in ensembl land instead. This would entail:

- Update the TPM script to export both a version with gene symbols (as it currently does) and a version with ensembl ids
- Add a script to convert the EPIC gene symbols to ensembl ids and save this reference version as an RDS file which can be read in for EPIC inference
  - This might be better as a notebook since I'm betting there may be some edge cases where we'd have to make manual decisions. This would not be an "exploratory" notebook though since we'd want to run it as part of the analysis pipeline.
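The conversion step, including the edge cases that might need manual decisions, could look something like the sketch below. It is written in Python for illustration only. The function name, the input layout (a list of symbol/Ensembl-id pairs), and the toy ids are all assumptions, and writing the result to an RDS file is not shown.

```python
from collections import defaultdict


def convert_symbols(reference_symbols, symbol_ensembl_pairs):
    """Split reference symbols into uniquely mapped, ambiguous, and unmapped groups."""
    symbol_to_ids = defaultdict(set)
    for symbol, ensembl_id in symbol_ensembl_pairs:
        symbol_to_ids[symbol].add(ensembl_id)

    converted, ambiguous, unmapped = {}, {}, []
    for symbol in reference_symbols:
        ids = symbol_to_ids.get(symbol, set())
        if len(ids) == 1:
            # Exactly one Ensembl id: safe to convert automatically
            converted[symbol] = next(iter(ids))
        elif ids:
            # Several candidate ids: flag for a manual decision
            ambiguous[symbol] = sorted(ids)
        else:
            # No id found for this symbol in the mapping table
            unmapped.append(symbol)
    return converted, ambiguous, unmapped


# Toy mapping table (hypothetical placeholder ids, not real Ensembl ids)
toy_pairs = [
    ("GENE_A", "ENSG_TOY_0001"),
    ("GENE_B", "ENSG_TOY_0002"),
    ("GENE_B", "ENSG_TOY_0003"),
]
converted, ambiguous, unmapped = convert_symbols(["GENE_A", "GENE_B", "GENE_C"], toy_pairs)
print(converted, ambiguous, unmapped)
```

Keeping the ambiguous and unmapped symbols as separate outputs makes it easy to review them by hand in a notebook before saving the converted reference.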

jashapiro · 2025-01-23T13:44:01Z

It seems likely that the differences here are due to the difference in reference annotations. Looking at the paper, it seems they used hg19/GRCh37, which corresponds to Ensembl 75, whereas we use Ensembl 104, which is based on hg38/GRCh38. So I am not surprised by the difference in symbols, and most of them are likely the difference between contig-based ids for "unnamed" genes and Ensembl-based names. Unfortunately, my recollection is that most of those are untranslatable between references without going deep, as the Ensembl IDs have usually changed.

Overall, my recommendation would be to first identify the reference annotation with the best correspondence to the EPIC gene ids and start from there. biomaRt is your friend here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#using-archived-versions-of-ensembl
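One way to pick the annotation with the best correspondence is to score each candidate release by the share of EPIC symbols it contains. The sketch below does this in Python with toy data. The release labels and symbols are hypothetical, and fetching the actual symbol lists (for example from archived Ensembl versions via biomaRt) is assumed to happen beforehand.

```python
def rank_annotations(epic_symbols, annotations):
    """Rank candidate annotations by the fraction of EPIC symbols each one contains."""
    epic = set(epic_symbols)
    if not epic:
        raise ValueError("epic_symbols must not be empty")
    scores = {
        release: len(epic & set(symbols)) / len(epic)
        for release, symbols in annotations.items()
    }
    # Best-matching annotation first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy inputs for illustration only (hypothetical labels and symbols)
toy_epic = ["GENE_A", "GENE_B", "GENE_C", "OLD_NAME_1"]
toy_annotations = {
    "older_release": ["GENE_A", "GENE_B", "GENE_C", "OLD_NAME_1"],
    "newer_release": ["GENE_A", "GENE_B", "GENE_C", "NEW_NAME_1"],
}
ranking = rank_annotations(toy_epic, toy_annotations)
print(ranking)
```

The top-ranked release would then be the starting point for any symbol-to-id conversion.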

sjspielman · 2025-01-23T21:00:12Z

Noting that I wrote a notebook to accomplish this (hopefully not overly convoluted though, as I am sometimes wont to do....) in the branch sjspielman/125-check-genes. I am not going to file a PR immediately, since we'd like to not spend review cycles on this yet: we are going to be looking into building our own reference to use with EPIC instead (issue forthcoming).

Also, note to self that if I do file a PR here, I'll also need to update the TPM script to export a version with the original ensembl ids, not only converted to gene symbols.

sjspielman self-assigned this Jan 22, 2025

sjspielman mentioned this issue Jan 23, 2025

Add epic/quantiseq comparison notebook #122

Merged
