Check our gene names against EPIC's reference profile genes #125

Open
sjspielman opened this issue Jan 22, 2025 · 3 comments

@sjspielman
Member

This issue came out of discussion in #122.

In the epic-signature-genes.Rmd notebook (https://github.com/AlexsLemonade/scpca-paper-figures/blob/9e86f977e52a75cd139ea043fe70782880e15c9c/analysis/bulk-deconvolution/exploratory-notebooks/epic-signature-genes.Rmd), we check gene names in the EPIC signature against our gene names to ensure we have matching symbols.

However, we did not check the reference profile gene names, which contain the cell-type-specific information. There are 20-30K genes in those references, each with different variability such that more variable genes are more highly weighted. We should update this notebook to check how many of those gene names are mismatched with ours, and do what we can to match them as closely as possible. If there are lots of mismatches, it would be best to re-run EPIC with those changes and update the associated exploration notebooks (forthcoming issues, as needed).

@sjspielman sjspielman self-assigned this Jan 22, 2025
@sjspielman
Member Author

After some initial investigation, it seems we indeed have this issue, and quite seriously with BRef!

  • For TRef, 12% (2891/23686) of their gene symbols are not in our data.
  • For BRef, 39% (19582/49902) of their gene symbols are not in our data.
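
For illustration, the kind of check behind these percentages can be sketched as a set difference. This is a minimal Python sketch, not the notebook's actual code; the function name and the toy gene lists (including the made-up symbol `UNNAMED.1`) are assumptions for the example.

```python
def missing_symbol_summary(reference_symbols, our_symbols):
    """Count reference gene symbols that are absent from our data.

    Returns (n_missing, n_total, fraction_missing), counting unique symbols.
    """
    reference = set(reference_symbols)
    ours = set(our_symbols)
    missing = reference - ours
    if not reference:
        # Avoid dividing by zero for an empty reference
        return 0, 0, 0.0
    return len(missing), len(reference), len(missing) / len(reference)


# Toy inputs (hypothetical, not taken from TRef/BRef or our data)
tref_toy = ["CD3E", "CD19", "MS4A1", "UNNAMED.1"]
ours_toy = ["CD3E", "CD19", "MS4A1", "NKG7"]

n_missing, n_total, fraction = missing_symbol_summary(tref_toy, ours_toy)
print(f"{fraction:.0%} ({n_missing}/{n_total}) of reference symbols are not in our data")
```

The same call on the full TRef and BRef symbol vectors would give the counts reported above, assuming symbols are compared as exact strings.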

I think rather than trying to force our ensembls into symbols, maybe we want to take the opposite approach: Since you can provide your own reference for EPIC, it might be best to just live in ensembl land instead. This would entail:

  • Update the TPM script to export both a version with gene symbols (as it currently does) and a version with ensembl ids
  • Add a script to convert the EPIC gene symbols to ensembl ids and save this reference version as an RDS file which can be read in for EPIC inference
    • This might be better as a notebook since I'm betting there may be some edge cases where we'd have to make manual decisions. This would not be an "exploratory" notebook though since we'd want to run it as part of the analysis pipeline.
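
To make the "edge cases" concrete, here is a rough Python sketch of the conversion step, assuming we already have a table of (gene symbol, ensembl id) pairs from some annotation. The function name, the placeholder ids, and the three-way split are illustrative assumptions, not the planned implementation; the real version would also need to write the converted reference out as an RDS file.

```python
from collections import defaultdict


def convert_symbols(reference_symbols, symbol_ensembl_pairs):
    """Map reference gene symbols to ensembl ids.

    Returns three collections:
      converted: symbol -> single ensembl id (unambiguous)
      ambiguous: symbol -> sorted list of candidate ids (needs a manual decision)
      unmapped:  symbols with no ensembl id in the annotation
    """
    symbol_to_ids = defaultdict(set)
    for symbol, ensembl_id in symbol_ensembl_pairs:
        symbol_to_ids[symbol].add(ensembl_id)

    converted, ambiguous, unmapped = {}, {}, []
    for symbol in reference_symbols:
        ids = symbol_to_ids.get(symbol, set())
        if len(ids) == 1:
            converted[symbol] = next(iter(ids))
        elif len(ids) > 1:
            ambiguous[symbol] = sorted(ids)
        else:
            unmapped.append(symbol)
    return converted, ambiguous, unmapped


# Toy annotation with placeholder ids (hypothetical, not real Ensembl ids)
pairs_toy = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_B", "ENSG_PLACEHOLDER_2"),
    ("GENE_B", "ENSG_PLACEHOLDER_3"),  # one symbol, two ids
]
converted, ambiguous, unmapped = convert_symbols(
    ["GENE_A", "GENE_B", "GENE_C"], pairs_toy
)
print(converted, ambiguous, unmapped)
```

Keeping the ambiguous and unmapped symbols in separate outputs is what would let a notebook surface them for review rather than silently dropping them.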

@jashapiro
Member

It seems likely that the differences here are due to the difference in references. Looking at the paper, it seems like they used hg19/GRCh37, which corresponds to Ensembl 75, whereas we use Ensembl 104, which is based around hg38/GRCh38. So I am not surprised by the difference in symbols, and most of them are likely the difference between contig-based ids for "unnamed" genes and Ens-based names. Unfortunately, my past memory is that most of those are untranslatable between references without going deep, as the Ensembl IDs have usually changed.

Overall, my recommendation would be to first identify the reference annotation with the best correspondence to the EPIC gene ids and start from there. biomaRt is your friend here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#using-archived-versions-of-ensembl
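
As a sketch of what "best correspondence" could mean in practice: given the gene symbols from several candidate annotation releases, score each by the fraction of EPIC symbols it contains and rank them. This is Python for illustration only; the function name, release labels, and toy symbol sets are assumptions, and in a real run the per-release symbol sets would come from queries against archived Ensembl versions (e.g. via biomaRt).

```python
def rank_releases(epic_symbols, symbols_by_release):
    """Rank annotation releases by the fraction of EPIC symbols they contain.

    Returns a list of (release, fraction_found) sorted from best to worst.
    """
    epic = set(epic_symbols)
    if not epic:
        return []
    scores = {
        release: len(epic & set(symbols)) / len(epic)
        for release, symbols in symbols_by_release.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy inputs (hypothetical labels and symbols, not real annotation content)
epic_toy = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
releases_toy = {
    "release_x": ["GENE_A", "GENE_B", "GENE_Y"],
    "release_y": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_Z"],
}

ranked = rank_releases(epic_toy, releases_toy)
best_release, best_fraction = ranked[0]
print(best_release, best_fraction)
```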

@sjspielman
Member Author

Noting that I wrote a notebook to accomplish this (hopefully not overly convoluted, though, as I am sometimes wont to do....) in the branch sjspielman/125-check-genes. I am not going to file a PR immediately, since we'd rather not spend review cycles on this yet: we are going to look into building our own reference to use with EPIC instead (issue forthcoming).

Also, note to self: if I do file a PR here, I'll also need to update the TPM script to export a version with the original ensembl ids, not only the version converted to gene symbols.
