Check our gene names against EPIC's reference profile genes #125
Comments
After some initial investigation, it seems we indeed have this issue, and quite seriously with
I think rather than trying to force our Ensembl IDs into symbols, maybe we want to take the opposite approach: since you can provide your own reference for
It seems likely that the differences here come from the difference in references. Looking at the paper, it seems they used hg19/GRCh37, which corresponds to Ensembl 75, whereas we use Ensembl 104, which is based on hg38/GRCh38. So I am not surprised by the difference in symbols; most of them are likely the difference between contig-based IDs for "unnamed" genes and Ens-based names. Unfortunately, my recollection is that most of those are untranslatable between references without going deep, as the Ensembl IDs have usually changed. Overall, my recommendation would be to first identify the reference annotation with the best correspondence to the EPIC gene IDs and start from there.
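As a rough sketch of that recommendation (not anything from the EPIC package or our notebooks), one could score each candidate annotation by the fraction of EPIC gene IDs it contains and start from the top-ranked one. The function name `rank_annotations`, the release labels, and the `GENE_*` identifiers are all hypothetical placeholders; in practice the ID lists would be read from the actual annotation files.

```python
def rank_annotations(target_ids, annotations):
    """Rank candidate annotations by the fraction of target IDs they contain.

    target_ids: iterable of gene IDs we want to match (e.g. EPIC's).
    annotations: dict mapping an annotation label to its gene IDs.
    Returns a list of (label, fraction) pairs, best match first.
    """
    target = set(target_ids)
    if not target:
        raise ValueError("target_ids is empty")
    scores = {
        name: len(target & set(ids)) / len(target)
        for name, ids in annotations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy placeholder data; these are NOT real gene IDs or real releases.
epic_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
candidates = {
    "older_release": ["GENE_A", "GENE_B", "GENE_C", "GENE_X"],
    "newer_release": ["GENE_A", "GENE_B", "GENE_Y", "GENE_Z"],
}

ranking = rank_annotations(epic_ids, candidates)
best_name, best_fraction = ranking[0]
print(ranking)
```

Scoring on exact string matches is deliberately simple; it says nothing about whether IDs that differ between releases refer to the same gene.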
Noting that I wrote a notebook to accomplish this (hopefully not overly convoluted though as I am sometimes wont to do....) in the branch
Also, note to self that if I do file a PR here, I'll also need to update the TPM script to export a version with the original Ensembl IDs, not only converted to gene symbols.
This issue came out of discussion in #122.
In the `epic-signature-genes.Rmd` notebook (https://github.com/AlexsLemonade/scpca-paper-figures/blob/9e86f977e52a75cd139ea043fe70782880e15c9c/analysis/bulk-deconvolution/exploratory-notebooks/epic-signature-genes.Rmd), we check gene names in the EPIC signature against our gene names to ensure we have matching symbols. However, we did not check the reference profile gene names, which contain the cell-type-specific information. There are 20-30K genes in those references, each with different variability such that more variable genes are more highly weighted. We should update this notebook to check how many of those gene names are mismatched with ours, and do what we can to make things match as much as possible. If there are lots of mismatches, it would be best to re-run EPIC with those changes and update associated exploration notebooks (forthcoming issues, as needed).
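The check described above could look roughly like the following. This is a minimal sketch, assuming both sets of gene names are already available as plain lists of strings; the helper `compare_gene_names` is hypothetical, and `ENSG_PLACEHOLDER` and `CONTIG_PLACEHOLDER.1` are made-up stand-ins for an unconverted Ensembl ID and a contig-style name. The notebook itself is in R, so this only illustrates the logic.

```python
def compare_gene_names(our_genes, reference_genes):
    """Summarize how many reference profile gene names appear in our data."""
    ours = set(our_genes)
    ref = set(reference_genes)
    shared = ours & ref
    return {
        "n_reference": len(ref),
        "n_shared": len(shared),
        # Fraction of reference genes we can match by exact name
        "fraction_matched": len(shared) / len(ref) if ref else 0.0,
        # Reference genes with no exact match in our gene names
        "missing_from_ours": sorted(ref - ours),
    }


# Toy data; the two *_PLACEHOLDER entries are invented for illustration.
our_genes = ["CD3E", "CD19", "MS4A1", "ENSG_PLACEHOLDER"]
reference_genes = ["CD3E", "CD19", "MS4A1", "CONTIG_PLACEHOLDER.1"]

report = compare_gene_names(our_genes, reference_genes)
print(report)
```

Listing the unmatched names, rather than only counting them, makes it easier to see whether the mismatches are mostly contig-style names or something more systematic.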