ID specification #10
Comments
Thank you for reaching out with your question about using ENSG IDs with the TWAS-GSEA script. The script has to match the gene IDs in your TWAS results to those in your .gmt file. By default it assumes gene symbols in the TWAS results and Entrez IDs in the .gmt file, and it converts the IDs so that they match each other. This is why the script worked once you converted your TWAS to gene symbols, though as you note, this can lead to a loss of genes. You can change this behaviour using the I would guess there were two issues with your current setup:
Please let me know how you get on and I will try to address the issue quickly.
Thank you for the prompt reply! We actually had issue #2, and your suggestion to convert the .gmt file to ENSG IDs was very useful. There is no perfect solution, as some pathways were significantly affected and a few were lost completely during the ID conversion, but this approach was still the most effective. Compared with other options, such as converting the weight files to gene symbols or Entrez IDs, it resulted in the least information loss.
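The .gmt conversion described above can be sketched roughly as follows. This is a minimal Python illustration and not part of TWAS-GSEA: the function name `convert_gmt_line`, the mapping dictionary and the example IDs are all hypothetical, and it assumes the usual tab-separated .gmt layout (set name, description, then gene IDs). It also counts how many genes each set keeps, which may help to spot pathways that are badly affected or lost.

```python
def convert_gmt_line(line, entrez_to_ensg):
    """Convert one .gmt line from Entrez IDs to ENSG IDs.

    Returns (converted_line, n_original, n_kept).
    """
    fields = line.rstrip("\n").split("\t")
    name, description, genes = fields[0], fields[1], fields[2:]
    # Keep only genes that have a mapping; drop duplicates but keep order.
    converted = []
    for gene in genes:
        ensg = entrez_to_ensg.get(gene)
        if ensg is not None and ensg not in converted:
            converted.append(ensg)
    new_line = "\t".join([name, description] + converted)
    return new_line, len(genes), len(converted)


# Hypothetical mapping and gene set, for illustration only.
mapping = {"1": "ENSG00000000001", "2": "ENSG00000000002"}
example = "SET_A\tdescription\t1\t2\t3\n"

new_line, n_before, n_after = convert_gmt_line(example, mapping)
print(new_line)
print(f"kept {n_after} of {n_before} genes")
```

Sets whose kept count falls below the minimum size used in the analysis would then drop out, which matches the pathway loss mentioned above.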
Received via email
I'm working with TWAS-GSEA and encountering some challenges with gene identifier handling (specifically ENSG IDs).
For example, I have a TWAS results file created using the GTExv8 multi-tissue expression files (like the one available here: https://s3.us-west-1.amazonaws.com/gtex.v8.fusion/ALL/GTExv8.ALL.Brain_Substantia_nigra.tar.gz). This file uses ENSG IDs instead of gene symbols. I used the same files to create an appropriate reference expression txt.gz file with the FeaturePred script. However, when I run the TWAS-GSEA script, I do not get any results, only log messages reporting 0 features with Entrez IDs and that 0 gene sets have a sufficient number of genes available in the TWAS (please see the log file). I tried it with and without the --use_alt_id parameter, but the outcome is the same.
As a potential solution, I replaced the ENSG IDs in the GTExv8 weight files with gene symbols and obtained new TWAS results and a new reference expression file. However, many genes (~15%) do not have an assigned gene symbol, so I had to remove them from the weight files. This time the TWAS-GSEA script completed the run successfully, suggesting that the problem was indeed due to the ENSG identifiers. However, since a significant portion of the genes was removed from the weight files during preparation, I'm worried that this might significantly affect the results (~15% of the weight files were removed and, consequently, we lost 31 of 237 entries in the TWAS results file when using the processed weight files).
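The loss described above can be quantified with a short check. The sketch below is only an illustration: `mapping_loss`, the ID list and the ENSG-to-symbol dictionary are hypothetical placeholders, not TWAS-GSEA functions or real annotations. The last line simply recomputes the fraction from the figures quoted above.

```python
def mapping_loss(ids, ensg_to_symbol):
    """Return (n_lost, fraction_lost) for IDs that have no gene symbol."""
    if not ids:
        return 0, 0.0
    # An ID counts as lost if it is missing from the mapping or maps to "".
    lost = [i for i in ids if not ensg_to_symbol.get(i)]
    return len(lost), len(lost) / len(ids)


# Hypothetical IDs and mapping, for illustration only.
twas_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
symbols = {"ENSG_A": "GENE1", "ENSG_B": "GENE2", "ENSG_C": "GENE3"}

n_lost, fraction = mapping_loss(twas_ids, symbols)
print(f"{n_lost} of {len(twas_ids)} IDs lost ({fraction:.0%})")

# Figures quoted above: 31 of 237 TWAS entries lost.
reported = f"{31 / 237:.1%}"
print(reported)
```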
So, I have a couple of questions regarding this problem and I would be grateful if you could find some time to answer:
Can the files containing ENSG identifiers be used with TWAS-GSEA script? If yes, should they be in the format "ENSGXXXXX" or "ENSGXXXXX.version"?
If not, is there any reliable way to transform the weight files which contain ENSG identifiers so they can be used with the TWAS-GSEA script, without losing any information?
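To make the format question concrete: "ENSGXXXXX" and "ENSGXXXXX.version" do not match as plain strings, so any ID matching has to use one form consistently. The Python sketch below shows one way to normalise both sides before comparing them. It does not establish which form TWAS-GSEA expects; `strip_version`, `overlap` and the example IDs are hypothetical.

```python
def strip_version(ensg_id):
    """Drop a trailing '.version' suffix, e.g. 'ENSG00000123456.7'."""
    return ensg_id.split(".", 1)[0]


def overlap(twas_ids, gene_set_ids):
    """Return the IDs shared by both collections after removing versions."""
    twas = {strip_version(i) for i in twas_ids}
    gene_set = {strip_version(i) for i in gene_set_ids}
    return twas & gene_set


# Hypothetical IDs, for illustration only.
twas_ids = ["ENSG00000000001.5", "ENSG00000000002.11"]
gmt_ids = ["ENSG00000000001", "ENSG00000000003"]

raw_overlap = set(twas_ids) & set(gmt_ids)  # empty: versions block matches
clean_overlap = overlap(twas_ids, gmt_ids)
print(len(raw_overlap), sorted(clean_overlap))
```

An empty overlap before normalisation is one possible reason for a log that reports 0 usable features, though other causes are equally plausible.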
TWAS-GSEA_SLURM_JOB_LOG.log
TWAS-GSEA_SLURM_JOBS.txt
TWAS-GSEA_LOG.log