ID specification #10

Open
opain opened this issue Oct 24, 2024 · 2 comments

Comments

@opain
Owner

opain commented Oct 24, 2024

Received via email

I'm working with TWAS-GSEA and encountering some challenges with gene identifier handling (specifically ENSG IDs).

For example, I have a TWAS results file created using the GTExv8 multi-tissue expression files (like the one available here: https://s3.us-west-1.amazonaws.com/gtex.v8.fusion/ALL/GTExv8.ALL.Brain_Substantia_nigra.tar.gz). This file uses ENSG IDs instead of gene symbols. I used the same files to create an appropriate reference expression txt.gz file with the FeaturePred script. However, when I run the TWAS-GSEA script, I get no results, only log messages reporting that 0 features have entrez IDs and that 0 gene sets have a sufficient number of genes available in the TWAS (please see the log file). I tried it with and without the --use_alt_id parameter, but the outcome is the same.

As a potential solution, I replaced the ENSG IDs in the GTExv8 weight files with gene symbols and generated new TWAS results and a new reference expression file. However, many genes (~15%) have no assigned gene symbol, so I had to remove them from the weight files. This time the TWAS-GSEA script completed successfully, suggesting that the problem was indeed due to the ENSG identifiers. However, since a substantial portion of the genes was removed during preparation, I'm worried this might significantly affect the results (~15% of the weight files were removed and, consequently, we lost 31 of 237 entries in the TWAS results file when using the processed weight files).

So, I have a couple of questions regarding this problem and I would be grateful if you could find some time to answer:

  1. Can files containing ENSG identifiers be used with the TWAS-GSEA script? If yes, should they be in the format "ENSGXXXXX" or "ENSGXXXXX.version"?

  2. If not, is there any reliable way to transform weight files that contain ENSG identifiers so they can be used with the TWAS-GSEA script without losing any information?

TWAS-GSEA_SLURM_JOB_LOG.log
TWAS-GSEA_SLURM_JOBS.txt
TWAS-GSEA_LOG.log

@opain
Owner Author

opain commented Oct 24, 2024

Thank you for reaching out with your question about using ENSG IDs with the TWAS-GSEA script.

The script has to match the gene IDs in your TWAS to those in your .gmt file. By default it assumes gene symbols in the TWAS and entrez IDs in the .gmt file, and it converts the IDs so that they match each other. This is why the script worked once you converted your TWAS to gene symbols, though as you note, this can lead to a loss of genes.

You can change this behaviour using the --use_alt_id parameter, specifying the name of the column in the TWAS sumstats you would like to use when matching to the .gmt file. For example, if the ENSG IDs are in the column called 'ENSG', then you would write --use_alt_id ENSG. However, when using the --use_alt_id parameter, the script does not convert the IDs to match between the TWAS and the .gmt file, so the IDs in the TWAS and the .gmt must already match. For example, if you want to use ENSG IDs, they must be present in the TWAS and in the .gmt file.
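Since the IDs have to match already when `--use_alt_id` is used, it can help to check the overlap before running the script. Below is a minimal Python sketch of such a check, not part of TWAS-GSEA itself. It assumes the TWAS results are tab-delimited with a header row and that the .gmt file follows the usual layout (set name, description, then gene IDs, all tab-separated). The helper names `read_gmt` and `id_overlap` and the toy data are hypothetical.

```python
import csv
import io


def read_gmt(handle):
    """Parse GMT text: name <tab> description <tab> gene1 <tab> gene2 ..."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


def id_overlap(twas_handle, id_column, gene_sets):
    """Return the TWAS IDs in `id_column` that also appear in any gene set."""
    reader = csv.DictReader(twas_handle, delimiter="\t")
    twas_ids = {row[id_column] for row in reader if row.get(id_column)}
    gmt_ids = {g for genes in gene_sets.values() for g in genes}
    return twas_ids & gmt_ids


# Toy data (made up): ENSG IDs in the TWAS, entrez-style IDs in the .gmt.
twas_text = "ID\tTWAS.Z\nENSG00000000001.1\t2.1\nENSG00000000002.3\t-0.4\n"
gmt_text = "SET_A\tna\t101\t102\nSET_B\tna\t103\n"

gene_sets = read_gmt(io.StringIO(gmt_text))
shared = id_overlap(io.StringIO(twas_text), "ID", gene_sets)
print(len(shared))  # 0: no ID is shared, so no gene set could be tested
```

With real files, replace the `io.StringIO` objects with opened file handles. An empty overlap means the two files use different ID types or formats.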

I would guess there were two issues with your current setup:

  1. If the log files are still talking about entrez IDs even when you have used --use_alt_id, I guess you left the parameter blank rather than specifying a column name. This would make the script revert to the default behaviour of assuming gene symbols in the ID column.
  2. From the logs it appears your .gmt file contains entrez IDs, whilst your TWAS contains ENSG IDs. This will not work. You will need to either convert the TWAS IDs to gene symbols and use the script's default behaviour, or convert the .gmt file to ENSG IDs and use the --use_alt_id parameter. If the ENSG IDs are in the 'ID' column of your TWAS file, then write `--use_alt_id ID`.
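
As a rough illustration of the second option, here is a minimal Python sketch that rewrites one .gmt line from entrez IDs to ENSG IDs. It is only a sketch under stated assumptions: the `entrez_to_ensg` dictionary is a hypothetical mapping you would build from an annotation source of your choice, the function names are made up, and genes without a mapping are simply dropped. The `strip_version` helper is only relevant if one of your files carries a version suffix and the other does not, which is worth checking in your own data.

```python
def strip_version(ensg_id):
    """Drop a trailing version suffix, e.g. 'ENSG00000000001.5' -> 'ENSG00000000001'."""
    return ensg_id.split(".", 1)[0]


def convert_gmt_line(line, entrez_to_ensg):
    """Convert one GMT line; return (new_line, n_original, n_kept)."""
    fields = line.rstrip("\n").split("\t")
    name, description, genes = fields[0], fields[1], fields[2:]
    converted = []
    for gene in genes:
        ensg = entrez_to_ensg.get(gene)
        # Drop unmapped genes and avoid duplicates within a set.
        if ensg is not None and ensg not in converted:
            converted.append(ensg)
    new_line = "\t".join([name, description] + converted)
    return new_line, len(genes), len(converted)


# Toy mapping and gene set (made-up IDs); entrez ID 103 has no ENSG match.
mapping = {"101": "ENSG00000000001", "102": "ENSG00000000002"}
line = "SET_A\tna\t101\t102\t103"

new_line, n_before, n_after = convert_gmt_line(line, mapping)
print(new_line)
print(f"kept {n_after} of {n_before} genes")
```

Tracking the before and after counts for each gene set shows which pathways shrink most during conversion.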

Please let me know how you get on and I will try to address the issue quickly.

@shulibbm

Thank you for the prompt reply! Our problem was indeed the second issue you described, and your suggestion to convert the .gmt file to ENSG IDs was very useful. There is no perfect solution: some pathways were significantly affected and a few were lost completely during the ID conversion. Still, this approach was the most effective. Compared to the other options, such as converting the weight files to gene symbols or entrez IDs, it resulted in the least information loss.
