Filtering by gene_id returns partial results #93
Comments
I seem to be running into the same (or a similar) issue. I was looking in the Elixir viewer for eQTLs for the FDX2 gene (https://elixir.ut.ee/eqtl/?gene_name=FDX2), focusing on entries with quantification method "ge". There are nine of them, all with p < 1e-5 and all from dataset "QTD000236". I wanted to download the table, so I used the eQTL Catalogue API tutorial (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tutorials/API_v2/eQTL_API_tutorial.md) to try to get the results for the same query. However, when I run the following query built by the tutorial code, it returns only three eQTLs, not nine: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&gene_id=ENSG00000267673&nlog10p=5. If, instead of searching by gene ID and p-value, I search for one of the missing eQTLs directly by rsID, I can find it: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&rsid=rs281425&nlog10p=5. It seems that only a subset of the relevant eQTLs is returned when searching by gene_id and nlog10p. Thanks for any help or insight you can provide!
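For anyone reproducing the comparison above, here is a minimal sketch of how the two query URLs can be assembled with the Python standard library. The helper name `build_associations_url` is hypothetical (it is not part of the tutorial code); the base URL and the parameter names are copied from the URLs quoted in this comment.

```python
from urllib.parse import urlencode

# Base URL copied from the queries quoted in the comment above.
API_BASE = "https://www.ebi.ac.uk/eqtl/api/v2"


def build_associations_url(dataset_id, size=1000, start=0, **filters):
    """Hypothetical helper: build an associations query URL for one dataset.

    Extra keyword arguments (e.g. gene_id, rsid, nlog10p) are appended
    as query parameters in the order they are given.
    """
    params = {"size": size, "start": start, "dataset_id": dataset_id, **filters}
    return f"{API_BASE}/datasets/{dataset_id}/associations?{urlencode(params)}"


# Query by gene ID (reported to return three of the nine expected eQTLs).
by_gene = build_associations_url(
    "QTD000236", gene_id="ENSG00000267673", nlog10p=5
)

# Query by rsID (reported to find one of the missing eQTLs).
by_rsid = build_associations_url("QTD000236", rsid="rs281425", nlog10p=5)

print(by_gene)
print(by_rsid)
```

Either URL can then be passed to any HTTP client; comparing the variants returned by the two queries shows which associations the gene_id filter misses.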
I have verified that the API query, when filtering by gene ID, sets the coordinate range from 107,418,806 to 109,418,806, a 2 Mb span around our "anchor" position of 108,418,806. However, there are no variants matching this gene ID between 107,418,806 and 108,418,806. That is why the API appears to return only about 1 Mb of data: the variants for that gene exist only in the second half of the window (108,418,806–109,418,806). As an example, the following API call:
shows only a few records at positions 108,387,460 and 108,418,806, none of which are for the queried gene (except the one at 108,418,806). So it is not that we queried only half of the intended region; the upstream portion of that region (107,418,806–108,418,806) simply does not contain any qualifying variants for the specified gene ID. However, currently, the logic uses the "first record" in eQTL-SumStats/sumstats/api_v2/services/qtl_data.py Lines 103 to 114 in 80a006a
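The window behaviour described in the comment above can be illustrated with a small sketch. It is not the code in `qtl_data.py`. It only contrasts a 2 Mb window anchored on the first stored record with one anchored on the gene start. The gene start (109,397,918) and the first-record position (108,418,806) come from this thread; every other position, and the function names, are made up for illustration.

```python
HALF_WINDOW = 1_000_000  # +/-1 Mb, the cis window size discussed in this thread


def window_around(anchor, half_width=HALF_WINDOW):
    """Return the (start, end) coordinates of a window centred on `anchor`."""
    return anchor - half_width, anchor + half_width


def filter_positions(positions, window):
    """Keep only the positions that fall inside the inclusive window."""
    start, end = window
    return [p for p in positions if start <= p <= end]


gene_start = 109_397_918  # gene start reported in the issue description

# Hypothetical positions of variants tested for the gene; only the first
# one (108,418,806) is taken from the thread.
gene_positions = [
    108_418_806,
    108_900_000,
    109_397_918,
    109_418_440,
    109_900_000,
    110_397_918,
]

# Behaviour as described above: the window is centred on the first record.
first_record = min(gene_positions)
anchored_on_first = filter_positions(gene_positions, window_around(first_record))

# Behaviour users expect: the window is centred on the gene start.
anchored_on_gene = filter_positions(gene_positions, window_around(gene_start))

print(window_around(first_record))  # (107418806, 109418806)
print(len(anchored_on_first), "of", len(gene_positions), "variants returned")
print(len(anchored_on_gene), "of", len(gene_positions), "variants returned")
```

With these illustrative positions, anchoring on the first record drops every variant more than 1 Mb downstream of it, while anchoring on the gene start keeps all of them.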
Thanks for looking into this! Yes, that seems to be the problem. Ideally, the Do we know if what is returned is
Thank you for the responses! This makes it clear that my main problem was a misunderstanding of how the API behaves for queries based on gene_id. I was expecting that searching by gene_id would return any eQTLs or sQTLs that affect that gene, rather than only those within a 2 Mb window. I agree that @kauralasoo's proposed changes would bring the API behavior more in line with what I expected. However, for genes that are longer than 2 Mb, could some associations still be missed?
The eQTL Catalogue QTL testing workflow (https://github.com/eQTL-Catalogue/qtlmap) only considers variants within a +/-1 Mb window of the TSS, so no additional associations would be missed. For extremely long genes we would still report only the associations from the TSS window.
The eQTL Catalogue API should store summary statistics in a +/-1 Mb window around the gene start position. However, requesting summary statistics via the API and filtering by gene_id returns approximately the first half of the results (i.e. ~1 Mb up to the gene start but almost nothing after it). Here is one example query:
https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000266/associations?size=1000&start=0&dataset_id=QTD000266&gene_id=ENSG00000134243
The gene start position for the ENSG00000134243 gene is 109397918.
When exploring the results returned by the API we observe the following:
109397918 - min(assoc$position) = 979112 #This is approximately 1 Mb as expected
max(assoc$position) - 109397918 = 20522 #This is much smaller than 1Mb
Specifically, 979112 + 20522 = 999634 is very close to 1 Mb, which suggests that when the results are filtered by gene_id, the API returns results only from the 1 Mb window starting at the first stored result, whereas it should return everything (the whole 2 Mb window).
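The same check can be written as a small Python sketch. The function name `window_coverage` is hypothetical. The smallest and largest positions below are reconstructed from the differences reported above (gene start minus 979112, gene start plus 20522), and the middle position is a placeholder.

```python
def window_coverage(positions, gene_start):
    """Return (upstream, downstream, total) spans covered by `positions`.

    upstream   = distance from the smallest position to the gene start
    downstream = distance from the gene start to the largest position
    """
    upstream = gene_start - min(positions)
    downstream = max(positions) - gene_start
    return upstream, downstream, upstream + downstream


gene_start = 109_397_918  # gene start of ENSG00000134243, as stated above

# First and last values are reconstructed from the reported differences;
# the middle value is a hypothetical placeholder.
positions = [108_418_806, 109_000_000, 109_418_440]

upstream, downstream, total = window_coverage(positions, gene_start)
print(upstream, downstream, total)
```

A total close to 1,000,000 rather than 2,000,000 is the signature of the truncated window described in this issue.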