Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering by gene_id returns partial results #93

Open
kauralasoo opened this issue Dec 16, 2024 · 5 comments
Open

Filtering by gene_id returns partial results #93

kauralasoo opened this issue Dec 16, 2024 · 5 comments
Assignees

Comments

@kauralasoo
Copy link
Member

The eQTL Catalogue API should store summary statistics in a +/- 1Mb around the gene start position. However, requesting summary statistics via API and filtering by gene_id returns approximately first half of the results (i.e. ~1Mb until the gene start but nothing after the gene start). Here is one example query:

https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000266/associations?size=1000&start=0&dataset_id=QTD000266&gene_id=ENSG00000134243

The gene start position for the ENSG00000134243 gene is 109397918.

When exploring the results returned by the API we observe the following:
109397918 - min(assoc$position) = 979112 #This is approximately 1 Mb as expected
max(assoc$position) - 109397918 = 20522 #This is much smaller than 1Mb

Specifically 979112 + 20522 = 999634 is very close to 1Mb, which suggests that if the results are filtered by gene_id, then the API returns results only from the 1Mb windows starting from the first stored result, but it should return everything (the whole 2Mb window).

@tdurham86
Copy link

It seems that I am potentially running into the same (or a similar) issue. I was looking in the Elixir viewer for eQTLs for the FDX2 gene (https://elixir.ut.ee/eqtl/?gene_name=FDX2), focusing on entries with quantification method "ge". There are nine of them, all with p < 1e-5 and all from dataset "QTD000236".

I wanted to download the table, so I used the eQTL Catalogue API tutorial (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tutorials/API_v2/eQTL_API_tutorial.md) to try to get the results for the same query. However, when I run the following query built by the tutorial code, it only returns three eQTLs, not nine: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&gene_id=ENSG00000267673&nlog10p=5.

If instead of searching by gene ID and p-value, I search for one of the missing eQTLs directly using RS ID I can find it: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&rsid=rs281425&nlog10p=5

It seems like only a subset of the relevant eQTLs are returned when searching by gene_id and nlog10p. Thanks for any help or insight you can provide!

@karatugo
Copy link
Member

karatugo commented Jan 9, 2025

I have verified that the API query, when filtering by gene ID, sets the coordinate range from 107,418,806 to 109,418,806—a 2 Mb span around our “anchor” position of 108,418,806. However, there are no variants matching this gene ID between 107,418,806 and 108,418,806. That’s why it appears the API is returning only about 1 Mb of data: the second half of the window (108,418,806–109,418,806) is where the variants for that gene actually exist.

As an example, the following API call:

https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000266/associations
?size=1000
&start=2000
&dataset_id=QTD000266
&pos=1:107418806-108418806

shows only a few records at positions 108,387,460 and 108,418,806, none of which are for the queried gene (except the one at 108,418,806). Therefore, it’s not that we only queried half of the intended region—it’s simply that the upstream portion of that region (107,418,806–108,418,806) doesn’t contain any qualifying variants for the specified gene ID.

However, currently, the logic uses the "first record" in region_to_search to define the +/- 1 Mb window. This means queries by gene ID may end up centered around whichever record appears first. @kauralasoo Let me know if you’d like this "first record" logic revised. Otherwise, the API is working as intended.

def _resolve_genomic_region(self, context_filters, key, distance):
region_to_search = self.select(filters=context_filters, key=key)
if len(region_to_search) > 0:
first_record = region_to_search[0]
chromosome, position = (first_record['chromosome'],
first_record['position'])
self.filters.chromosome = chromosome
self.filters.position_start = position - distance if position > distance else 0
self.filters.position_end = position + distance
else:
raise ValueError(("Could not find resource with the following "
f"filters {self.filters.dict(exclude_none=True)}"))

@kauralasoo
Copy link
Member Author

Thanks for looking into this! Yes, that seems to be the problem. Ideally, the region_to_search should be defined based on gene TSS (we call this variable phenotype_pos in other parts of our workflows), but I'm not sure if this information is currently stored in the backend. However, if the records are sorted by genomic position then it would also be fine to define the region_to_search as the position of the first_record and first_record + 2Mb.

Do we know if what is returned is first_record is always the one that has the smalles possible genomic position for that gene/molecular trait?

@tdurham86
Copy link

Thank you for the responses! This makes it clear that my main problem was a misunderstanding of how the API behaves for queries based on gene_id. I was expecting that searching by gene_id would return any eQTLs or sQTLs that affect that gene, and not that the query would be only those within a 2 MB window. I agree that @kauralasoo's proposed changes would make the API behavior more in line with what I expected it to do. However, for genes that are >2 Mb long, would there still potentially be some associations that are missed?

@kauralasoo
Copy link
Member Author

The eQTL Catalogue QTL testing workflow (https://github.com/eQTL-Catalogue/qtlmap) only considers variants within +/-1Mb window from the TSS, so there would not be any additional missing associations. For extremely long genes we would still only report associations from the TSS window.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants