Filtering by gene_id returns partial results #93

kauralasoo · 2024-12-16T14:47:05Z

The eQTL Catalogue API should store summary statistics in a +/- 1Mb around the gene start position. However, requesting summary statistics via API and filtering by gene_id returns approximately first half of the results (i.e. ~1Mb until the gene start but nothing after the gene start). Here is one example query:

https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000266/associations?size=1000&start=0&dataset_id=QTD000266&gene_id=ENSG00000134243

The gene start position for the ENSG00000134243 gene is 109397918.

When exploring the results returned by the API we observe the following:
109397918 - min(assoc$position) = 979112 #This is approximately 1 Mb as expected
max(assoc$position) - 109397918 = 20522 #This is much smaller than 1Mb

Specifically 979112 + 20522 = 999634 is very close to 1Mb, which suggests that if the results are filtered by gene_id, then the API returns results only from the 1Mb windows starting from the first stored result, but it should return everything (the whole 2Mb window).

The text was updated successfully, but these errors were encountered:

tdurham86 · 2024-12-16T19:42:15Z

It seems that I am potentially running into the same (or a similar) issue. I was looking in the Elixir viewer for eQTLs for the FDX2 gene (https://elixir.ut.ee/eqtl/?gene_name=FDX2), focusing on entries with quantification method "ge". There are nine of them, all with p < 1e-5 and all from dataset "QTD000236".

I wanted to download the table, so I used the eQTL Catalogue API tutorial (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tutorials/API_v2/eQTL_API_tutorial.md) to try to get the results for the same query. However, when I run the following query built by the tutorial code, it only returns three eQTLs, not nine: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&gene_id=ENSG00000267673&nlog10p=5.

If instead of searching by gene ID and p-value, I search for one of the missing eQTLs directly using RS ID I can find it: https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000236/associations?size=1000&start=0&dataset_id=QTD000236&rsid=rs281425&nlog10p=5

It seems like only a subset of the relevant eQTLs are returned when searching by gene_id and nlog10p. Thanks for any help or insight you can provide!

karatugo · 2025-01-09T16:53:31Z

I have verified that the API query, when filtering by gene ID, sets the coordinate range from 107,418,806 to 109,418,806—a 2 Mb span around our “anchor” position of 108,418,806. However, there are no variants matching this gene ID between 107,418,806 and 108,418,806. That’s why it appears the API is returning only about 1 Mb of data: the second half of the window (108,418,806–109,418,806) is where the variants for that gene actually exist.

As an example, the following API call:

https://www.ebi.ac.uk/eqtl/api/v2/datasets/QTD000266/associations
?size=1000
&start=2000
&dataset_id=QTD000266
&pos=1:107418806-108418806

shows only a few records at positions 108,387,460 and 108,418,806, none of which are for the queried gene (except the one at 108,418,806). Therefore, it’s not that we only queried half of the intended region—it’s simply that the upstream portion of that region (107,418,806–108,418,806) doesn’t contain any qualifying variants for the specified gene ID.

However, currently, the logic uses the "first record" in region_to_search to define the +/- 1 Mb window. This means queries by gene ID may end up centered around whichever record appears first. @kauralasoo Let me know if you’d like this "first record" logic revised. Otherwise, the API is working as intended.

eQTL-SumStats/sumstats/api_v2/services/qtl_data.py

Lines 103 to 114 in 80a006a

    
           def _resolve_genomic_region(self, context_filters, key, distance): 
        
               region_to_search = self.select(filters=context_filters, key=key) 
        
               if len(region_to_search) > 0: 
        
                   first_record = region_to_search[0] 
        
                   chromosome, position = (first_record['chromosome'], 
        
                                           first_record['position']) 
        
                   self.filters.chromosome = chromosome 
        
                   self.filters.position_start = position - distance if position > distance else 0 
        
                   self.filters.position_end = position + distance 
        
               else: 
        
                   raise ValueError(("Could not find resource with the following " 
        
                                     f"filters {self.filters.dict(exclude_none=True)}"))

kauralasoo · 2025-01-22T17:47:30Z

Thanks for looking into this! Yes, that seems to be the problem. Ideally, the region_to_search should be defined based on gene TSS (we call this variable phenotype_pos in other parts of our workflows), but I'm not sure if this information is currently stored in the backend. However, if the records are sorted by genomic position then it would also be fine to define the region_to_search as the position of the first_record and first_record + 2Mb.

Do we know if what is returned is first_record is always the one that has the smalles possible genomic position for that gene/molecular trait?

tdurham86 · 2025-01-23T14:51:34Z

Thank you for the responses! This makes it clear that my main problem was a misunderstanding of how the API behaves for queries based on gene_id. I was expecting that searching by gene_id would return any eQTLs or sQTLs that affect that gene, and not that the query would be only those within a 2 MB window. I agree that @kauralasoo's proposed changes would make the API behavior more in line with what I expected it to do. However, for genes that are >2 Mb long, would there still potentially be some associations that are missed?

kauralasoo · 2025-01-23T16:07:01Z

The eQTL Catalogue QTL testing workflow (https://github.com/eQTL-Catalogue/qtlmap) only considers variants within +/-1Mb window from the TSS, so there would not be any additional missing associations. For extremely long genes we would still only report associations from the TSS window.

sprintell assigned karatugo Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering by gene_id returns partial results #93

Filtering by gene_id returns partial results #93

kauralasoo commented Dec 16, 2024

tdurham86 commented Dec 16, 2024

karatugo commented Jan 9, 2025

kauralasoo commented Jan 22, 2025

tdurham86 commented Jan 23, 2025

kauralasoo commented Jan 23, 2025

Filtering by gene_id returns partial results #93

Filtering by gene_id returns partial results #93

Comments

kauralasoo commented Dec 16, 2024

tdurham86 commented Dec 16, 2024

karatugo commented Jan 9, 2025

kauralasoo commented Jan 22, 2025

tdurham86 commented Jan 23, 2025

kauralasoo commented Jan 23, 2025