ENH: Searching within a time interval #623

aulemahal · 2023-07-07T22:02:42Z

(I first posted this as a comment on PR #291, but then realized that the PR has been stale for three years, so I figured a new issue would be more appropriate.)

We ran into this need when writing xscen. Our solution does not yet feel generic enough to be implemented here, but if you want to have a look, a first version is here: https://github.com/Ouranosinc/xscen/blob/b261a04ed73e398a60a0632bdb29be324dc3f5b6/xscen/catalog.py#L899-L998

The idea is similar to the stale PR. You have a "date_start" and a "date_end" column in the catalog. For a given "period", the code returns the rows whose date_start to date_end interval overlaps that period. Because of the limitations of pandas < 2, we had to use pd.Period objects in our catalogs, and the code suffers a bit from this workaround.
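A minimal sketch of the overlap test described above, not the xscen implementation. The catalog contents, the file names, and the helper `search_overlap` are hypothetical; only the "date_start" and "date_end" column names come from the description.

```python
import pandas as pd

# Hypothetical catalog: one row per file, with the two time-bound columns.
catalog = pd.DataFrame(
    {
        "path": ["a_1950-1999.nc", "a_2000-2049.nc", "a_2050-2099.nc"],
        "date_start": pd.to_datetime(["1950-01-01", "2000-01-01", "2050-01-01"]),
        "date_end": pd.to_datetime(["1999-12-31", "2049-12-31", "2099-12-31"]),
    }
)


def search_overlap(df, start, end):
    """Return the rows whose [date_start, date_end] interval overlaps [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Two closed intervals overlap when each one starts before the other ends.
    mask = (df["date_start"] <= end) & (df["date_end"] >= start)
    return df[mask]


hits = search_overlap(catalog, "1990-01-01", "2020-12-31")
print(hits["path"].tolist())
```

Here the 1990-2020 period touches the first two files, so those two rows are returned and the 2050-2099 file is dropped.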

In addition to a simple "overlap", our function tries to guess the percentage of the period that is covered by the rows of the dataframe, so that we only return the subset if a significant percentage is covered. This comes with the restriction that it only makes sense if the rows of the dataset do not overlap temporally, as for a single variable split temporally across multiple files. A "full overlap" would often be too strict because of so many caveats (different calendars, imprecise date bounds).
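One way the coverage idea could look, as a rough sketch only. The helper `coverage_fraction` and the 0.99 threshold are assumptions for illustration, and the calculation is only meaningful when the rows do not overlap in time, as noted above.

```python
import pandas as pd


def coverage_fraction(df, start, end):
    """Fraction of [start, end] covered by the rows of df (rows must not overlap)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Clip every row's interval to the requested period.
    clipped_start = df["date_start"].where(df["date_start"] > start, start)
    clipped_end = df["date_end"].where(df["date_end"] < end, end)
    duration = clipped_end - clipped_start
    # Rows entirely outside the period give negative durations; ignore them.
    covered = duration[duration > pd.Timedelta(0)].sum()
    return covered / (end - start)


# Hypothetical single variable split over two files.
files = pd.DataFrame(
    {
        "date_start": pd.to_datetime(["1950-01-01", "2000-01-01"]),
        "date_end": pd.to_datetime(["1999-12-31", "2049-12-31"]),
    }
)

full = coverage_fraction(files, "1990-01-01", "2020-12-31")
partial = coverage_fraction(files.iloc[[1]], "1990-01-01", "2020-12-31")

# Assumed threshold: accept the subset only if nearly all of the period is covered.
accept = full >= 0.99
```

The one-day gap between 1999-12-31 and 2000-01-01 keeps `full` slightly below 1, which illustrates why requiring a strict full overlap would reject catalogs with imprecise date bounds.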

I recently tried using datetime64[ms] columns with pandas >= 2, which simplifies the function a bit and lets us use more pd.Interval magic. It is here: https://github.com/Ouranosinc/xscen/blob/05054bfbf450c6b332e239e7866f766f51a47ed0/xscen/catalog.py#L892-L974.
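For illustration, the same overlap search can be written with pd.IntervalIndex on datetime64[ms] columns. This is a sketch assuming pandas >= 2 and a hypothetical catalog; it is not the code behind the link.

```python
import pandas as pd

# Hypothetical catalog with the two time-bound columns.
catalog = pd.DataFrame(
    {
        "path": ["a_1950-1999.nc", "a_2000-2049.nc", "a_2050-2099.nc"],
        "date_start": pd.to_datetime(["1950-01-01", "2000-01-01", "2050-01-01"]),
        "date_end": pd.to_datetime(["1999-12-31", "2049-12-31", "2099-12-31"]),
    }
)

# Assumes pandas >= 2, which can store millisecond-resolution datetimes.
catalog["date_start"] = catalog["date_start"].astype("datetime64[ms]")
catalog["date_end"] = catalog["date_end"].astype("datetime64[ms]")

# One closed interval per catalog row.
intervals = pd.IntervalIndex.from_arrays(
    pd.DatetimeIndex(catalog["date_start"]),
    pd.DatetimeIndex(catalog["date_end"]),
    closed="both",
)

# The requested period, also as a closed interval.
period = pd.Interval(
    pd.Timestamp("1990-01-01"), pd.Timestamp("2020-12-31"), closed="both"
)

# IntervalIndex.overlaps returns a boolean array, one entry per row.
hits = catalog[intervals.overlaps(period)]
```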

There are still some caveats and questions to answer, I think.

How do we tell intake-esm which columns are the time bounds?
How do we solve the "coverage" issue neatly?

With input from the intake-esm devs and users, we (Ouranos) could consider investing some time into adapting and upstreaming our solution as we would be more than happy to make xscen thinner.


dcherian · 2023-07-07T22:08:58Z

cc @klindsay28 who has worked on a solution for this

An alternative suggestion is to kerchunk, and then just subset using .sel so zarr handles the subsetting to appropriate files.

aulemahal · 2023-07-07T22:31:40Z

@dcherian indeed! I guess I need to go back and check whether kerchunk works well with local netCDF files. Last time we looked (more than a year ago), it seemed to add more issues than it solved.

Further info: One way we use this functionality in xscen is to find which datasets in the catalog provide the given time period, but we then use the result to repeat the search without a time period and use the actual full dataset. Example: I want to run my computation on the full simulations, but I need at least the 1990-2020 period. It would be nice if a final implementation could do this logic in a single search (return the dataset if it contains the period, without subsetting).
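A sketch of what such a single search could look like, under stated assumptions: the "id" column, the catalog contents, and the helper `datasets_containing` are hypothetical, and the check only compares each dataset's earliest start and latest end, so gaps between files are not detected.

```python
import pandas as pd

# Hypothetical catalog: "id" identifies a dataset split over several files.
catalog = pd.DataFrame(
    {
        "id": ["sim_a", "sim_a", "sim_a", "sim_b"],
        "path": ["a_1950.nc", "a_2000.nc", "a_2050.nc", "b_2006.nc"],
        "date_start": pd.to_datetime(
            ["1950-01-01", "2000-01-01", "2050-01-01", "2006-01-01"]
        ),
        "date_end": pd.to_datetime(
            ["1999-12-31", "2049-12-31", "2099-12-31", "2100-12-31"]
        ),
    }
)


def datasets_containing(df, start, end, key="id"):
    """Return ALL rows of every dataset whose overall time range contains [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Overall time range of each dataset across its files.
    bounds = df.groupby(key).agg(
        first=("date_start", "min"), last=("date_end", "max")
    )
    keep = bounds.index[(bounds["first"] <= start) & (bounds["last"] >= end)]
    # No temporal subsetting: every file of a matching dataset is kept.
    return df[df[key].isin(keep)]


result = datasets_containing(catalog, "1990-01-01", "2020-12-31")
```

With this data, `sim_a` is returned with all three of its files, including the 2050-2099 one outside the requested period, while `sim_b` is dropped because it only starts in 2006.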

charles-turner-1 · 2025-01-31T06:39:53Z

@aulemahal have you looked into this any further in the time since? I've looked at the head of xscen and it looks like only a couple of lines have changed since https://github.com/Ouranosinc/xscen/blob/05054bfbf450c6b332e239e7866f766f51a47ed0/xscen/catalog.py#L892-L974 - mostly related to type hints and logging.

I'd be interested in working on adapting your solution into intake-esm, so if there's any more progress in other files or your solution is working & stable it'd be great to know!

aulemahal · 2025-01-31T14:51:07Z

Sadly, I have not looked into this any more since I wrote the issue. The solution in xscen is quite stable I think!

The way it works in xscen requires two columns in the catalog, so one issue I had when thinking about how to adapt it to intake-esm was that it would require some kind of change in the spec?

charles-turner-1 · 2025-02-03T04:07:49Z

No worries - great to know the solution is stable.

I'll start looking into this a bit more deeply, and I'll ping you if I've got any questions.

dougiesquire mentioned this issue Jul 22, 2024

Converting notebooks from COSIMA Cookbook to ACCESS-NRI intake catalog COSIMA/cosima-recipes#313

Open

43 tasks

charles-turner-1 mentioned this issue Feb 3, 2025

Next version release timeline #688

Closed
