Skip to content

Commit

Permalink
Refactor missing methods (#2058)
Browse files Browse the repository at this point in the history
<!--Please ensure the PR fulfills the following requirements! -->
<!-- If this is your first PR, make sure to add your details to the
AUTHORS.rst! -->
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
- This PR fixes #2000 and fixes #1820 and should please @tlogan2000 .
- [x] Tests for the changes have been added (for bug fixes / features)
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGELOG.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?
Refactor of the Missing objects. I tried to follow a more orthodox OOP
approach. In the new way:

- Objects are initialized with their options, called with the data (+
freq, src_timestep, indexer)
- Subclasses should override:
+ `__init__` to explicitly override the signature and document their
options, but this method should not do anything.
+ `validate`, a static method, which returns False on invalid options
(this is the same as before).
+ `is_missing`, which receives `null`, `count` and `freq`. It does the
same as before.
+ (optionnaly) `_validate_src_timestep`, to validate the `src_timestep`
at call time. Only useful for MissingWMO which is restricted to daily
inputs.
- Before, input validation was done in a few places, now it is almost
only done in `__call__`, which is not meant to be overriden.
- The methods do not receive`null` as a `DataArrayResample` object
anymore, but as a normal `DataArray`. This allows a bit more
flexibility, which I use to optimise `MissingWMO` by using
`resample_map` on the `longest_run` condition. Benchmarking to come.
- New `MissingTwoSteps` subclass used by `MissingPct` and
`AtLeastNValid` (and `MissingWMO`, but not in a new way). This adds a
`subfreq` option which can be used to divide the mask computation in two
steps.
     1. Compute the mask at `subfreq` using the given method
2. Merge the sub-groups at the target `freq` using the "any" method.


### Does this PR introduce a breaking change?
Yes, `MissingBase` and all its children have been modified in breaking
ways. However, these were not exposed in the public API. The convenience
functions should work as they did before.

Some users, though, might have implemented custom missing methods. These
will break, sorry. I hope the new way makes more sense.


### Other information:
I have yet to run `mypy` and tools in the like to see if I really fixed
#2000.

Also, I'll had some benchmarking to see if my change impacted
performance. In preliminary tests, `missing_wmo` ran at least 10x faster
on a dataset of 100 years x 50 points. And it had 1000x fewer dask
tasks.
  • Loading branch information
aulemahal authored Feb 11, 2025
2 parents abc4b0d + a1ba5c9 commit f08b13e
Show file tree
Hide file tree
Showing 7 changed files with 453 additions and 460 deletions.
11 changes: 10 additions & 1 deletion CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,11 @@ Changelog

v0.55.0 (unreleased)
--------------------
Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Trevor James Smith (:user:`Zeitsperre`), Sascha Hofmann (:user:`saschahofmann`).
Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Trevor James Smith (:user:`Zeitsperre`), Sascha Hofmann (:user:`saschahofmann`), Pascal Bourgault (:user:`aulemahal`).

Breaking changes
^^^^^^^^^^^^^^^^
* Missing value method "WMO" was modified to remove the criterion that the timeseries needs to be continuous (without holes). (:pull:`2058`).

Announcements
^^^^^^^^^^^^^
Expand All @@ -23,12 +27,17 @@ New features and enhancements
* `xclim` now tracks energy usage and carbon emissions ("last run", "average", and "total") during CI workflows using the `eco-ci-energy-estimation` GitHub Action. (:pull:`2046`).
* ``xclim.testing.helpers.test_timeseries`` now accepts a `calendar` argument that is forwarded to ``xr.cftime_range``. (:pull:`2019`).
* New ``xclim.indices.fao_allen98``, exporting the FAO-56 Penman-Monteith equation for potential evapotranspiration (:issue:`2004`, :pull:`2067`).
* Missing values method "pct" and "at_least_n" now accept a new "subfreq" option that allows to compute the missing mask in two steps. When given, the algorithm is applied at this "subfreq" resampling frequency first and then the result is resampled at the target indicator frequency. In the output, a period is invalid if any of its subgroup where flagged as invalid by the chosen method. (:pull:`2058`, :issue:`1820`).

Internal changes
^^^^^^^^^^^^^^^^
* `sphinx-codeautolink` and `pygments` have been temporarily pinned due to breaking API changes. (:pull:`2030`).
* Adjusted the ``TestOfficialYaml`` test to use a dynamic method for finding the installed location of `xclim`. (:pull:`2028`).
* Adjusted two tests for better handling when running in Windows environments. (:pull:`2057`).
* Refactor of the ``xclim.core.missing`` module, usage of the ``Missing`` objects has been broken. (:pull:`2058`, :issue:`1820`, :issue:`2000`).
- Objects are initialized with their options and then called with the data, input frequency, target frequency and indexer.
- Subclasses receive non-resampled DataArray in their ``is_missing`` methods.
- ``MissingWMO`` now uses ``xclim.indices.helpers.resample_map`` which should greatly improve performance in a dask context.

Bug fixes
^^^^^^^^^
Expand Down
20 changes: 9 additions & 11 deletions src/xclim/core/indicator.py
Original file line number Diff line number Diff line change
Expand Up @@ -1529,12 +1529,6 @@ def __init__(self, **kwds):
"Cannot set `missing_options` with `missing` method being from context."
)

# Validate hard-coded missing options
kls = MISSING_METHODS[self.missing]
self._missing = kls.execute
if self.missing_options:
kls.validate(**self.missing_options)

super().__init__(**kwds)

def _history_string(self, das, params):
Expand All @@ -1546,6 +1540,8 @@ def _history_string(self, das, params):

if missing != "skip":
mopts = self.missing_options or OPTIONS[MISSING_OPTIONS].get(missing)
if mopts.get("subfreq", "absent") is None:
mopts.pop("subfreq") # impertinent default
if mopts:
opt_str += f", missing_options={mopts}"

Expand All @@ -1560,17 +1556,19 @@ def _postprocess(self, outs, das, params):
outs = super()._postprocess(outs, das, params)

freq = self._get_missing_freq(params)
if self.missing != "skip" or freq is False:
method = (
self.missing if self.missing != "from_context" else OPTIONS[CHECK_MISSING]
)
if method != "skip" and freq is not False:
# Mask results that do not meet criteria defined by the `missing` method.
# This means all outputs must have the same dimensions as the broadcasted inputs (excluding time)
options = self.missing_options or OPTIONS[MISSING_OPTIONS].get(
self.missing, {}
)
options = self.missing_options or OPTIONS[MISSING_OPTIONS].get(method, {})
misser = MISSING_METHODS[method](**options)

# We flag periods according to the missing method. skip variables without a time coordinate.
src_freq = self.src_freq if isinstance(self.src_freq, str) else None
miss = (
self._missing(da, freq, src_freq, options, params.get("indexer", {}))
misser(da, freq, src_freq, **params.get("indexer", {}))
for da in das.values()
if "time" in da.coords
)
Expand Down
Loading

0 comments on commit f08b13e

Please sign in to comment.