resampling half hourly data to hourly data takes a long time #8042
-
What happened?
I have an xarray DataArray with half-hourly data spanning several decades (100 years in the reproducible example). I am attempting to resample this to hourly data and take the mean. This takes a very long time: many minutes, and I have killed the process several times (> 10 minutes) when other dimensions are large (no other dimensions are present in the example given below). Repeating the same operation in pandas is very quick; pandas does not have the same issue.
NOTE: with xarray, I observed that the runtime increases as the resampling frequency increases (that is, as the resampling period gets shorter).
What did you expect to happen?
I expected the resample to complete very quickly: most functions in xarray have very high performance.
Minimal Complete Verifiable Example
# replicate issue: xarray resampling from half hourly data to hourly takes a long time
# %% import libraries
import numpy as np
import pandas as pd
import xarray as xr
# %% print package versions
print("numpy version", np.__version__)
print("pandas version", pd.__version__)
print("xarray version", xr.__version__)
# %% create half hourly data
time = pd.date_range("2000-01-01", "2100-1-1", freq="30T")
n = len(time)
data = np.random.uniform(size=n)
xarray_array = xr.DataArray(data=data, dims=["time"], coords=dict(time=time))
pandas_series = xarray_array.to_series()
# %% time - xarray: takes a long time ~1 min 40 sec
%%timeit
xarray_hourly = xarray_array.resample(time="H").mean()
# %% time - pandas: fast, ~ 100ms
%%timeit
pandas_hourly = pandas_series.resample("H").mean()
# %% additional note: for reference, resampling xarray monthly is fine
%%timeit
xarray_monthly = xarray_array.resample(time="M").mean()
# %% additional note: for reference, resampling xarray weekly takes a bit longer
%%timeit
xarray_weekly = xarray_array.resample(time="W").mean()
# %% additional note: for reference, resampling xarray daily also takes a long time, ~4 sec
%%timeit
xarray_daily = xarray_array.resample(time="D").mean()
# %% MVCE confirmation
Relevant log output
# version
python version 3.11.3
numpy version 1.24.3
pandas version 2.0.1
xarray version 2023.7.0
# xarray resample hourly:
1min 41s ± 2.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pandas resample hourly:
104 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
xarray: 2023.7.0
Replies: 3 comments
-
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
-
This works: installing the flox package improves the performance. Thank you for the quick response, much appreciated.
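For anyone reproducing this, a quick way to confirm that flox is importable in the same environment as xarray is sketched below. The helper name `has_package` is hypothetical and exists only for this illustration; `use_flox` is an xarray option that, as far as I know, already defaults to True, so no code change should be needed once flox is installed.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(name) is not None


# Check the optional dependency without importing it.
flox_available = has_package("flox")
print("flox installed:", flox_available)

if has_package("xarray"):
    import xarray as xr

    # Assumption: xarray is recent enough to know the use_flox option.
    # Setting it explicitly is redundant with the default and shown only
    # for clarity; use_flox=False can be used to compare timings.
    xr.set_options(use_flox=True)
```

If `flox installed:` prints `False`, the slow timings from the original report would be expected to persist.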
Can you run
mamba install flox
and report back please? See the "tip" here.
The scaling you see is that performance is inversely proportional to the number of "groups" or "periods", which is larger for higher-frequency resampling.
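The scaling described above can be illustrated with a small NumPy sketch. This is only an illustration, not xarray's or flox's actual implementation: the toy data, the variable names, and the use of `np.bincount` are assumptions chosen for the example. It computes the same grouped mean twice, once with one Python-level pass per group and once with a single vectorized pass over the data.

```python
import numpy as np

# Toy series: 10 days of half-hourly samples (48 per day); values are made up.
n_days = 10
values = np.arange(n_days * 48, dtype=float)

# Integer label of the hourly bin each sample falls into (two samples per hour).
hour_labels = np.arange(values.size) // 2
n_groups = int(hour_labels.max()) + 1

# Approach 1: one Python-level pass per group.
# The number of iterations, and so the runtime, grows with n_groups.
loop_means = np.array(
    [values[hour_labels == g].mean() for g in range(n_groups)]
)

# Approach 2: vectorized grouped mean.
# Sums and counts per group are accumulated in compiled code in one pass each.
sums = np.bincount(hour_labels, weights=values, minlength=n_groups)
counts = np.bincount(hour_labels, minlength=n_groups)
vectorized_means = sums / counts

# Both approaches give the same result.
assert np.allclose(loop_means, vectorized_means)
```

For the 100-year half-hourly series in the report, hourly resampling yields roughly 877,000 bins against roughly 1,200 for monthly resampling, which is consistent with hourly being far slower than monthly when the cost grows with the number of groups.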