Efficiently calculating neighbourhood stats around many locations in a 2D xarray grid #8429
Unanswered
jgomezdans asked this question in Q&A
Replies: 1 comment
- Also, note that while going through this, I came across some weird memory usage pattern calculating the std (set |
I have been trying to find the best way to optimise the following task:

- I have a large 2D array `da`. Say of size 10 000 x 10 000.
- I have a set of N ~ 20 000 sample locations, given by the index arrays `x_loc_idx` and `y_loc_idx`.
- For each location, I want to calculate neighbourhood statistics (mean and standard deviation over a window around it), and then select pixels `da[n_samps]` based on the stats.

I thought that I could use `rolling` to do this efficiently. I have a Dask Gateway cluster (actually, I want to run this on MS's Planetary Computer), and I thought that doing an `isel` after the mean/std calculations to fish out only the required pixels would be a good idea. However, even for a small number of selected pixels and a smallish window size, I run into out-of-memory issues. This kind of works, but only with a beefy Dask cluster behind it, and I don't really know whether it is the fastest way of performing these calculations. I have also implemented this as a function that operates on a subset of the original data, which I run as a `dask.delayed` function. It works, but I was wondering what the best approach is.
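Roughly, the `rolling` + pointwise `isel` approach I mean looks like this (a minimal in-memory sketch with small sizes; `win`, the dummy data, and the `points` dim name are just for illustration):

```python
import numpy as np
import xarray as xr

# Small stand-in for the real 10 000 x 10 000 grid.
ny, nx = 100, 100
da = xr.DataArray(
    np.random.default_rng(42).normal(size=(ny, nx)),
    dims=("y", "x"),
)

# Stand-ins for x_loc_idx / y_loc_idx.
rng = np.random.default_rng(0)
n_points = 50
y_loc_idx = rng.integers(0, ny, n_points)
x_loc_idx = rng.integers(0, nx, n_points)

# Rolling window stats over the whole grid...
win = 5
roller = da.rolling(y=win, x=win, center=True, min_periods=1)
mean = roller.mean()
std = roller.std()

# ...then vectorised pointwise indexing: DataArray indexers that share a
# "points" dim select one pixel per location (not an outer product).
points = dict(
    y=xr.DataArray(y_loc_idx, dims="points"),
    x=xr.DataArray(x_loc_idx, dims="points"),
)
mean_at_locs = mean.isel(**points)
std_at_locs = std.isel(**points)

print(mean_at_locs.shape)  # (50,)
```

On the real data `da` would be dask-backed, and the worry is that `mean`/`std` materialise rolling stats for the whole grid even though only N pixels are kept.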
Thanks!
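For reference, the per-subset alternative I mentioned is essentially this (names like `window_stats` and `half` are illustrative; in the cluster version each call is wrapped with `dask.delayed` and gathered with `dask.compute`, here it runs eagerly so the logic is easy to check):

```python
import numpy as np

def window_stats(arr, yi, xi, half):
    # Clip the window at the array edges, like rolling(min_periods=1) would.
    sub = arr[max(0, yi - half):yi + half + 1, max(0, xi - half):xi + half + 1]
    return float(sub.mean()), float(sub.std())

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 200))
locs = rng.integers(0, 200, size=(20, 2))

# Dask version would be:
#   tasks = [dask.delayed(window_stats)(data, int(y), int(x), 2) for y, x in locs]
#   results = dask.compute(*tasks)
results = [window_stats(data, int(y), int(x), 2) for y, x in locs]
print(len(results))  # 20
```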