ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

sb4233 · 2024-09-30T08:23:02Z

Hi,
I have been trying to do some spectral analysis using variables from ACCESS-OM2 output. Due to its large, chunked data doing any kind of analysis is very slow. For example, I am calculating the coherence between two variables (using scipy.signal.coherence) at every grid point for a specific domain (356x500). Now the actual calculation takes only about 3-4 mins (non-chunked). But due to being chunked it takes forever to do the calculation (as the data is being loaded into memory).

As a cheap alternative I found that saving the data as early as possible in my calculation (for example, saving the data just after selecting the variable for the region of interest) works (i.e., reducing the number of operations that I need to do while the data is in chunked state). But even in that case it takes about several hours per variable to save it in a netcdf file.

I wanted to know if there is a better way to effectively chunk large datasets so that processing time can be reduced (as much as possible).

Maybe adding a method to cosima cookbook which can dynamically chunk large datasets based on the operation that is being performed on it? I am new to this kind of programming so any help would be much appreciated :)

The text was updated successfully, but these errors were encountered:

Thomas-Moore-Creative · 2024-09-30T23:19:37Z

thanks @sb4233 - do you mind if we edit the issue title so that it's more focused and descriptive?

Also: can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against? Are they 2D? 3D? frequency?

THANKS

navidcy · 2024-10-01T00:27:51Z

Btw, @sb4233 note that the cosima_cookbook Python package is deprecated so there won't be any method added to it.

I think the issue is that the data is chunked in times based on how the files are saved in netCDF files (e.g., every 3 months for 0.1 degree model output). So if one needs to do a time-series analysis at every point you need to rechunk in time. I've bumped onto this before and I didn't find a better solution but perhaps I was just naïve!

Btw, you might wanna have a look at the xrft package? Sorry if I misunderstood and this is not something useful.

marc-white · 2024-10-01T00:34:26Z

@sb4233 would you be able to add some code snippets so we can see what you're trying to do?

sb4233 · 2024-10-01T01:36:22Z

thanks @sb4233 - do you mind if we edit the issue title so that it's more focused and descriptive?

Also: can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against? Are they 2D? 3D? frequency?

THANKS

Yeah sure, please go ahead and edit the title.
As for the details of my use case -

I am using u, v and SST from ACCESS-OM2-01 which is in daily frequency.
My data arrays are 3D, i.e., (time, lat, lon). u and v are at a particular level.
The dimensions of my data are (time:18250, lat:356, lon:500)
And my aim is to calculate the magnitude squared coherence between each component of the current and SST.

Btw, @sb4233 note that the cosima_cookbook Python package is deprecated so there won't be any method added to it.

I think the issue is that the data is chunked in times based on how the files are saved in netCDF files (e.g., every 3 months for 0.1 degree model output). So if one needs to do a time-series analysis at every point you need to rechunk in time. I've bumped onto this before and I didn't find a better solution but perhaps I was just naïve!

Btw, you might wanna have a look at the xrft package? Sorry if I misunderstood and this is not something useful.

Thanks for the suggestion, seems like xrft can be useful, as it utilizes dask API.

@sb4233 would you be able to add some code snippets so we can see what you're trying to do?

Nothing special, essentially just trying this function below (coherence()) to calculate the squared magnitude coherence between two data arrays at every grid point (i,j) and I use joblib.parallel to parallelly loop over i,j -

def compute_coherence_slice(data_slice1, data_slice2, i, j, fs, nperseg, window, noverlap, nfft):
        f, Cxy = coherence(data_slice1, data_slice2, fs=fs, nperseg=nperseg, window=get_window(window, nperseg), noverlap=noverlap, nfft=nfft)
        return Cxy, f

Thomas-Moore-Creative · 2024-10-01T05:47:38Z

Hey @sb4233, hopefully that new title is representative of your use case ( one shared by others ).

Next steps might be to access daily ACCESS-OM2-01 via intake catalog, including helpful xarray kwargs, followed by writing temporary ARD Zarr collections for u,v, and SST to scratch/vn19? I'll have a go at this in my spare time tonight or tomorrow - but you or others might get there too.

Look forward to documenting better-practice for these specific use cases with you and others.

Thomas-Moore-Creative · 2024-10-01T05:52:57Z

@sb4233 - a very useful ref from @dougiesquire et al.

access-nri-intake-catalog

and for storage of any temporary intermediate ARD collections on vn19 let's please use: /scratch/vn19/ard/ACCESS-OM2-01

Thomas-Moore-Creative · 2024-10-01T21:12:07Z

@sb4233 et al

Here's the kind of overall workflow I'm suggesting each of these specific heuristics could contribute to:

You can see and download our full poster from OMO2024 here: https://go.csiro.au/FwLink/climate_ARD

Thomas-Moore-Creative self-assigned this Sep 30, 2024

Thomas-Moore-Creative added the ARD workflows for analysis-ready data label Sep 30, 2024

Thomas-Moore-Creative changed the title ~~Better and faster processing of large datasets using cosima cookbook~~ ARD workflow for time series analysis of ACCESS-OM2-01 daily output Oct 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

sb4233 commented Sep 30, 2024

Thomas-Moore-Creative commented Sep 30, 2024

navidcy commented Oct 1, 2024

marc-white commented Oct 1, 2024

sb4233 commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024

ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

ARD workflow for time series analysis of ACCESS-OM2-01 daily output #462

Comments

sb4233 commented Sep 30, 2024

Thomas-Moore-Creative commented Sep 30, 2024

navidcy commented Oct 1, 2024

marc-white commented Oct 1, 2024

sb4233 commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024

Thomas-Moore-Creative commented Oct 1, 2024