Parallelising many cheap function calls over single dataset #8609
ollie-bell asked this question in Q&A (unanswered)
Looking for best practices/general principles to help me efficiently apply many (1000s of) cheap function calls to a single large dataset.
The dataset for my problem is high-resolution regional climate projections at daily frequency over approx. 100 years.
The compute resources available are a cluster of VMs (which can scale in size) with mounts to shared network storage holding the data in NetCDF files.
Is there a clear best strategy for using Dask to efficiently parallelise this workflow across workers on the cluster? There are a few issues I'm running into whilst prototyping:
Setting up everything lazily and realising it with a single .compute() produced an extremely large task graph. This just ran into horrendous scheduler bottlenecks.
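A rough way to see why the single-graph approach bottlenecks: the graph size scales roughly with (number of function calls) × (number of chunks), so thousands of cheap calls over a heavily chunked dataset can mean millions of tasks for the scheduler to track. A back-of-envelope sketch (all numbers hypothetical; real graphs also carry rechunk/merge tasks):

```python
# Rough task-count estimate for the fully lazy, single-.compute() approach.
# Numbers are hypothetical; actual Dask graphs include extra bookkeeping tasks.
def estimated_tasks(n_calls, n_chunks, tasks_per_call_per_chunk=1):
    """Estimate the task-graph size when every call is kept lazy."""
    return n_calls * n_chunks * tasks_per_call_per_chunk

# e.g. 2000 cheap functions over a dataset split into 5000 chunks
print(estimated_tasks(2000, 5000))  # 10_000_000 tasks in one graph
```

Once the graph reaches that order of magnitude, scheduling overhead alone can dominate the cheap per-task work.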
I next thought about distributing the function calls amongst the available workers manually with Dask futures: calling open_mfdataset() on the client side and passing the whole (lazily loaded) dataset to each worker in a future. Each worker then loops through its batch of function calls, calling .compute() (actually .to_netcdf()) inside the loop. However, this seems inefficient because there is a lot of redundant re-reading of data from disk; each worker only has a "view" of the whole dataset, and with this approach we can't have each worker independently load the whole dataset into memory.
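Splitting the list of function calls into per-worker batches can be done with a small helper; a minimal sketch (the helper name and the client.submit usage in the comment are illustrative, not from the original post):

```python
def batch_calls(funcs, n_workers):
    """Split a list of function calls into n_workers round-robin batches.

    Round-robin (rather than contiguous) batching helps balance load when
    the calls have slightly varying cost.
    """
    return [funcs[i::n_workers] for i in range(n_workers)]

# Each batch would then be submitted as one future, roughly:
#   futures = [client.submit(run_batch, batch, ds_path) for batch in batches]
# where run_batch loops over its calls and writes results with .to_netcdf().
```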
I think therefore a better approach is to load a subset of the dataset into each worker's memory; each worker (driven by Dask futures) then loops over the entire list of function calls and writes out results for its subset, which can be recombined in post-processing. For my problem the computations run over the entire time axis with spatial grid cells independent of each other, so it would make sense for each worker to get a spatial subset of the dataset over the full time range.
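The spatial partitioning step above can be sketched as follows: carve the (lat, lon) grid into contiguous rectangles, one per worker, while every worker keeps the full time axis. The helper name and the ds.isel usage in the comment are assumptions for illustration:

```python
def spatial_slices(n_lat, n_lon, n_workers_lat, n_workers_lon):
    """Partition a (lat, lon) grid into contiguous rectangular subsets,
    one per worker; the time axis stays whole on every worker."""
    def edges(n, parts):
        # Split n indices into `parts` near-equal contiguous slices.
        step, rem = divmod(n, parts)
        bounds, start = [], 0
        for i in range(parts):
            stop = start + step + (1 if i < rem else 0)
            bounds.append(slice(start, stop))
            start = stop
        return bounds

    return [(la, lo)
            for la in edges(n_lat, n_workers_lat)
            for lo in edges(n_lon, n_workers_lon)]

# Each worker would then load its rectangle eagerly, roughly:
#   sub = ds.isel(lat=la, lon=lo).load()   # full time range, spatial subset
# and apply every function call to `sub`, writing results per-subset.
```

Because grid cells are independent, the per-subset outputs can simply be concatenated along the spatial dimensions in post-processing.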
Am I missing something here? Is Dask futures the correct way to go about this type of problem? Any feedback welcome! Thanks
(Apologies on mobile so formatting is not great)