Parallelising many cheap function calls over single dataset #8609
ollie-bell asked this question in Q&A (unanswered)
Looking for best practices/general principles to help me efficiently apply many (1000s of) cheap function calls to a single large dataset.
The dataset for my problem is high-resolution regional climate projections at daily frequency over approx. 100 years.
The compute resources available are a cluster of VMs (which can scale in size) with mounts to shared network storage holding the data in NetCDF files.
Is there a clear best strategy for using Dask to efficiently parallelise this workflow across workers on the cluster? There are a few issues I'm running into whilst prototyping:
Setting up everything lazily and realising it with a single .compute() produced an extremely large task graph. This just ran into horrendous scheduler bottlenecks.
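A rough way to see why the single-graph approach bottlenecks: the graph size scales roughly with (number of function calls) × (number of chunks), so thousands of cheap calls over a heavily chunked dataset can mean millions of tasks for the scheduler to track. A back-of-envelope sketch (all numbers hypothetical; real graphs also carry rechunk/merge tasks):

```python
# Rough task-count estimate for the fully lazy, single-.compute() approach.
# Numbers are hypothetical; actual Dask graphs include extra bookkeeping tasks.
def estimated_tasks(n_calls, n_chunks, tasks_per_call_per_chunk=1):
    """Estimate the task-graph size when every call is kept lazy."""
    return n_calls * n_chunks * tasks_per_call_per_chunk

# e.g. 2000 cheap functions over a dataset split into 5000 chunks
print(estimated_tasks(2000, 5000))  # 10_000_000 tasks in one graph
```

Once the graph reaches that order of magnitude, scheduling overhead alone can dominate the cheap per-task work.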
I next thought about distributing the function calls amongst the available workers manually with Dask futures: calling open_mfdataset() on the client side and passing the whole (lazily loaded) dataset to each worker in a future. Each worker then loops through its batch of function calls, calling .compute() (actually .to_netcdf()) inside the loop. However, this seems inefficient because there is a lot of redundant re-reading of data from disk; each worker only has a "view" of the whole dataset, and with this approach we can't have each worker independently load the whole dataset into memory.
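Splitting the list of function calls into per-worker batches can be done with a small helper; a minimal sketch (the helper name and the client.submit usage in the comment are illustrative, not from the original post):

```python
def batch_calls(funcs, n_workers):
    """Split a list of function calls into n_workers round-robin batches.

    Round-robin (rather than contiguous) batching helps balance load when
    the calls have slightly varying cost.
    """
    return [funcs[i::n_workers] for i in range(n_workers)]

# Each batch would then be submitted as one future, roughly:
#   futures = [client.submit(run_batch, batch, ds_path) for batch in batches]
# where run_batch loops over its calls and writes results with .to_netcdf().
```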
I think therefore a better approach is to load a subset of the dataset into each worker's memory; each worker (driven by Dask futures) then loops over the entire list of function calls and writes out results for its subset, which can be recombined in post-processing. For my problem the computations run over the entire time axis with spatial grid cells independent of each other, so it would make sense for each worker to get a spatial subset of the dataset over the full time range.
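The spatial partitioning step above can be sketched as follows: carve the (lat, lon) grid into contiguous rectangles, one per worker, while every worker keeps the full time axis. The helper name and the ds.isel usage in the comment are assumptions for illustration:

```python
def spatial_slices(n_lat, n_lon, n_workers_lat, n_workers_lon):
    """Partition a (lat, lon) grid into contiguous rectangular subsets,
    one per worker; the time axis stays whole on every worker."""
    def edges(n, parts):
        # Split n indices into `parts` near-equal contiguous slices.
        step, rem = divmod(n, parts)
        bounds, start = [], 0
        for i in range(parts):
            stop = start + step + (1 if i < rem else 0)
            bounds.append(slice(start, stop))
            start = stop
        return bounds

    return [(la, lo)
            for la in edges(n_lat, n_workers_lat)
            for lo in edges(n_lon, n_workers_lon)]

# Each worker would then load its rectangle eagerly, roughly:
#   sub = ds.isel(lat=la, lon=lo).load()   # full time range, spatial subset
# and apply every function call to `sub`, writing results per-subset.
```

Because grid cells are independent, the per-subset outputs can simply be concatenated along the spatial dimensions in post-processing.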
Am I missing something here? Is Dask futures the correct way to go about this type of problem? Any feedback welcome! Thanks
(Apologies on mobile so formatting is not great)