Turn ParallelAnalysisBase into dask custom collection #135

yuxuanzhuang · 2020-08-17T13:41:22Z

Aim

Turn ParallelAnalysisBase into a custom dask collection (https://docs.dask.org/en/latest/custom-collections.html).

Current syntax

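import dask
import MDAnalysis as mda
import pmda.density
from MDAnalysisTests.datafiles import TPR, XTC

u = mda.Universe(TPR, XTC)
ow = u.select_atoms("name OW")
D = pmda.density.DensityAnalysis(ow, delta=1.0)

# Option one: prepare and run in a single call
D.run(n_blocks=2, n_jobs=2)

# Option two: build the dask graph first, then compute it
D.prepare_jobs(n_blocks=2)
D.compute(n_jobs=2)  # or dask.compute(D)

# Furthermore, several analyses can be computed together
dask.compute(D_1, D_2, D_3, D_4, ...)  # each D_x is an individual analysis job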

Implementation

class ParallelAnalysisBase(DaskMethodsMixin)
self.prepare_jobs() works as the first half of the old self.run(), i.e. it creates a dask graph. The difference is that ParallelAnalysisBase now also stores the graph (and its keys) itself.
self.run() checks whether the jobs are prepared and then runs them.
As a dask custom collection, self.compute() or dask.compute(ParallelAnalysisBase) will
- first generate a list of results (for the keys saved in self._keys) from self._dsk (the dask graph)
- then run self.dask_postpersist(), which concludes and rebuilds the results (self._post_reduce), i.e. the second half of the old self.run().
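
The steps above can be sketched with a toy collection that sums numbers instead of analysing trajectory frames. This is only an illustration under stated assumptions, not the pmda implementation: ToyParallelAnalysis, _block_sum and the key names are hypothetical, and only DaskMethodsMixin and the __dask_*__ methods come from dask's custom-collection interface. For compute(), dask calls __dask_postcompute__, so the reduce step is hooked in there in this sketch; __dask_postpersist__ (used by persist()) is left out for brevity.

```python
import dask
import dask.threaded
from dask.base import DaskMethodsMixin, tokenize
from dask.optimization import cull


def _block_sum(block):
    """Stand-in for the per-block analysis (hypothetical)."""
    return sum(block)


class ToyParallelAnalysis(DaskMethodsMixin):
    """Toy stand-in for ParallelAnalysisBase; not the pmda implementation."""

    # Default scheduler used when compute() is called without one.
    __dask_scheduler__ = staticmethod(dask.threaded.get)

    def __init__(self, frames):
        self._frames = list(frames)
        self._dsk = None   # dask graph, filled in by prepare_jobs()
        self._keys = None  # one key per block
        self.results = None

    def prepare_jobs(self, n_blocks=1):
        # First half of the old run(): split into blocks and build the graph.
        size = -(-len(self._frames) // n_blocks)  # ceiling division
        name = "toy-" + tokenize(self._frames, n_blocks)
        self._keys = [f"{name}-{i}" for i in range(n_blocks)]
        self._dsk = {
            key: (_block_sum, self._frames[i * size:(i + 1) * size])
            for i, key in enumerate(self._keys)
        }
        return self

    def run(self, n_blocks=1, **kwargs):
        # Prepare the jobs only if that has not been done yet, then compute.
        if self._dsk is None:
            self.prepare_jobs(n_blocks)
        return self.compute(**kwargs)

    # --- dask collection protocol ---
    def __dask_graph__(self):
        return self._dsk  # None until prepare_jobs() has been called

    def __dask_keys__(self):
        return self._keys

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        # Keep only the tasks needed for the requested keys.
        dsk2, _ = cull(dsk, keys)
        return dsk2

    def __dask_postcompute__(self):
        # dask calls the returned function with the list of block results.
        return self._post_reduce, ()

    def __dask_tokenize__(self):
        return (type(self).__name__, self._frames)

    def _post_reduce(self, block_results):
        # Second half of the old run(): combine the per-block results.
        self.results = sum(block_results)
        return self.results


D = ToyParallelAnalysis(range(1, 11))
D.prepare_jobs(n_blocks=3)
total = D.compute()
(total_again,) = dask.compute(D)
```

Here the reduce step returns the combined value; the real class could just as well return the analysis object itself.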

Advantage

- Multiple analyses can be run at the same time. This is useful when, e.g., we have dozens of short simulations that can each utilize only one core.
- The dask graph can be visualized with self.visualize().
- It offers a possible API for extending to complex analyses (building complex dask graphs).
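
The first point can be illustrated without pmda. The sketch below uses dask.delayed objects as stand-ins for the individual analysis jobs (D_1, D_2, ...); mean_value and the sample data are hypothetical. It only shows that a single dask.compute call accepts several collections and schedules them together.

```python
import dask
from dask import delayed


def mean_value(frames):
    """Stand-in for one single-core analysis of a short simulation."""
    return sum(frames) / len(frames)


# Hypothetical data: one short "trajectory" per simulation.
simulations = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]

# One lazy job per simulation, analogous to D_1, D_2, D_3.
jobs = [delayed(mean_value)(frames) for frames in simulations]

# A single call schedules all jobs together on two worker threads.
means = dask.compute(*jobs, scheduler="threads", num_workers=2)
```

dask.compute returns one result per job, in the order the jobs were passed in.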

Benchmark

TODO

Illustration

TODO


yuxuanzhuang · 2020-08-17T14:41:48Z

A few test cases
https://gist.github.com/yuxuanzhuang/73c80d5e0fe56930bc8a224973cb7903
The last missing image looks like this:

orbeckst · 2020-08-19T06:57:56Z

This is pretty cool!

Is there a downside? EDIT: I mean: what are the disadvantages of this approach?

yuxuanzhuang · 2020-08-20T09:28:30Z

As far as I can tell, I don't see any limitations of this approach (at least for the (block) split-apply-combine algorithm).

The speed doesn't seem to suffer (it may even be faster, but this needs further benchmarking; before: 26.77 s, after: 25.1 s).

From a developer perspective, it might be harder to maintain the code without knowledge of dask (since custom collections are sort of an "advanced feature"). There might be bits and pieces that need to be tuned/adjusted. And since the code will be deeply intertwined with dask, it will be hard to switch to other tools.

yuxuanzhuang linked a pull request Aug 17, 2020 that will close this issue

Turn ParallelAnalysisBase into dask custom collection #136

Open

4 tasks
