Turn ParallelAnalysisBase into dask custom collection #135
Comments
A few test cases
This is pretty cool! Is there a downside? EDIT: I mean: what are the disadvantages of this approach?
As far as I can tell, this approach has no limitations, at least for the (block) split-apply-combine algorithm. The speed does not seem to suffer and may even be slightly better (before: 26.77 s, after: 25.1 s), though this needs further benchmarking. From a developer's perspective, the code might be harder to maintain without knowledge of dask, since custom collections are something of an "advanced feature". Some bits and pieces might need to be tuned or adjusted. And since the code will be deeply intertwined with dask, it will be hard to switch to other tools later.
Aim
Turn ParallelAnalysisBase into a custom dask collection (https://docs.dask.org/en/latest/custom-collections.html).
Current syntax
Implementation
class ParallelAnalysisBase(DaskMethodsMixin)
self.prepare_jobs will work as the first half of the old self.run(), i.e. create a dask graph. The difference is that ParallelAnalysisBase also stores the graph (and the keys) itself.
self.run will check if the jobs are prepared and run the jobs.
self.compute() or dask.compute(ParallelAnalysisBase) will call self.run.
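The control flow described above can be sketched without any dependencies. This is only an illustration, not the PMDA implementation: the class name ParallelAnalysisSketch, the helper _simple_get, the per-block "analysis" (summing frame indices) and the string keys are all made up for the example. Only the method names __dask_graph__ and __dask_keys__ come from dask's custom collection protocol. A real version would subclass DaskMethodsMixin, implement the rest of the protocol, and let a dask scheduler execute the graph instead of the tiny evaluator used here. The sketch also assumes that run prepares the jobs itself if that has not happened yet.

```python
def _simple_get(graph, keys):
    """Tiny stand-in for a dask scheduler (hypothetical helper).

    Evaluates flat tasks of the form (func, *args); an argument that is a
    string naming another key in the graph is treated as a dependency.
    """
    def evaluate(key):
        func, *args = graph[key]
        resolved = [
            evaluate(a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)

    return [evaluate(k) for k in keys]


class ParallelAnalysisSketch:
    """Hypothetical stand-in for ParallelAnalysisBase."""

    def __init__(self, n_frames, n_blocks):
        self.n_frames = n_frames
        self.n_blocks = n_blocks
        self._graph = None  # filled in by prepare_jobs()
        self._keys = None
        self.results = None

    def _single_block(self, start, stop):
        # Placeholder per-block analysis: sum the frame indices in the block.
        return sum(range(start, stop))

    def prepare_jobs(self):
        """First half of the old run(): build the graph and store it with its keys."""
        bounds = [i * self.n_frames // self.n_blocks for i in range(self.n_blocks + 1)]
        self._graph = {
            f"block-{i}": (self._single_block, bounds[i], bounds[i + 1])
            for i in range(self.n_blocks)
        }
        self._keys = list(self._graph)
        return self

    # Method names taken from dask's custom collection protocol.
    def __dask_graph__(self):
        return self._graph

    def __dask_keys__(self):
        return self._keys

    def run(self):
        """Check that the jobs are prepared, then run them and combine the results."""
        if self._graph is None:  # assumption: prepare on demand
            self.prepare_jobs()
        block_results = _simple_get(self.__dask_graph__(), self.__dask_keys__())
        self.results = sum(block_results)  # combine step of split-apply-combine
        return self

    def compute(self):
        """compute() simply delegates to run(), as proposed in the issue."""
        return self.run()


analysis = ParallelAnalysisSketch(n_frames=10, n_blocks=3)
analysis.prepare_jobs()             # graph and keys are now stored on the object
print(analysis.__dask_keys__())
print(analysis.compute().results)
```

Because the graph and keys live on the object after prepare_jobs, they can be inspected (or, with dask, visualized) before anything is executed.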
Advantage
Benchmark
TODO
Illustration
TODO