add support for workflow_future_map #80

Open
yonicd opened this issue May 5, 2022 · 8 comments
Labels
feature a feature request or enhancement

Comments

@yonicd

yonicd commented May 5, 2022

Right now workflow_map runs sequentially over the rows of the workflowset object and only allows parallelization within a model tune. For users with access to HPC, it would be great to have a way to set a plan so that each row in the workflow set is sent to its own worker and run independently there.

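library(future)
# batchtools_MY_HPC stands in for the site-specific scheduler backend; N is the number of local workers
plan(list(batchtools_MY_HPC, tweak(multisession, workers = N)))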
@juliasilge
Member

Is that preferred over running sequentially over the workflow sets and then parallelizing each individual workflow? I'm not sure I see why (but I am not an HPC expert).

@yonicd
Author

yonicd commented May 5, 2022

In the setup of workflowsets the models in the tibble are independent.

If there are compute resources that can accommodate running them in parallel, then that would be a preferred option to save time.

For example, if I have N models and a nested CV setup with K|P (outer|inner) layers, then I would have N × K × P models to run. Even with a simple setup of 2 models, 3 outer and 100 inner layers, the number of models to run inflates very fast. It would be efficient to run in parallel beyond just the inner loop for tuning a given row of specifications.
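
As a rough illustration of the arithmetic above, this small R sketch counts the fits for the example numbers. The variable names are illustrative only and do not come from {workflowsets}.

```r
# Counts taken from the example: models, outer folds, inner resamples.
n_models <- 2
k_outer  <- 3
p_inner  <- 100

# Each model is refit once per outer/inner combination.
n_fits <- n_models * k_outer * p_inner
n_fits
```

This count does not yet include the tuning grid: each candidate parameter set would multiply it again.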

@topepo
Member

topepo commented May 11, 2022

There's definitely a good use case here but, right now, our implementation doesn't do what you want.

Personally, I think that it is a little risky. Having potentially long-running parallel jobs both between and within machines might lead to situations where something goes wrong and you lose the whole thing. I think that we've made workflow sets pretty fault tolerant but haven't tried anything like this.

In the past, I would generate separate scripts per model and send them off to the queuing system.

Would you like to make a PR (@simonpcouch is that ok)? We don't have hardware for your use case so you would have to do testing across machines.

@yonicd
Author

yonicd commented May 11, 2022

Thanks for the feedback. My setup currently integrates @wlandau's {targets} with {workflowsets} by mapping over the different models in the workflow set, so I don't run into problems of losing partially successful runs.

I'd be happy to add a PR to show what my intuition for implementing a {furrr} based version of workflow_map would look like, where the fallback default would be plan(sequential) which is basically what there is now.
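
A minimal sketch of that idea, assuming {future} and {furrr} are installed. It is not the {workflowsets} implementation: `fit_one()` and `wflow_ids` are hypothetical stand-ins for fitting one row of a workflow set, used only to show how the mapping falls back to sequential execution.

```r
library(future)
library(furrr)

# Hypothetical stand-in for tuning or fitting one row of a workflow set.
fit_one <- function(id) {
  Sys.sleep(0.1)
  data.frame(wflow_id = id, status = "done", stringsAsFactors = FALSE)
}

# Hypothetical workflow identifiers, one per row.
wflow_ids <- c("rec_lm", "rec_rf", "rec_knn")

# Fallback: under plan(sequential) the rows run one after another,
# which matches the current behaviour described above.
plan(sequential)
res <- future_map(wflow_ids, fit_one, .options = furrr_options(seed = TRUE))

# Combine the per-row results into one data frame.
do.call(rbind, res)
```

Replacing `plan(sequential)` with, for example, `plan(multisession, workers = 2)` would send the rows to separate workers without changing the mapping code.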

@yonicd
Author

yonicd commented May 11, 2022

> Personally, I think that it is a little risky. Having potentially long-running parallel jobs both between and within machines might lead to situations where something goes wrong and you lose the whole thing. I think that we've made workflow sets pretty fault tolerant but haven't tried anything like this.

This is closely related to an issue I opened a while back in {furrr}, which has a weak spot when it comes to accommodating failed elements: DavisVaughan/furrr#64
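
One common workaround for failed elements, sketched here under the assumption that {purrr} is available, is to wrap the mapped function in `purrr::safely()` so an error in one element does not discard the others. `fit_one()` and the model names are hypothetical, and this is not what the linked {furrr} issue proposes.

```r
library(future)
library(furrr)
library(purrr)

# Hypothetical task: one element fails on purpose.
fit_one <- function(id) {
  if (id == "bad_model") stop("fit failed for ", id)
  paste("fitted", id)
}

plan(sequential)

# safely() captures an error and returns list(result = NULL, error = <condition>)
# instead of aborting the whole map.
safe_fit <- safely(fit_one)
out <- future_map(c("rec_lm", "bad_model", "rec_rf"), safe_fit)

# Flag the elements that failed; the successful results are still in `out`.
failed <- map_lgl(out, function(x) !is.null(x$error))
failed
```

The successful results stay in the `result` slot of each element, so only the failed rows would need to be rerun.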

@simonpcouch simonpcouch added the feature a feature request or enhancement label May 24, 2022
@mglev1n

mglev1n commented Jul 20, 2022

> Thanks for the feedback. My setup currently integrates @wlandau's {targets} with {workflowsets} by mapping over the different models in the workflow set, so I don't run into problems of losing partially successful runs.
>
> I'd be happy to add a PR to show what my intuition for implementing a {furrr} based version of workflow_map would look like, where the fallback default would be plan(sequential) which is basically what there is now.

Not sure if there's been any progress on this, but would be a nice feature.

@yonicd - If not, I also use the {targets} package, and was curious if you'd share your implementation? Presumably you're mapping over each of the workflow objects contained in the info column (using either dynamic or static branching) returned by workflowsets::workflow_set()?

Apologies if this is better suited for discussion in the {targets} repo.

@simonpcouch
Contributor

Similar request on SO.

@simonpcouch
Contributor

If this ever comes to the top of our to-do list, worth reading "Nested parallelism and protection against it" in Bengtsson (2021).
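
A small sketch of the kind of protection that section refers to, assuming the {future} package's default behaviour: when only one plan level is specified, futures created inside a future should fall back to sequential processing. The check below asks, from inside a worker, how many workers the nested level may use.

```r
library(future)

# A single-level plan: only the outer layer is parallel.
plan(multisession, workers = 2)

# Evaluate nbrOfWorkers() inside a future, i.e. at the nested level.
inner_workers <- value(future(nbrOfWorkers()))
inner_workers

# Shut down the background sessions again.
plan(sequential)
```

Parallelism at both levels has to be requested explicitly with a list of plans, as in the plan(list(...)) call at the top of this thread.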

5 participants