-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Fix GUI MPI available cores #871
base: master
Are you sure you want to change the base?
[WIP] Fix GUI MPI available cores #871
Conversation
11e8b3c
to
0c6f4df
Compare
@@ -320,6 +321,9 @@ def __init__(self, theme_color="#802989", | |||
# load default parameters | |||
self.params = self.load_parameters(network_configuration) | |||
|
|||
# Number of available cores | |||
self.n_cores = self._available_cores() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, I've been told that one shouldn't use all the cores available when doing multiprocessing ... the user should be told how many cores are available and how many of those they want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agreed. I documented the GUI's behavior in issue #870
Using the joblib backend the number of cores is exposed and adjustable. With the MPI backend it is not. MPI uses the total number of cores-1. I was curious if there was a reason it was designed this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you could adjust the number of cores in MPIBackend too. Is that not the case? I think n_cores - 1 may not be the best default ... but I'm not an MPI expert. I know it's a bad idea with joblibs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just set it to 1 rather than attempt to guess the total number of cores. During the tutorial, the user can be instructed to use more cores. The less magic there is in the code, the less likely it will fail.
Ok. I've exposed the number of cores selection for MPI. The default is set to 1 and the maximum should be the total number of cores available to the process rather than the total number of cores on the system. Note that core availability applies to Linux and Windows systems, Mac will just show the total system cores. This shouldn't be an issue given that this issue mostly applies to running on HPC nodes. Also the user can now adjust the number of cores for MPI manually. |
Sorry, I didn't notice this until now. I just made a comment on the issue page. I'm not opposed to this change, but also don't think it's absolutely necessary to expose the parallel backend API in the GUI with the sole reason that it adds yet another variable for new users to trip over. |
@rythorpe are you saying remove the exposed Backend option from the GUI entirely? We do switch the backend from joblib to MPI via the GUI in the workshops to speed things up for users and make them aware of that functionality, so I think we do need it. Though we can simplify it. I also weighed in on the issue page :) |
I guess it's fair to say we just use MPIBackend and default to joblib if not available ... so no need to expose n_cores for joblibbackend (expose only for mpibackend) and keep it simple |
I think users should be able to specify number of system resources the GUI is utilizing for a number of reasons.
That said, I could only test this because I've exposed the number of cores attribute to the user! |
I've been testing the workshop instances on Oscar (8 cores). I'm finding that MPI will fail on runs with 6-8 cores specified with the message below:
My instance is partitioned 8 cores on a 32 core node.
There must be some sort of hardware restriction that prevents process assignment to certain cores here. In this case we would want to use hnn-core/hnn_core/parallel_backends.py Lines 609 to 623 in ccef332
|
A few things:
Aside from changing the default to |
In addition to Metacell, we also have the GUI available on other shared resources / HPCs like NSG and Oscar. In future workshops, I believe the plan is to encourage more people to start using the GUI on NSG, and I wonder if we'll run into the issue @gtdang found where MPI fails |
I don't think the default for the GUI should be a backend that is an optional dependency that you can't install with the package and which requires variable set-up depending on user hardware and operating system. |
As @jasmainak suggested, the idea would be to default to Is there perhaps another way we can streamline use of the |
OK. I think that check for mpi and changing the backend on the start-up should be easy to implement. I agree that a change to the Given that unknown I feel like it's the easiest for now to expose the GUI argument to allow users to try different values that work for their system environment. |
c303f73
to
0954871
Compare
Added logic to check of packages for mpi use and setting the default backend. b5a07da |
# Displays the default backend options | ||
handle_backend_change(self.widget_backend_selection.value, | ||
self._backend_config_out, self.widget_mpi_cmd, | ||
self.widget_n_jobs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually suggesting something simpler ... hnn-core already defaults to JoblibBackend if MPI is not available, so all you had to do was expose MPIBackend and get rid of the JoblibBackend. Anyhow @dylansdaniels do you want this merged for the GUI tutorial today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That handler in the gui changes the displayed options based on the backend drop-down. It doesn't switch between backends. In hnn-core MPIBackend only switches to joblib if it's a single process, I don't think it checks for MPI and reverts to joblib if it can't find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, you're right. It does revert to joblib. Though it runs on joblib with only 1 processor, is there a reason why it only uses 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because specifying multiple processes with the joblib backend, unlike the MPI backend, deosn't make sense for all types of simulations. Since joblib parallelizes over multiple trials, it only works for simulations that include multiple trials. Plus parallel joblib processes are memory intensive, so silently switching from MPI to joblib can produce errors if the user wasn't anticipating needing the extra resources needed for a joblib task with many trials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @rythorpe . I would say the GUI should just use MPIBackend, and if not available, just run everything in a single core. It's probably a misnomer to call it "JoblibBackend" if it's running on a single core. Hence, I wouldn't bother exposing it to user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know if anyone who is actually using JoblibBackend
in the GUI for the their research? I personally don't know of anyone and am generally under the impression that folks who prefer the GUI would want a general-use parallel backend that works without too much fuss.
Your work @ntolley is unique in that your questions require lots of parameter sweeps, which are best served by using the API directly. Note that limited RAM is only an issue for Joblib, which I'm arguing should essentially default to running a single job within the context of the GUI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure a lot of Darcy's initial work compared over multiple trials, although I'll concede she's using the API now. If you're doing 1-1 comparisons of ERP's for instance it's definitely more robust to compute an average over multiple trials. In tutorials we use 1 trial because it's efficient, but that isn't necessarily how you should fit these signals to data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I totally agree with making MPI with oversubscription the default, but I genuinely think trial-based parallelism is a useful feature even in the GUI setting!
These are my cards on the table 😄, but if I get outvoted I'll concede
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. If GUI users use it, we should support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally feel that once a feature is added, users find ways to justify keeping it. That's why one should start with minimal features first and then add based on necessity and feedback from user base. Anyway, call is yours @ntolley !
That's fair. I still think it's non-ideal that MPI oversubscription isn't being invoked when it's supposed too, but we can fix this later if you think more investigation is needed. |
value='Joblib', | ||
description='Backend:') | ||
self.widget_backend_selection = ( | ||
Dropdown(options=[('Joblib', 'Joblib'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't bother exposing this to user
I've made a PR to @gtdang 's branch here: brown-ccv#2 which appears to work in and compensate for determining cores etc. in all of the expected use-cases . The only additions to George's changes that are not due to an master-update-rebase are in this commit: brown-ccv@25c78c0 Pasting the list of use-cases here from the commit message: In Note that this does NOT change the default number of jobs to use for the |
There is some git cleanup going to happen before this is merged: @gtdang is going to rebase his work onto a side branch on upstream, then I will apply my changes to that. This should make the result significantly easier to see the code changes. |
…based on the operation system
… attribute where relevant.
…d on start-up Previously the sub-options were hidden by default and only displayed when the backend dropdown was changed. This hid the number of cores option for the default joblib backend on start-up.
6159e1f
to
8f929ca
Compare
The GUI MPI was not working on HPC systems where the number of cores partitioned to an instance is less than the total cores of a node.
This fix uses the number of available cores to the process instead of the total number of cores of the system.
Work done
closes #870