Prototype of MPIEvaluator for multi-node workloads #292
Conversation
Migrate the Python script from mpi4py.MPI to mpi4py.futures. Note that mpi4py.futures requires a separate launcher script to create the MPI environment and launch the main script. When using mpi4py.futures, the MPIPoolExecutor needs to be started in a separate process: the MPIPoolExecutor must be created, and tasks submitted to it, within an if __name__ == '__main__': block to prevent recursive creation of MPIPoolExecutor instances.

Also, in the bash commands:
- Modify the bash code to remove or clear the python-test directory if it already exists. This ensures a clean environment for every test run.
- Simplify the scp file transfer code to one line.
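For reference, a minimal sketch of the pattern described above; `run_experiment` and the number of tasks are hypothetical placeholders rather than the actual workbench code:

```python
# launcher.py -- minimal sketch of the mpi4py.futures pattern described above.
from mpi4py.futures import MPIPoolExecutor


def run_experiment(experiment_id):
    # Placeholder workload; the real script would run one EMA experiment here.
    return experiment_id**2


if __name__ == "__main__":
    # The executor must be created (and tasks submitted) inside this guard,
    # so that dynamically spawned workers do not recursively create new pools.
    with MPIPoolExecutor() as executor:
        results = list(executor.map(run_experiment, range(20)))
        print(results)
```

Depending on the MPI setup, such a script is typically launched either directly (letting `MPIPoolExecutor` spawn its workers dynamically) or via `mpiexec -n <N> python -m mpi4py.futures launcher.py`.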
Make sure that the scripts also work when the number of cores (in this case 10) is smaller than the number of jobs (20). In the EMAworkbench, when testing a million experiments, there won't always be a million cores available.
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood. When you launch your script with `mpirun`, it spawns multiple MPI processes. If you then use `MPIPoolExecutor`, it tries to spawn additional processes (or threads, in this case) for each MPI process. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, leading to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid nested parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (using send/receive) or use the `MPIPoolExecutor`.

2. **Use `MPIPoolExecutor` without `mpirun`**: The way the `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the `MPIPoolExecutor` will manage the creation and distribution of tasks across the MPI processes. Adjust your SLURM script:

```bash
#!/bin/bash
# Other directives...

# You don't need mpirun here
python my_model.py > py_test.log
```

3. **Adjust the code**: Since you're launching without `mpirun`, you don't have to worry about the world size or checking against the number of jobs. The `MPIPoolExecutor` will automatically manage the tasks for you.

4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, you can consider using `MPI.COMM_WORLD.Spawn()` to launch the worker processes instead of using the `MPIPoolExecutor`.

Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Always test on a smaller subset of cores/nodes and monitor your jobs to ensure they're behaving as expected.
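As a small addition to point 2, the pool size can also be capped explicitly so the spawned workers stay within the SLURM allocation; `max_workers` is a documented `MPIPoolExecutor` argument, while the job function and the worker count here are illustrative assumptions:

```python
# Sketch: keep the dynamically spawned workers within the allocated cores.
from mpi4py.futures import MPIPoolExecutor


def run_job(i):
    # Illustrative placeholder for a single experiment.
    return i * i


if __name__ == "__main__":
    # max_workers caps how many worker processes the executor spawns;
    # 10 matches the example allocation used in this thread.
    with MPIPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(run_job, range(20)))
```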
callback(experiment, outcomes)
def run_experiment_mpi(packed_data):
You now instantiate an ExperimentRunner for each separate experiment. In principle, you should have to instantiate this just once, by making it a global. I'm not sure about the performance overhead of this. Relatedly, you now pack all models for each experiment; again, this might not be the most efficient. Using an initializer function can solve both problems (see MultiprocessingEvaluator and ema_multiprocessing.py).
I implemented the global ExperimentRunner in b8bf0e7. Is this what you had in mind?
yes, this is now also consistent with how it is handled in multiprocessing
Create some commands to use locally using WSL2 with Ubuntu
This commit introduces two primary architectural modifications to the MPIEvaluator class, aimed at improving efficiency when running experiments on HPC systems:

1. **Singleton ExperimentRunner**: Previously, the `ExperimentRunner` was instantiated for every individual experiment. This approach could introduce unnecessary overhead, especially when dealing with a large number of experiments. Instead, we've adopted a pattern where the `ExperimentRunner` is instantiated once and shared among all the worker processes in the MPI pool. This is achieved using an initializer function, `mpi_initializer`, which sets up a global `ExperimentRunner` for all the worker processes.

2. **Optimized model packing**: Before this change, all models were packed and sent with each experiment. This was potentially inefficient, especially when the model objects were large. Now, we've altered the architecture to send only the model name with each experiment. Since the `ExperimentRunner` already has access to all models (being initialized once with all of them), it can easily fetch the necessary model using the provided model name.

The primary motivation behind these changes is to reduce the overhead related to object instantiation and data transfer, especially when running experiments on large-scale HPC systems with SLURM.
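A simplified, hypothetical sketch of the architecture described in this commit message; the names mirror the commit message (`mpi_initializer`, `run_experiment_mpi`, a global runner), but the bodies are illustrative stand-ins rather than the actual workbench implementation, and it assumes an mpi4py version whose `MPIPoolExecutor` accepts `initializer`/`initargs`:

```python
from collections import namedtuple

from mpi4py.futures import MPIPoolExecutor

# Stand-in for a workbench model; only here to make the sketch self-contained.
Model = namedtuple("Model", ["name"])

# Global set up once per worker by the initializer, instead of being
# re-created for every single experiment.
experiment_runner = None


def mpi_initializer(models):
    # Runs once in each worker process: build the shared runner from all models.
    global experiment_runner
    experiment_runner = {model.name: model for model in models}  # stand-in for ExperimentRunner


def run_experiment_mpi(packed):
    # Only the model *name* travels with each experiment; the worker looks the
    # model up in the runner that the initializer already created.
    model_name, experiment = packed
    model = experiment_runner[model_name]
    return model.name, experiment  # stand-in for actually running the experiment


if __name__ == "__main__":
    models = [Model("model_a"), Model("model_b")]
    experiments = [("model_a", i) for i in range(10)] + [("model_b", i) for i in range(10)]
    with MPIPoolExecutor(initializer=mpi_initializer, initargs=(models,)) as pool:
        print(list(pool.map(run_experiment_mpi, experiments)))
```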
6.2.2 works, 6.3.0 gives very strange errors pointing to a Java ASM incompatibility. Possibly an error in PyNetLogo. See NetLogo/NetLogo#2171
Identical to the current DelftBlue commands.txt, to be able to track changes
Rewrite for NetLogo. Current problem: extracting with tar -xzf NetLogo-6.2.2-64.tgz takes an extremely long time.
Update paths used in NetLogo.py for DelftBlue
If you only need it in one method of one class, just import it in that one method of that one class (and not in the __init__). This makes sure mpi4py is only required when the MPIEvaluator is used. The challenging part will be conditionally importing MPI to establish the rank based on which evaluator is used.
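A minimal sketch of such a deferred import; the class and method names here are illustrative, not the actual evaluators.py code:

```python
class MPIEvaluator:
    # Hypothetical, simplified example: mpi4py is imported only inside the
    # method that needs it, so the package is required only when this
    # evaluator is actually used.
    def initialize(self):
        from mpi4py.futures import MPIPoolExecutor  # deferred import

        self._executor = MPIPoolExecutor()
        return self

    def finalize(self):
        self._executor.shutdown()
```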
The logging statements in the run_experiment_mpi() function don't adhere to the model logger level and formatting. TODO: Research and fix this.
- Introduced logging level propagation to the `MPIEvaluator` to ensure consistent logging across MPI processes.
- Adjusted the `mpi_initializer` function to accept a logger level and configure the logging for each worker accordingly.
- Changed the logging statements within `MPIEvaluator` and `run_experiment_mpi` from DEBUG to INFO for better visibility.
- Removed a redundant TODO regarding logging in `run_experiment_mpi`.
- Modified the `log_to_stderr` method in `ema_logging.py` to allow setting logging levels on all root module loggers.
- Simplified the logging setup in the `ema_model.py` script and ensured proper propagation of logging levels.
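A rough sketch of how a log level might be propagated to the workers through the initializer; the name and signature are illustrative, the actual changes live in evaluators.py and ema_logging.py:

```python
import logging


def mpi_initializer(loglevel):
    # Runs once in each MPI worker: configure that worker's logging so its
    # level and format match what was chosen in the main process.
    logging.basicConfig(
        level=loglevel,
        format="[%(processName)s/%(levelname)s] %(message)s",
    )
```

The main process would then pass the chosen level to the workers via the executor's `initargs`, in the same way the models are passed in the architecture sketch above.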
@quaquel Finally fixed the propagation of logging levels in 89817ae. Could you review the changes to evaluators.py and ema_logging.py? From my perspective, the function changes are done. Running on multiple nodes and logging are implemented successfully, but non-Python models are not (they might still work, but I didn't manage to get them running). My current plan is to spend next week writing some documentation, examples, and maybe a short tutorial, and doing some quick performance measurements. Then I will clean up all commits on a new branch and open a PR.
Very raw prototype of a multi-node evaluator that uses the MPIPoolExecutor from mpi4py.futures.
Ignore all the scripts, the interesting stuff is the new MPIEvaluator class in the evaluators.py file. It is currently tested with pure Python models, on multiple nodes of the DelftBlue HPC system.
For todo and discussion, see: #266
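For context, usage would presumably mirror the existing workbench evaluators; a hypothetical sketch (the import path and constructor arguments of this prototype's MPIEvaluator may differ):

```python
from ema_workbench import Model, RealParameter, ScalarOutcome, ema_logging
from ema_workbench.em_framework.evaluators import MPIEvaluator  # assumed location


def some_model(x=None):
    # Trivial pure-Python model used only to illustrate the call pattern.
    return {"y": x**2}


if __name__ == "__main__":
    ema_logging.log_to_stderr(ema_logging.INFO)

    model = Model("simple", function=some_model)
    model.uncertainties = [RealParameter("x", 0.0, 1.0)]
    model.outcomes = [ScalarOutcome("y")]

    # Mirrors how MultiprocessingEvaluator is used; the prototype's actual
    # MPIEvaluator API may differ.
    with MPIEvaluator(model) as evaluator:
        results = evaluator.perform_experiments(scenarios=100)
```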