
Prototype of MPIEvaluator for multi-node workloads #292

Closed
wants to merge 33 commits into master from multi-node-development

Conversation

EwoutH (Collaborator) commented Aug 31, 2023

Very raw prototype of a multi-node evaluator that uses the MPIPoolExecutor from mpi4py.futures.

Ignore all the scripts; the interesting stuff is the new MPIEvaluator class in the evaluators.py file.

It is currently tested with pure-Python models, on multiple nodes of the DelftBlue HPC system.

For todo and discussion, see: #266

EwoutH and others added 16 commits May 31, 2023 14:08
Migrate the Python script from mpi4py.MPI to mpi4py.futures. Note that mpi4py.futures requires a separate launcher script to create the MPI environment and launch the main script.

When using mpi4py.futures, the MPIPoolExecutor needs to be started in a separate process. This is because the MPIPoolExecutor must be created, and tasks submitted to it, within an `if __name__ == '__main__':` block to prevent recursive creation of MPIPoolExecutor instances.
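
A minimal sketch of that pattern, assuming a trivial pure-Python model function (`run_model` and the worker count are illustrative, not the actual scripts in this PR):

```python
# Minimal mpi4py.futures sketch; see the commit message above for context.
from mpi4py.futures import MPIPoolExecutor


def run_model(params):
    # Placeholder for a pure-Python model run.
    return params**2


if __name__ == "__main__":
    # The __main__ guard prevents workers that import this module from
    # recursively creating their own MPIPoolExecutor instances.
    with MPIPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(run_model, range(20)))
    print(results)
```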

Also, in the bash commands:
- Modify the bash code to remove or clear the python-test directory, if it already exists. This ensures a clean environment for every test run.
- Simplify the scp file transfer code, to one line.
Make sure that the scripts also work when the number of cores (in this case 10) is smaller than the number of jobs (20). In the EMAworkbench, when testing a million experiments, there won't always be a million cores available.
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood.

When you're launching your script with `mpirun`, it's spawning multiple MPI processes. If you use `MPIPoolExecutor`, it tries to spawn additional processes (or threads, in this case) for each MPI process. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, leading to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid Nested Parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (using send/receive) or use the `MPIPoolExecutor`.

2. **Using `MPIPoolExecutor` without `mpirun`**: The way the `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the `MPIPoolExecutor` will manage the creation and distribution of tasks across the MPI processes.

   Adjust your SLURM script:
   ```bash
   #!/bin/bash
   # Other directives...

   # You don't need mpirun here
   python my_model.py > py_test.log
   ```

3. **Adjust the Code**: In your code, since you're launching without `mpirun`, you don't have to worry about the world size or checking against the number of jobs. The `MPIPoolExecutor` will automatically manage the tasks for you.

4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, you can consider using `MPI.COMM_WORLD.Spawn()` to launch the worker processes instead of using the `MPIPoolExecutor`.
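
As a hedged illustration of option 4, here is a minimal self-spawning script; it uses `MPI.COMM_SELF.Spawn` (equivalent to `MPI.COMM_WORLD.Spawn` for a single-process parent), and the work-splitting logic is purely illustrative rather than anything from this PR:

```python
# spawn_workers.py: the parent spawns copies of itself as workers, scatters
# chunks of work to them, and gathers the results over the intercommunicator.
import sys

from mpi4py import MPI


def work(chunk):
    return [x**2 for x in chunk]


if __name__ == "__main__":
    parent = MPI.Comm.Get_parent()
    if parent == MPI.COMM_NULL:
        # Parent process: spawn 4 workers running this same script.
        comm = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=4)
        chunks = [list(range(i, 20, 4)) for i in range(4)]
        comm.scatter(chunks, root=MPI.ROOT)  # one chunk per worker
        results = comm.gather(None, root=MPI.ROOT)
        comm.Disconnect()
        print(results)
    else:
        # Worker process: receive a chunk, compute, send the result back.
        chunk = parent.scatter(None, root=0)
        parent.gather(work(chunk), root=0)
        parent.Disconnect()
```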

Lastly, always be cautious when working on an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Always test on a smaller subset of cores/nodes and make sure to monitor your jobs to ensure they're behaving as expected.
```python
callback(experiment, outcomes)


def run_experiment_mpi(packed_data):
```
Owner

You now instantiate an ExperimentRunner for each separate experiment. In principle, you should only have to instantiate this once by making it a global. Not sure about the performance overhead of this. Related, you now pack all models for each experiment; again, this might not be the most efficient. Using an initializer function can solve both problems (see MultiprocessingEvaluator and ema_multiprocessing.py).

Collaborator Author

I implemented the global ExperimentRunner in b8bf0e7. Is this what you had in mind?

Owner

Yes, this is now also consistent with how it is handled in multiprocessing.

Create some commands to use locally with WSL2 and Ubuntu
coveralls commented Sep 14, 2023

Coverage Status

coverage: 80.667% (-0.2%) from 80.893% when pulling 89817ae on multi-node-development into b76b487 on master.

EwoutH and others added 10 commits September 18, 2023 09:48

This commit introduces two primary architectural modifications to the MPIEvaluator class, aimed at improving efficiency when running experiments on HPC systems:

1. **Singleton ExperimentRunner**:
    Previously, the `ExperimentRunner` was instantiated for every individual experiment. This approach could introduce unnecessary overhead, especially when dealing with a large number of experiments. Instead, we've adopted a pattern where the `ExperimentRunner` is instantiated once and shared among all the worker processes in the MPI pool. This is achieved using an initializer function `mpi_initializer` which sets up a global `ExperimentRunner` for all the worker processes.

2. **Optimized Model Packing**:
    Before this change, all models were packed and sent with each experiment. This was potentially inefficient, especially when the size of the model objects was large. Now, we've altered the architecture to send only the model name with each experiment. Since the `ExperimentRunner` already has access to all models (being initialized once with all of them), it can easily fetch the necessary model using the provided model name.

The primary motivation behind these changes is to reduce the overhead related to object instantiation and data transfer, especially when running experiments on large-scale HPC systems with SLURM.
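
A minimal sketch of the two changes described above; the names `mpi_initializer`, `run_experiment_mpi`, and `ExperimentRunner` follow the commit description, but the bodies and the `evaluate` helper are illustrative rather than the workbench's actual code:

```python
from mpi4py.futures import MPIPoolExecutor

experiment_runner = None  # one shared runner per MPI worker process


def mpi_initializer(models):
    """Runs once in every worker: build the singleton ExperimentRunner."""
    global experiment_runner
    experiment_runner = ExperimentRunner(models)  # assumed constructor


def run_experiment_mpi(experiment):
    # Only the lightweight experiment (carrying its model *name*) crosses the
    # MPI boundary; the worker-local runner already holds all model objects.
    outcomes = experiment_runner.run_experiment(experiment)
    return experiment, outcomes


def evaluate(models, experiments):
    with MPIPoolExecutor(initializer=mpi_initializer, initargs=(models,)) as pool:
        return list(pool.map(run_experiment_mpi, experiments))
```
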
NetLogo 6.2.2 works, but 6.3.0 gives very strange errors pointing to a Java ASM incompatibility. Possibly an error in PyNetLogo. See NetLogo/NetLogo#2171
Identical to the current DelftBlue commands.txt, to be able to track changes
Rewrite for NetLogo.

Current problem: Extracting with tar -xzf NetLogo-6.2.2-64.tgz takes extremely long.
Update paths used in NetLogo.py for DelftBlue
If you only need it in one method of one class, just import it in that one method of that one class (and not in the __init__).

Makes sure mpi4py is only required when the MPIEvaluator is used. The challenging part will be conditionally importing MPI to establish the rank, depending on which evaluator is used.
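
A minimal sketch of that lazy-import idea, assuming an `MPIEvaluator.initialize` method (the method body here is illustrative, not the actual implementation):

```python
class MPIEvaluator:
    """Evaluator that only requires mpi4py when it is actually used."""

    def initialize(self):
        # Import inside the method so mpi4py remains an optional dependency.
        try:
            from mpi4py.futures import MPIPoolExecutor
        except ImportError as err:
            raise ImportError("MPIEvaluator requires mpi4py to be installed") from err
        self._executor = MPIPoolExecutor()
        return self
```
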
The logging statements in the run_experiment_mpi() function don't adhere to the model logger level and formatting. TODO: Research and fix this.
- Introduced logging level propagation to the `MPIEvaluator` to ensure consistent logging across MPI processes.
- Adjusted the `mpi_initializer` function to accept a logger level and configure the logging for each worker accordingly.
- Changed the logging statements within `MPIEvaluator` and `run_experiment_mpi` from DEBUG to INFO for better visibility.
- Removed a redundant TODO regarding logging in `run_experiment_mpi`.
- Modified the `log_to_stderr` method in `ema_logging.py` to allow for setting logging levels on all root module loggers.
- Simplified the logging setup in the `ema_model.py` script, and ensured proper propagation of logging levels.
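
A hedged sketch of how the logging level might be handed to each MPI worker through the initializer; the format string and names here are illustrative, not the workbench's actual logging setup:

```python
import logging

from mpi4py.futures import MPIPoolExecutor


def mpi_initializer(log_level):
    # Configure each worker's root logger so messages match the main process.
    logging.basicConfig(level=log_level, format="[MPI worker] %(levelname)s: %(message)s")


if __name__ == "__main__":
    with MPIPoolExecutor(initializer=mpi_initializer, initargs=(logging.INFO,)) as pool:
        ...  # submit experiments as usual; every worker now logs at INFO level
```
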
EwoutH (Collaborator, Author) commented Oct 27, 2023

@quaquel Finally fixed the propagation of logging levels in 89817ae. Could you review the changes to evaluators.py and ema_logging.py? Please especially check the set_root_logger_levels argument, and whether you find it necessary at all (or whether it can be default behaviour).

From my perspective, the functional changes are done. Running on multiple nodes and logging are implemented successfully, but non-Python models are not (they might still work, but I didn't manage to get them running).

My current plan is to spend next week writing some documentation, examples, and maybe a short tutorial, and doing some quick performance measurements. Then I will clean up all commits on a new branch and open a PR.

EwoutH (Collaborator, Author) commented Nov 15, 2023

Closing this PR, since it is superseded by #299.

Future discussion about the new MPIEvaluator is welcomed at #311!

EwoutH closed this Nov 15, 2023