Prototype of MPIEvaluator for multi-node workloads #292
Commits on May 31, 2023
-
Commit 88f0a7e
-
Migrate to mpi4py.futures, simplify bash script
Migrate the Python script from mpi4py.MPI to mpi4py.futures. Note that mpi4py.futures requires a separate launcher script to create the MPI environment and launch the main script. When using mpi4py.futures, the MPIPoolExecutor needs to be started in a separate process. This is because the MPIPoolExecutor must be created, and tasks submitted to it, within an `if __name__ == '__main__':` block to prevent recursive creation of MPIPoolExecutor instances.
Also, in the bash commands:
- Remove or clear the python-test directory if it already exists. This ensures a clean environment for every test run.
- Simplify the scp file transfer code to one line.
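The guard described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the script from the commit: `run_job`, `main`, and the `USE_MPI` environment variable are hypothetical names, and the standard-library `ThreadPoolExecutor` serves as a stand-in when MPI is not requested, so the sketch also runs on a machine without mpi4py.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_job(job_id):
    # Stand-in for a single model run (hypothetical workload).
    return job_id * job_id


def main(executor_cls, n_jobs=20):
    # The pool is created and fed tasks only inside this function,
    # which is reached only from the __main__ guard below.
    with executor_cls() as executor:
        return list(executor.map(run_job, range(n_jobs)))


if __name__ == "__main__":
    if os.environ.get("USE_MPI") == "1":  # hypothetical opt-in switch
        # Imported lazily so mpi4py is only needed when MPI is requested.
        from mpi4py.futures import MPIPoolExecutor as Executor
    else:
        # Local stand-in that offers the same map/submit interface.
        Executor = ThreadPoolExecutor
    print(main(Executor))
```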
Commit e11493a
Commits on Jun 1, 2023
-
scripts: Merge model and launcher, ensure they work when cores < jobs
Make sure that the scripts also work when the number of cores (in this case 10) is smaller than the number of jobs (20). In the EMAworkbench, when testing a million experiments, there won't always be a million cores available.
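As an illustration of why fewer cores than jobs need not be a problem for a pool-based design: a pool with 10 workers queues the 20 jobs and hands each one to the next free worker. The sketch below uses the standard-library `ThreadPoolExecutor` as a stand-in for `MPIPoolExecutor` (an assumption, made so that it runs without MPI); `run_experiment` and `run_all` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_experiment(job_id):
    # Report which worker handled the job, alongside the job id.
    return job_id, threading.current_thread().name


def run_all(n_workers=10, n_jobs=20):
    # More jobs than workers: the pool queues the surplus jobs.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in submission order.
        return list(pool.map(run_experiment, range(n_jobs)))


results = run_all()
workers_used = {name for _, name in results}
print(f"{len(results)} jobs completed on {len(workers_used)} worker(s)")
```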
Commit 1becad4
Commits on Jun 7, 2023
-
Commit eae8f31
Commits on Aug 29, 2023
-
Commit 63ce62c
-
SLURM script: Let MPIPoolExecutor manage processes
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood. When you launch your script with `mpirun`, it spawns multiple MPI processes. If you then use `MPIPoolExecutor`, it tries to spawn additional worker processes for each MPI process. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, leading to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid Nested Parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (using send/receive) or use the `MPIPoolExecutor`.
2. **Using `MPIPoolExecutor` without `mpirun`**: The way the `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the `MPIPoolExecutor` will manage the creation and distribution of tasks across the MPI processes. Adjust your SLURM script:

   ```bash
   #!/bin/bash
   # Other directives...

   # You don't need mpirun here
   python my_model.py > py_test.log
   ```
3. **Adjust the Code**: In your code, since you're launching without `mpirun`, you don't have to worry about the world size or checking against the number of jobs. The `MPIPoolExecutor` will automatically manage the tasks for you.
4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, you can consider using `MPI.COMM_WORLD.Spawn()` to launch the worker processes instead of using the `MPIPoolExecutor`.

Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Always test on a smaller subset of cores/nodes and monitor your jobs to ensure they're behaving as expected.
Commit 1a94011
Commits on Aug 30, 2023
-
Commit 0a61510
-
Commit 09c3f02
Commits on Aug 31, 2023
-
Commit 597de6a
-
Commit 00d2c1f
-
Commit c72fc6c
-
Commit 469ffe6
-
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Commit 6fb15e3
-
Commit 806dbb8
-
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Commit 1c6289f
Commits on Sep 5, 2023
-
Commit 8abf05c
Commits on Sep 14, 2023
-
Commit 2f9b934
-
Create some commands to use locally using WSL2 with Ubuntu
Commit 247c576
Commits on Sep 18, 2023
-
Refactor MPIEvaluator to optimize model handling and runner instantiation
This commit introduces two primary architectural modifications to the MPIEvaluator class, aimed at improving efficiency when running experiments on HPC systems:
1. **Singleton ExperimentRunner**: Previously, the `ExperimentRunner` was instantiated for every individual experiment. This approach could introduce unnecessary overhead, especially when dealing with a large number of experiments. Instead, we've adopted a pattern where the `ExperimentRunner` is instantiated once and shared among all the worker processes in the MPI pool. This is achieved using an initializer function `mpi_initializer` which sets up a global `ExperimentRunner` for all the worker processes.
2. **Optimized Model Packing**: Before this change, all models were packed and sent with each experiment. This was potentially inefficient, especially when the size of the model objects was large. Now, we've altered the architecture to send only the model name with each experiment. Since the `ExperimentRunner` already has access to all models (being initialized once with all of them), it can easily fetch the necessary model using the provided model name.
The primary motivation behind these changes is to reduce the overhead related to object instantiation and data transfer, especially when running experiments on large-scale HPC systems with SLURM.
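The two changes described in this commit message can be sketched together as below. This is an illustration under stated assumptions, not the workbench's code: `ToyRunner`, `pool_initializer`, `run_experiment`, `evaluate`, and the two toy models are hypothetical, and `ThreadPoolExecutor` stands in for `MPIPoolExecutor`, which accepts the same `initializer` and `initargs` arguments.

```python
from concurrent.futures import ThreadPoolExecutor

_runner = None  # set by the initializer when a worker starts


class ToyRunner:
    """Hypothetical stand-in for the workbench's ExperimentRunner."""

    def __init__(self, models):
        self.models = dict(models)

    def run(self, model_name, experiment):
        # Look the model up by name instead of receiving it with each task.
        return self.models[model_name](**experiment)


def double(x):
    return 2 * x


def square(x):
    return x * x


def pool_initializer(models):
    # Runs when a worker starts: build the runner once, not per experiment.
    global _runner
    _runner = ToyRunner(models)


def run_experiment(packed):
    # Each task carries only the model name and the experiment parameters.
    model_name, experiment = packed
    return _runner.run(model_name, experiment)


def evaluate(executor_cls, models, experiments):
    with executor_cls(initializer=pool_initializer, initargs=(models,)) as pool:
        return list(pool.map(run_experiment, experiments))


MODELS = {"double": double, "square": square}
results = evaluate(
    ThreadPoolExecutor, MODELS, [("double", {"x": 3}), ("square", {"x": 4})]
)
print(results)
```

The models travel to each worker once through `initargs`, so the per-experiment payload shrinks to a name plus parameters.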
Commit b8bf0e7
Commits on Sep 21, 2023
-
Scripts: Getting NetLogo working on WSL
6.2.2 works; 6.3.0 gives very strange errors pointing to a Java ASM incompatibility. Possibly an error in PyNetLogo. See NetLogo/NetLogo#2171
Commit c09deda
-
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Commit bd75003
-
Create DelftBlue commands netlogo.txt
Identical to the current DelftBlue commands.txt, to be able to track changes
Commit 85b46fb
-
Update DelftBlue commands netlogo.txt
Rewrite for NetLogo. Current problem: extracting with `tar -xzf NetLogo-6.2.2-64.tgz` takes an extremely long time.
Commit 49e02f3
-
Commit bae7a77
-
Commit b17dd76
-
Commit b0f9eaa
-
Commit 23fb912
-
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Commit 9bbecc6
Commits on Oct 9, 2023
-
evaluators.py: Correctly do conditional import
If you only need it in one method of one class, just import it in that one method of that one class (and not in the `__init__`). This makes sure mpi4py is only required when the MPIEvaluator is used. The challenging part will be conditionally importing MPI to establish the rank based on which evaluator is used.
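A method-level import of this kind might look like the sketch below. `LazyMPIEvaluator` and its attributes are hypothetical and only illustrate the pattern; they are not the evaluator from this pull request.

```python
class LazyMPIEvaluator:
    """Hypothetical evaluator that needs mpi4py only once it is used."""

    def __init__(self, n_processes=None):
        # No mpi4py import here, so constructing the object (and importing
        # the module that defines it) works without mpi4py installed.
        self.n_processes = n_processes
        self.pool = None

    def initialize(self):
        try:
            # Deferred import: mpi4py is required only from this point on.
            from mpi4py.futures import MPIPoolExecutor
        except ImportError as exc:
            raise ImportError(
                "mpi4py is required to use this evaluator"
            ) from exc
        self.pool = MPIPoolExecutor(max_workers=self.n_processes)
        return self


evaluator = LazyMPIEvaluator(n_processes=4)
print(evaluator.pool)
```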
Commit 7960054
Commits on Oct 13, 2023
-
Commit 33165bc
-
evaluators.py: Add logging statements to MPIEvaluator
The logging statements in the run_experiment_mpi() function don't adhere to the model logger level and formatting. TODO: Research and fix this.
Commit 1c986e6
Commits on Oct 27, 2023
-
Commit e8a11f0
-
Fix propagation of logging levels in MPIEvaluator and related modules
- Introduced logging level propagation to the `MPIEvaluator` to ensure consistent logging across MPI processes.
- Adjusted the `mpi_initializer` function to accept a logger level and configure the logging for each worker accordingly.
- Changed the logging statements within `MPIEvaluator` and `run_experiment_mpi` from DEBUG to INFO for better visibility.
- Removed a redundant TODO regarding logging in `run_experiment_mpi`.
- Modified the `log_to_stderr` method in `ema_logging.py` to allow for setting logging levels on all root module loggers.
- Simplified the logging setup in the `ema_model.py` script, and ensured proper propagation of logging levels.
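One way to picture the level propagation described above: the parent reads its own effective log level and passes it to each worker through the pool initializer. The sketch below is an assumption-laden illustration; the logger name `sketch.worker` and the functions `worker_initializer` and `pool_kwargs` are hypothetical and do not reproduce the workbench's `mpi_initializer`.

```python
import logging

LOGGER_NAME = "sketch.worker"  # hypothetical logger name


def worker_initializer(logger_level):
    # Intended to run once in each worker: apply the parent's level there.
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(logger_level)
    if not logger.handlers:
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(
            logging.Formatter("[%(processName)s/%(levelname)s] %(message)s")
        )
        logger.addHandler(handler)
    return logger


def pool_kwargs():
    # Capture the parent's effective level and ship it via initargs.
    level = logging.getLogger(LOGGER_NAME).getEffectiveLevel()
    return {"initializer": worker_initializer, "initargs": (level,)}


logging.getLogger(LOGGER_NAME).setLevel(logging.INFO)
kwargs = pool_kwargs()
print(kwargs["initargs"])
```

The resulting dictionary could then be passed to a pool constructor that accepts `initializer` and `initargs`, for example `MPIPoolExecutor(**kwargs)`.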
Commit 89817ae