Prototype of MPIEvaluator for multi-node workloads #292

Closed
wants to merge 33 commits into from

Commits on May 31, 2023

  1. 88f0a7e
  2. Migrate to mpi4py.futures, simplify bash script

    Migrate the Python script from mpi4py.MPI to mpi4py.futures. Note that mpi4py.futures requires a separate launcher script to create the MPI environment and launch the main script.
    
    When using mpi4py.futures, the MPIPoolExecutor needs to be started in a separate process. The MPIPoolExecutor must be created, and tasks submitted to it, inside an if __name__ == '__main__': block; otherwise worker processes that import the script would recursively create their own MPIPoolExecutor instances.
    
    Also, in the bash commands:
    - Remove or clear the python-test directory if it already exists, so that every test run starts from a clean environment.
    - Simplify the scp file transfer to a single line.
    EwoutH committed May 31, 2023
    e11493a
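
The `if __name__ == '__main__':` requirement described in the commit above can be sketched as follows. This is a minimal illustration, not the script from this PR: `simulate` and `run_jobs` are hypothetical names, and the standard-library `ThreadPoolExecutor` stands in for `mpi4py.futures.MPIPoolExecutor` so that the sketch runs without an MPI installation. Both follow the `concurrent.futures` executor interface.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(job_id):
    """Hypothetical stand-in for one model run."""
    return job_id * job_id


def run_jobs(executor_factory, n_jobs):
    """Create the pool, submit every job, and return results in job order."""
    with executor_factory() as executor:
        return list(executor.map(simulate, range(n_jobs)))


if __name__ == "__main__":
    # Only the launching process reaches this block. A worker that merely
    # imports this module never creates a pool of its own.
    # Assumption: on a cluster the factory would be
    # mpi4py.futures.MPIPoolExecutor instead of the thread pool used here.
    print(run_jobs(lambda: ThreadPoolExecutor(max_workers=4), n_jobs=8))
```

Keeping `simulate` at module level matters for a process-based pool, because workers have to import the function by name.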

Commits on Jun 1, 2023

  1. scripts: Merge model and launcher, ensure they work when cores < jobs

    Make sure that the scripts also work when the number of cores (in this case 10) is smaller than the number of jobs (20). In the EMAworkbench, when running a million experiments, a million cores won't always be available.
    EwoutH committed Jun 1, 2023
    1becad4
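
A pool-based executor handles the "fewer cores than jobs" case by queueing: each worker picks up the next job as soon as it finishes the previous one. The sketch below shows this with 10 workers and 20 jobs. It is an assumption-laden illustration using the standard-library `ThreadPoolExecutor` in place of an MPI pool, and `run_job` and `run_with_fewer_workers` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_job(job_id):
    """Hypothetical job: report which worker handled which job."""
    return job_id, threading.current_thread().name


def run_with_fewer_workers(n_workers, n_jobs):
    """Run n_jobs on at most n_workers workers; extra jobs wait in a queue."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_job, range(n_jobs)))
    job_ids = [job_id for job_id, _ in results]
    workers_used = {worker for _, worker in results}
    return job_ids, workers_used


if __name__ == "__main__":
    job_ids, workers_used = run_with_fewer_workers(n_workers=10, n_jobs=20)
    print(f"{len(job_ids)} jobs completed by {len(workers_used)} workers")
```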

Commits on Jun 7, 2023

  1. eae8f31

Commits on Aug 29, 2023

  1. 63ce62c
  2. SLURM script: Let MPIPoolExecutor manage processes

    The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood.
    
    When you launch your script with `mpirun`, it spawns multiple MPI processes. If each of those processes then creates an `MPIPoolExecutor`, every one of them tries to spawn additional worker processes. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, leading to the "All nodes which are allocated for this job are already filled" error.
    
    Here's how to address this:
    
    1. **Avoid Nested Parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (using send/receive) or use the `MPIPoolExecutor`.
    
    2. **Using `MPIPoolExecutor` without `mpirun`**: The way the `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the `MPIPoolExecutor` will manage the creation and distribution of tasks across the MPI processes.
    
       Adjust your SLURM script:
       ```bash
       #!/bin/bash
       # Other directives...
    
       # You don't need mpirun here
       python my_model.py > py_test.log
       ```
    
    3. **Adjust the Code**: In your code, since you're launching without `mpirun`, you don't have to worry about the world size or checking against the number of jobs. The `MPIPoolExecutor` will automatically manage the tasks for you.
    
    4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, you can consider using `MPI.COMM_WORLD.Spawn()` to launch the worker processes instead of using the `MPIPoolExecutor`.
    
    Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Test on a small subset of cores/nodes first, and monitor your jobs to make sure they behave as expected.
    EwoutH committed Aug 29, 2023
    1a94011

Commits on Aug 30, 2023

  1. script: Use mpiexec

    EwoutH committed Aug 30, 2023
    0a61510
  2. 09c3f02

Commits on Aug 31, 2023

  1. 597de6a
  2. MPIEvaluator: pass model

    EwoutH committed Aug 31, 2023
    00d2c1f
  3. MPIEvaluator: Model passing

    EwoutH committed Aug 31, 2023
    c72fc6c
  4. MPIEvaluator: Model passing

    EwoutH committed Aug 31, 2023
    469ffe6
  5. 6fb15e3
  6. 806dbb8
  7. 1c6289f

Commits on Sep 5, 2023

  1. 8abf05c

Commits on Sep 14, 2023

  1. Update DelftBlue commands.txt

    EwoutH committed Sep 14, 2023
    2f9b934
  2. Create WSL commands.txt

    Create some commands to use locally using WSL2 with Ubuntu
    EwoutH committed Sep 14, 2023
    247c576

Commits on Sep 18, 2023

  1. Refactor MPIEvaluator to optimize model handling and runner instantiation
    
    This commit introduces two primary architectural modifications to the MPIEvaluator class, aimed at improving efficiency when running experiments on HPC systems:
    
    1. **Singleton ExperimentRunner**:
        Previously, the `ExperimentRunner` was instantiated for every individual experiment. This could introduce unnecessary overhead, especially with a large number of experiments. Instead, the `ExperimentRunner` is now instantiated once per worker process in the MPI pool and reused for every experiment that worker handles. This is achieved with an initializer function, `mpi_initializer`, which sets up a global `ExperimentRunner` in each worker process.
    
    2. **Optimized Model Packing**:
        Before this change, all models were packed and sent with each experiment. This was potentially inefficient, especially when the size of the model objects was large. Now, we've altered the architecture to send only the model name with each experiment. Since the `ExperimentRunner` already has access to all models (being initialized once with all of them), it can easily fetch the necessary model using the provided model name.
    
    The primary motivation behind these changes is to reduce the overhead related to object instantiation and data transfer, especially when running experiments on large-scale HPC systems with SLURM.
    EwoutH committed Sep 18, 2023
    b8bf0e7
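
The two changes in the commit above (one runner per worker, and sending only the model name with each experiment) can be sketched roughly as follows. This is not the EMAworkbench implementation: `ExperimentRunner`, `init_worker`, `run_experiment`, `evaluate`, and `linear_model` are hypothetical stand-ins, and the standard-library `ThreadPoolExecutor` is used so the sketch runs without MPI. To my understanding `MPIPoolExecutor` accepts `initializer` and `initargs` in the same way, but treat that as an assumption to verify against the mpi4py documentation.

```python
from concurrent.futures import ThreadPoolExecutor


class ExperimentRunner:
    """Hypothetical stand-in: holds all models and runs one by name."""

    def __init__(self, models):
        self.models = dict(models)

    def run(self, model_name, experiment):
        return self.models[model_name](**experiment)


_runner = None  # one runner per worker, created by the initializer


def init_worker(models):
    """Called once when a worker starts; builds the runner a single time."""
    global _runner
    _runner = ExperimentRunner(models)


def run_experiment(task):
    """Each task carries only the model name and the experiment inputs."""
    model_name, experiment = task
    return _runner.run(model_name, experiment)


def linear_model(x, slope=2.0):
    """Toy model used for the demonstration."""
    return slope * x


def evaluate(models, tasks, executor_cls=ThreadPoolExecutor, max_workers=2):
    """Run all tasks on a pool whose workers were initialized with the models."""
    with executor_cls(
        max_workers=max_workers, initializer=init_worker, initargs=(models,)
    ) as pool:
        return list(pool.map(run_experiment, tasks))


if __name__ == "__main__":
    tasks = [("linear", {"x": 1.0}), ("linear", {"x": 3.0})]
    print(evaluate({"linear": linear_model}, tasks))
```

The models travel to each worker once, through `initargs`; after that only the short `(model_name, experiment)` tuples are sent per experiment.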

Commits on Sep 21, 2023

  1. Scripts: Getting NetLogo working on WSL

    NetLogo 6.2.2 works; 6.3.0 gives very strange errors pointing to a Java ASM incompatibility. Possibly an error in PyNetLogo. See NetLogo/NetLogo#2171
    EwoutH committed Sep 21, 2023
    c09deda
  2. bd75003
  3. Create DelftBlue commands netlogo.txt

    Identical to the current DelftBlue commands.txt, to be able to track changes
    EwoutH committed Sep 21, 2023
    85b46fb
  4. Update DelftBlue commands netlogo.txt

    Rewrite for NetLogo.
    
    Current problem: extracting with tar -xzf NetLogo-6.2.2-64.tgz takes an extremely long time.
    EwoutH committed Sep 21, 2023
    49e02f3
  5. bae7a77
  6. b17dd76
  7. Update model.py

    EwoutH committed Sep 21, 2023
    b0f9eaa
  8. Update netlogo.py

    Update paths used in NetLogo.py for DelftBlue
    EwoutH committed Sep 21, 2023
    23fb912
  9. 9bbecc6

Commits on Oct 9, 2023

  1. evaluators.py: Correctly do conditional import

    If you only need it in one method of one class, just import it in that one method of that one class (and not in the __init__).
    
    This makes sure mpi4py is only required when the MPIEvaluator is used. The challenging part will be conditionally importing MPI to establish the rank, depending on which evaluator is used.
    EwoutH committed Oct 9, 2023
    7960054
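
The conditional import described in the commit above can be sketched like this. The class names `SequentialEvaluator` and `LazyMPIEvaluator` are hypothetical and do not mirror the real `evaluators.py`; the only real name is `mpi4py.futures.MPIPoolExecutor`. The point is that the import sits inside the one method that needs it, so the module itself loads without mpi4py.

```python
class SequentialEvaluator:
    """Hypothetical evaluator that never touches mpi4py."""

    def evaluate(self, func, experiments):
        return [func(experiment) for experiment in experiments]


class LazyMPIEvaluator:
    """Hypothetical evaluator that imports mpi4py only when initialized."""

    def __init__(self):
        # No mpi4py import here, so constructing the object is always safe.
        self._pool_cls = None

    def initialize(self):
        try:
            # Imported here, in the single method that needs it.
            from mpi4py.futures import MPIPoolExecutor
        except ImportError as exc:
            raise ImportError(
                "mpi4py is required only when the MPI evaluator is used"
            ) from exc
        self._pool_cls = MPIPoolExecutor
        return self


if __name__ == "__main__":
    # Works on a machine without mpi4py installed.
    print(SequentialEvaluator().evaluate(lambda x: x + 1, [1, 2, 3]))
```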

Commits on Oct 13, 2023

  1. 33165bc
  2. evaluators.py: Add logging statements to MPIEvaluator

    The logging statements in the run_experiment_mpi() function don't adhere to the model logger level and formatting. TODO: Research and fix this.
    EwoutH committed Oct 13, 2023
    1c986e6

Commits on Oct 27, 2023

  1. e8a11f0
  2. Fix propagation of logging levels in MPIEvaluator and related modules

    - Introduced logging level propagation to the `MPIEvaluator` to ensure consistent logging across MPI processes.
    - Adjusted the `mpi_initializer` function to accept a logger level and configure the logging for each worker accordingly.
    - Changed the logging statements within `MPIEvaluator` and `run_experiment_mpi` from DEBUG to INFO for better visibility.
    - Removed a redundant TODO regarding logging in `run_experiment_mpi`.
    - Modified the `log_to_stderr` method in `ema_logging.py` to allow for setting logging levels on all root module loggers.
    - Simplified the logging setup in the `ema_model.py` script, and ensured proper propagation of logging levels.
    EwoutH committed Oct 27, 2023
    89817ae
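
Propagating a logging level to workers, as the commit above describes, could look roughly like the sketch below. It is an illustration under assumptions: `LOGGER_NAME`, `init_worker_logging`, and `run_experiment` are hypothetical names rather than the functions in `ema_logging.py`, and the parent process would be expected to pass its own level to the pool through `initargs=(level,)`.

```python
import logging

LOGGER_NAME = "ema_sketch"  # hypothetical logger name for this sketch


def init_worker_logging(level):
    """Configure the worker's logger with the level chosen by the parent."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(level)
    if not logger.handlers:
        # Attach a handler only once, even if the initializer runs again.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(processName)s/%(levelname)s] %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_experiment(x):
    """Hypothetical experiment that logs at INFO level before computing."""
    logging.getLogger(LOGGER_NAME).info("running experiment %s", x)
    return x * 2


if __name__ == "__main__":
    init_worker_logging(logging.INFO)
    print(run_experiment(3))
```

Because each worker process has its own logging configuration, setting the level only in the parent is not enough; the initializer repeats the setup in every worker.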