Is your feature request related to a problem? Please describe.
Currently Anemoi Inference requires Slurm to run in parallel (see PR #108).
Slurm is used in three ways:
You must call 'srun anemoi-inference run ...' to launch the parallel processes.
When running in parallel, Anemoi Inference reads the Slurm env vars 'SLURM_LOCALID' (to match a GPU to a process), 'SLURM_PROCID' (to determine process 0), and 'SLURM_NTASKS' (to determine the total number of parallel processes).
If a 'MASTER_ADDR' env var is not set, Slurm's 'scontrol' program is used to determine the master address.
Adding the option to run in parallel without Slurm on a single node would simplify debugging and make it possible to run on systems without Slurm installed (e.g. some cloud setups).
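For reference, the Slurm dependency described above can be sketched roughly as follows. This is a minimal illustration, not the actual Anemoi Inference code: the names `ParallelEnv`, `read_slurm_env` and `first_host_from_scontrol` are hypothetical, the fallback defaults are assumptions, and using `scontrol show hostnames` on `SLURM_JOB_NODELIST` is only one plausible way of resolving the master address.

```python
import os
import subprocess
from typing import Mapping, NamedTuple, Optional


class ParallelEnv(NamedTuple):
    # Hypothetical container, not part of anemoi-inference.
    local_rank: int             # from SLURM_LOCALID, used to pick a GPU
    global_rank: int            # from SLURM_PROCID, rank 0 is "process 0"
    world_size: int             # from SLURM_NTASKS
    master_addr: Optional[str]  # from MASTER_ADDR, else resolved via scontrol


def first_host_from_scontrol(nodelist: str) -> str:
    # Assumption: the first hostname in the expanded node list is the master.
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[0].strip()


def read_slurm_env(env: Optional[Mapping[str, str]] = None) -> ParallelEnv:
    env = os.environ if env is None else env
    # Assumed defaults: a single process when the variables are absent.
    local_rank = int(env.get("SLURM_LOCALID", 0))
    global_rank = int(env.get("SLURM_PROCID", 0))
    world_size = int(env.get("SLURM_NTASKS", 1))
    master_addr = env.get("MASTER_ADDR")
    # Only shell out to scontrol when MASTER_ADDR is missing and we are in a job.
    if master_addr is None and "SLURM_JOB_NODELIST" in env:
        master_addr = first_host_from_scontrol(env["SLURM_JOB_NODELIST"])
    return ParallelEnv(local_rank, global_rank, world_size, master_addr)
```

Every value here comes from Slurm, which is why none of it is available when the processes are not launched with 'srun'.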
Describe the solution you'd like
Running in parallel without Slurm on a single node should be straightforward:
Instead of launching the parallel processes with 'srun anemoi-inference run ...', we could launch one anemoi-inference process as normal and have it spawn the required number of subprocesses.
We could pass the world size via the config and determine LOCALID/PROCID based on the PIDs of the spawned processes.
MASTER_ADDR can just be 'localhost' when running on a single node.
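The steps above could look something like the sketch below. It is only an illustration under several assumptions: `build_worker_env` and `launch_workers` are hypothetical names, the workers are assumed to read `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` from their environment, and the port number is arbitrary. Note that the sketch assigns ranks from the spawn order rather than from the PIDs, since the launcher already knows the index of each process it starts.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional, Sequence


def build_worker_env(
    rank: int,
    world_size: int,
    base_env: Optional[Mapping[str, str]] = None,
    master_port: int = 29500,  # arbitrary port, an assumption
) -> Dict[str, str]:
    # Copy the parent environment and add the values Slurm would otherwise provide.
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "MASTER_ADDR": "localhost",  # single node, so localhost is enough
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),  # on one node, local and global rank coincide
        }
    )
    return env


def launch_workers(command: Sequence[str], world_size: int) -> List[int]:
    # Start one subprocess per rank, then wait for all of them to finish.
    procs = [
        subprocess.Popen(list(command), env=build_worker_env(rank, world_size))
        for rank in range(world_size)
    ]
    return [proc.wait() for proc in procs]
```

A launcher could then call, for example, `launch_workers([sys.executable, "worker.py"], world_size=4)`, where `worker.py` is a placeholder for whatever entry point each process should run, and treat any non-zero return code as a failed rank.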
Describe alternatives you've considered
No response
Additional context
No response
Organisation
ECMWF
Additional context: At KNMI we are also interested in the option to run parallel inference without Slurm because of our cloud setup (AWS). In the coming weeks we will test Slurm and, if possible, its combination with AWS ParallelCluster. However, an option to run parallel inference without Slurm would be much more interesting and easier for us to implement.