
Feature: parallel inference without slurm #112

Open
cathalobrien opened this issue Jan 23, 2025 · 4 comments
Assignees: cathalobrien
Labels: enhancement (New feature or request)

Comments

@cathalobrien (Contributor)

Is your feature request related to a problem? Please describe.

Currently Anemoi Inference requires Slurm to run in parallel (see PR #108).

Slurm is used in three ways:

  • You must call `srun anemoi-inference run ...` to launch the parallel processes.
  • When running in parallel, Anemoi Inference reads the Slurm environment variables `SLURM_LOCALID` (to match a GPU to a process), `SLURM_PROCID` (to determine process 0), and `SLURM_NTASKS` (to determine the total number of parallel processes).
  • If the `MASTER_ADDR` environment variable is not set, Slurm's `scontrol` program is used to determine the master address (sketched below).
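
For reference, a minimal sketch of the Slurm-dependent setup described above (the variable names are illustrative rather than the actual anemoi-inference internals; the `scontrol show hostnames` fallback is the standard Slurm pattern):

```python
import os
import subprocess

# Rank and world-size discovery via Slurm environment variables.
local_rank = int(os.environ["SLURM_LOCALID"])  # maps this process to a GPU on its node
global_rank = int(os.environ["SLURM_PROCID"])  # rank 0 coordinates the group
world_size = int(os.environ["SLURM_NTASKS"])   # total number of parallel processes

master_addr = os.environ.get("MASTER_ADDR")
if master_addr is None:
    # Expand the job's node list with scontrol and take the
    # first hostname as the master address.
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    hostnames = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    master_addr = hostnames[0]
```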

Adding the option to run in parallel without Slurm on a single node would simplify debugging and make it possible to run on systems without Slurm installed (e.g. some cloud setups).

Describe the solution you'd like

Running in parallel without Slurm on a single node should be straightforward:

  • Instead of launching the parallel processes with `srun anemoi-inference run ...`, we could launch one anemoi-inference process as normal and have it spawn the required number of subprocesses.
  • We could pass the world size via the config and determine LOCALID/PROCID based on the pids of the spawned processes.
  • `MASTER_ADDR` can just be `localhost` when running on a single node (see the sketch after this list).

Describe alternatives you've considered

No response

Additional context

No response

Organisation

ECMWF

@cathalobrien cathalobrien added the enhancement New feature or request label Jan 23, 2025
@cathalobrien cathalobrien self-assigned this Jan 23, 2025
@rosinaderks

Additional context: At KNMI we are also interested in the option to run parallel inference without Slurm due to our cloud setup (AWS). In the coming weeks we will test Slurm and, if possible, the combination with AWS ParallelCluster. However, the option to run parallel inference without Slurm would be much more interesting and easier to implement for us.

@cathalobrien (Contributor, Author)

Thanks for the use case! I'll move this up the list of things to do.

@cathalobrien (Contributor, Author)

Hi @rosinaderks, we have a PR for this now at #121 :)

@rosinaderks

Thanks for the quick response and PR!

@cathalobrien cathalobrien moved this to Under Review in Anemoi-dev Feb 4, 2025
@cathalobrien cathalobrien changed the title Add the option to run parallel inference without slurm parallel inference without slurm Feb 4, 2025
@cathalobrien cathalobrien changed the title parallel inference without slurm Feature: parallel inference without slurm Feb 4, 2025
@cathalobrien cathalobrien removed this from Anemoi-dev Feb 6, 2025