
Slurm main process check kills mtt export in a Slurm environment #422

Open
frostedoyster opened this issue Dec 10, 2024 · 4 comments

@frostedoyster
Collaborator

No description provided.

@frostedoyster frostedoyster self-assigned this Dec 10, 2024
@frostedoyster frostedoyster added Priority: Medium Important issues to address after high priority. Infrastructure: Miscellaneous General infrastructure issues labels Dec 10, 2024
@DavideTisi
Contributor

I'm adding a description here for general reference.

I tried to run mtt export best_model.ckpt -o best_model.pt on a compute node of Kuma and got this error:

(pet-mad-venv) [tisi@kh002 training_paolo_parameters]$ mtt export best_model.ckpt -o best_model.pt
Traceback (most recent call last):
  File "/work/cosmo/bigi/pet-mad-venv/bin/mtt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/work/cosmo/bigi/pet-mad-venv/lib/python3.11/site-packages/metatrain/__main__.py", line 92, in main
    with setup_logging(logger, log_file=log_file, level=level):
  File "/ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/python-3.11.7-wpgsyqek7spdydbmic66srcfb3v7kzoi/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/work/cosmo/bigi/pet-mad-venv/lib/python3.11/site-packages/metatrain/utils/logging.py", line 231, in setup_logging
    if not is_main_process():
           ^^^^^^^^^^^^^^^^^
  File "/work/cosmo/bigi/pet-mad-venv/lib/python3.11/site-packages/metatrain/utils/distributed/logging.py", line 6, in is_main_process
    return is_slurm_main_process()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/cosmo/bigi/pet-mad-venv/lib/python3.11/site-packages/metatrain/utils/distributed/slurm.py", line 11, in is_slurm_main_process
    return os.environ["SLURM_PROCID"] == "0"
           ~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'SLURM_PROCID'
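
For context, the traceback shows that is_slurm_main_process reads SLURM_PROCID with a plain dict-style lookup, so any shell where Slurm did not export that variable raises a KeyError. A minimal reproduction of the failure mode outside of metatrain (the function body is copied from the traceback; the surrounding script is only illustrative):

import os

# Reproduces the failing line from metatrain/utils/distributed/slurm.py
# shown in the traceback above: a direct lookup of SLURM_PROCID.
def is_slurm_main_process() -> bool:
    return os.environ["SLURM_PROCID"] == "0"

try:
    print("main process:", is_slurm_main_process())
except KeyError:
    # Hit in a shell where Slurm did not export SLURM_PROCID,
    # e.g. after salloc followed by a plain ssh to the allocated node.
    print("SLURM_PROCID is not set in this environment")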

@tulga-rdn
Collaborator

Hi, what's the fix for this? 😅

@DavideTisi
Contributor

Only God knows why this error appears.
What happened is that I did salloc and then ssh'd into the node.
Doing this triggers the error.
Submitting a job via sbatch or connecting interactively via Sinteract does not trigger the error.
As I said, only God knows why.

TL;DR: either submit the job via sbatch or use Sinteract.
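
To see which of the two situations a given shell is in, one can list the Slurm variables visible to the process (a quick diagnostic sketch, not part of metatrain):

import os

# Diagnostic sketch: print all SLURM_* variables visible to this shell.
# Under sbatch or Sinteract the Slurm environment (including SLURM_PROCID)
# is normally present; after salloc + a plain ssh to the node it typically is not.
slurm_vars = {k: v for k, v in os.environ.items() if k.startswith("SLURM_")}
for name, value in sorted(slurm_vars.items()):
    print(f"{name}={value}")
print("SLURM_PROCID present:", "SLURM_PROCID" in os.environ)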

@tulga-rdn
Collaborator

I think salloc results in an environment with no SLURM_PROCID.

I just commented out the line that checks if the environment is SLURM or not 😅.
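
A softer alternative to deleting the check (just a sketch of one possible fallback, not what metatrain currently does) would be to treat the process as the main one whenever SLURM_PROCID is unset:

import os

def is_slurm_main_process() -> bool:
    # Fallback sketch: if Slurm did not export SLURM_PROCID (e.g. after
    # salloc + ssh), assume a single-process run and report this process
    # as the main one instead of raising KeyError.
    return os.environ.get("SLURM_PROCID", "0") == "0"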

@tulga-rdn tulga-rdn mentioned this issue Dec 19, 2024