Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quick slurm fix #435

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

quick slurm fix #435

wants to merge 1 commit into from

Conversation

tulga-rdn
Copy link
Collaborator

@tulga-rdn tulga-rdn commented Dec 19, 2024

This PR is a quick fix for the robustness of the SLURM environment check. As @DavideTisi and I observed in #422 , when running the training after SSHing into a node reserved by salloc, the code throws an error. I think this occurs because reserving resources, SSHing to the node and running a job through a terminal session on that node is not a SLURM-managed process. So, this process has a job ID (since we got the node from some job), but not a process ID, so it's basically the same as running the training on a local workstation.

Considering this, I changed the is_slurm function, which will now return True only if the environment has both a job ID and a process ID.

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--435.org.readthedocs.build/en/435/

@tulga-rdn
Copy link
Collaborator Author

Oh, and can I have some more permissions please? I didn't have enough permissions to create a branch, I think

@DavideTisi
Copy link
Contributor

so for me the PR is fine, for some reason I cannot approve the PR. Anyway we can wait for what @frostedoyster has to say, and also he probably can add you in the repo

Copy link
Contributor

@DavideTisi DavideTisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok after refreshing the page he gave me the option to do the review. my comment are the same as before, we can wait for filippo's approval

@frostedoyster
Copy link
Collaborator

Invited to the repo, thanks for the fix!

@frostedoyster
Copy link
Collaborator

Tests failing due to this library which is a dependency of spherical
https://pypi.org/project/spinsfast/#history

@moble
Copy link

moble commented Dec 20, 2024

TL;DR: Either

  1. bump 3.9 to 3.10 here, or
  2. change this line to sudo apt-get install -y zsh fftw-dev.

The problem is that spinsfast should still function on various older versions of python and numpy, so I don't want to restrict the compatible versions, but I can no longer generate wheels (pre-built binaries) for anything older than python 3.10 — just because of the various combinations of github runners, numpy compatibility, cibuildwheel support, etc.  I had to update the code itself a couple days ago because of numpy nonsense, so the newest version just doesn't have wheels for older versions.

I would guess that this is the first time pip is actually trying to build spinsfast for you, rather than just downloading the wheels.  You can see from the error message that it can't find the FFTW library, which just means you should install it.  There's a line in your github action where you do sudo apt-get install -y zsh; just add fftw-dev to the end of that line.  (You might want to only do this if matrix.python-version=="3.9", but it wouldn't do any harm for the other ubuntu case, and would make it less likely to break next time I have to bump the lower version.)

The other possibility is to bump your minimum python version to 3.10.  Numpy itself dropped support for python <3.10 back on Apr 5 of this year.  And I can still build wheels for 3.10, so this problem shouldn't crop up.  But I don't know if that's an option for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants