When a job approaches its Slurm time limit, we want to exit gracefully by saving a checkpoint before the job is killed. With PyTorch Lightning this was easy: we sent a signal from Slurm via `#SBATCH --signal=SIGUSR2@600` and caught it in the training script. This no longer works easily because PyTorch's `torchrun` captures those signals itself.
I suggest using `#SBATCH --signal=B:SIGUSR2@600` instead, so the signal is delivered to the sbatch script rather than the job step; the sbatch script can then write a trigger that the run watches for to exit gracefully, as sketched below.
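A minimal sketch of what this could look like on the sbatch side. The `B:` prefix delivers SIGUSR2 to the batch shell only, so the script traps it and drops a trigger file that the training code can poll for. The file name `.exit_requested`, the script name `train.py`, and the launcher arguments are assumptions for illustration, not part of the actual setup:

```bash
#!/bin/bash
# Deliver SIGUSR2 to this batch script (not to the job steps) 600 s before the time limit.
#SBATCH --signal=B:SIGUSR2@600
#SBATCH --time=04:00:00

# On SIGUSR2, create a trigger file that the training script polls for
# to decide when to save a checkpoint and shut down cleanly.
trap 'touch "$SLURM_SUBMIT_DIR/.exit_requested"' SIGUSR2

# Run the step in the background so the trap can fire while we wait on it.
srun torchrun --nproc_per_node=8 train.py &
TRAIN_PID=$!

# `wait` returns early when the trap fires, so wait a second time
# to let the training step finish its graceful shutdown.
wait $TRAIN_PID
wait $TRAIN_PID
```

On the training side, the loop would then check for the trigger file between steps (or at some other safe point), save a checkpoint, and exit, which sidesteps the problem of `torchrun` intercepting the signal entirely.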