SLURM can automatically requeue jobs (e.g. on node failure or preemption by a higher-priority job: https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.
If this is enabled (if you don't specify the flag in sbatch, the default is defined by slurm.conf), this causes a few issues:
- As the job id does not change, the log file of the previous run is overwritten (this is what actually triggered me to look into this).
  - The nicest option would be to create separate files under engine/ for each run (that is the behaviour without requeue, since the SLURM job id changes). But this is afaik not possible, as the restart number is not available in the corresponding filename pattern: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
  - Alternatively, set --open-mode=append (https://slurm.schedmd.com/sbatch.html#OPT_open-mode) so that the previous log is kept in the same file. <-- my preferred solution
- Requeuing of non-resumable tasks would be easy to prevent by always setting --no-requeue (https://slurm.schedmd.com/sbatch.html#OPT_no-requeue) for non-resumable tasks. But this would require passing the information whether a task is resumable to the submit call function (sisyphus/sisyphus/engine.py, line 36 in a22e923), which would also potentially break custom engine implementations (but that should be an easy fix, and I only know of a single custom engine implementation, by @Zettelkasten). <-- my preferred solution

Alternatively, both issues would be fixed by always setting --no-requeue, but then we would lose the advantages for resumable jobs.
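To make the two preferred fixes concrete, here is a minimal sketch of how the sbatch flags could be chosen per task. The helper names `requeue_flags` and `build_sbatch_call` and the `resumable` parameter are made up for illustration and are not part of the current sisyphus engine interface; only the sbatch flags themselves come from the SLURM documentation linked above.

```python
from typing import List


def requeue_flags(resumable: bool) -> List[str]:
    """Extra sbatch flags for one task (hypothetical helper, not sisyphus API)."""
    # A requeued job keeps its job id, so the output file name stays the same.
    # Appending keeps the log of the earlier run instead of overwriting it.
    flags = ["--open-mode=append"]
    if not resumable:
        # Non-resumable tasks should not be restarted automatically by SLURM.
        flags.append("--no-requeue")
    return flags


def build_sbatch_call(script: str, resumable: bool) -> List[str]:
    """Assemble the sbatch command line as an argument list."""
    return ["sbatch", *requeue_flags(resumable), script]
```

For resumable tasks no requeue flag is passed, so the cluster default from slurm.conf still applies and those jobs keep the benefit of automatic requeuing.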
Are there any other opinions? If not I'd create a PR for the two fixes.
To me, your proposed options sound valid. For the log file I see no issues at all; the second one may need an additional look but should also be fine.
The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between the entries of different runs.
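As a rough sketch of what such a separation could look like when logs are appended: the helper name `write_run_separator` and the banner format below are assumptions for illustration, not what the local engine currently writes. The job would call it once at startup, before any other output.

```python
import sys
from datetime import datetime
from typing import TextIO


def write_run_separator(stream: TextIO, started: datetime, width: int = 60) -> None:
    """Write a visible banner so appended runs can be told apart (hypothetical helper)."""
    bar = "=" * width
    stream.write(f"\n{bar}\n")
    stream.write(f"New run started at {started.isoformat(timespec='seconds')}\n")
    stream.write(f"{bar}\n")
    # Flush so the banner appears before any output of the task itself.
    stream.flush()


if __name__ == "__main__":
    write_run_separator(sys.stdout, datetime.now())
```

Writing the banner from inside the job keeps this independent of the engine, since with --open-mode=append SLURM only appends whatever the job prints.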