Job in error state: is it actually running? I see that the logfile is being updated.
Then I did rm alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train/error.run.1 and restarted the Sisyphus manager. It then showed the job correctly as running again:
...
[2024-01-13 11:18:48,049] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:19:01,303] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:19:45,088] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:22:12,427] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:22:25,680] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:22:38,934] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running
[2024-01-13 11:22:52,403] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31 Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
...
Looking into the job's engine directory:
ls -la alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train/engine
total 196018
drwxr-xr-x 2 zeyer assi 12 Jan 13 12:23 .
drwxr-xr-x 6 zeyer assi 15 Jan 14 00:05 ..
-rw-r--r-- 1 zeyer assi 104867035 Jan 6 04:53 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.3967412.1
-rw-r--r-- 1 zeyer assi 207302157 Jan 9 20:15 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4000241.1
-rw-r--r-- 1 zeyer assi 10852444 Jan 9 23:46 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4058657.1
-rw-r--r-- 1 zeyer assi 17455053 Jan 10 08:08 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4061471.1
-rw-r--r-- 1 zeyer assi 76535987 Jan 11 11:10 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4064307.1
-rw-r--r-- 1 zeyer assi 2580756 Jan 11 11:55 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4090298.1
-rw-r--r-- 1 zeyer assi 23976868 Jan 11 19:14 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4090675.1
-rw-r--r-- 1 zeyer assi 373200 Jan 12 15:10 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4103762.1
-rw-r--r-- 1 zeyer assi 62070882 Jan 13 12:22 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4110798.1
-rw-r--r-- 2 zeyer assi 36560112 Jan 14 00:14 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4119424.1
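Each of those engine log files appears to correspond to one Slurm submission of the same task, with the Slurm job ID embedded in the file name. As a sketch, the accounting state of the two most recent runs could be cross-checked like this (plain sacct usage, nothing setup-specific):
# Query Slurm accounting for the last two runs; sacct also covers jobs that already finished.
sacct -j 4110798,4119424 -o JobID,State,ExitCode,Elapsed,NodeList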
The last job there:
--------------------- Slurm Task Prolog ------------------------
Job ID: 4119424
Job name: i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run
Host: cn-230
Date: Sat Jan 13 12:23:13 PM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_11gb
Work dir:
------------------
Node usage:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4119424_1 gpu_11gb i6_core. zeyer R 0:00 1 cn-230
------------------
Show launch script with:
sacct -B -j
------------------
--------------------- Slurm Task Prolog ------------------------
...
This is still running.
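As a quick cross-check from the Slurm side, run 4119424 can be queried directly (standard squeue options):
# Show state, elapsed time and node of the current run:
squeue -j 4119424 -o "%.12i %.10P %.8T %.10M %R"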
The previous job:
--------------------- Slurm Task Epilog ------------------------
Job ID: 4110798
Time: Sat Jan 13 12:22:47 PM CET 2024
Elapsed Time: 21:13:16
Billing per second for TRES: billing=376,cpu=16,gres/gpu=4,mem=60G,node=1
Show resource usage with e.g.:
sacct -j 4110798 -o Elapsed,TotalCPU,UserCPU,SystemCPU,MaxRSS,ReqTRES%60,MaxDiskRead,MaxDiskWrite
--------------------- Slurm Task Epilog ------------------------
slurmstepd-cn-230: error: Detected 1 oom-kill event(s) in StepId=4110798.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
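To confirm that this run really hit its memory limit (mem=60G according to the epilog above), the peak memory can be compared against the request, e.g.:
# MaxRSS is reported on the .batch step line; compare it against ReqMem:
sacct -j 4110798 -o JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem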
I'm not really sure how the job got into the error state.
I think seeing an error.run.1 file on disk takes precedence over seeing the task running in Slurm (or SGE, for that matter).
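Roughly, that precedence could be pictured like this (an illustrative sketch only, not the actual Sisyphus code; the path and job ID are taken from the report above):
JOB=alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train
if [ -e "$JOB/error.run.1" ]; then
    # the error marker on disk wins, even if Slurm still reports the task as running
    echo "manager shows the job as failed"
elif squeue -j 4119424 -h -t RUNNING | grep -q .; then
    echo "manager shows the job as running"
fi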
Now, how did we get here? Was this maybe another case of a job being submitted to the queue while another instance of the same job was still running? I imagine that if the first instance errored while the second one was still pending, we would arrive at this situation.
Or was the second instance maybe submitted after the first instance crashed, but before the manager found the error file?
I can confirm that the manager reports an error as soon as the error file is found, and I also agree that the most likely scenario is that two jobs were submitted and one crashed.
It is unlikely that the manager restarted the job before it found the error file. I would not rule it out, but there are safeguards to prevent this; otherwise, this would probably happen frequently to finished jobs as well.