Job marked in error state is actually running #174

Open
albertz opened this issue Jan 13, 2024 · 2 comments

albertz commented Jan 13, 2024

Job in error state:

[2024-01-13 23:03:10,182] ERROR: error: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8>

But the job is actually still running? I can see that the log file is being updated.

Then I did rm alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train/error.run.1 and restarted the Sisyphus manager. After that, it correctly showed the job as running again:

[2024-01-13 23:06:27,019] INFO: running: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> {ep 282/500}  

Looking a bit earlier in the Sisyphus log, these are the last messages before the error:

[2024-01-13 11:22:53,101] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8>                                                                     
[2024-01-13 11:22:53,101] INFO: interrupted_resumable(1) queue(2) running(6) waiting(455)
[2024-01-13 11:22:56,496] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8 run [1]                                          
[2024-01-13 11:22:56,511] INFO: Submitted with job_id: ['4119424'] i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run                                 
[2024-01-13 11:22:56,827] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py                                                                                                                                               
...
[2024-01-13 11:22:56,828] INFO: queue: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8>                                                                                        
...
[2024-01-13 11:22:57,163] INFO: queue(3) running(6) waiting(455)                                                                                                
[2024-01-13 11:23:07,106] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py                                                                                                                                               
[2024-01-13 11:23:07,106] ERROR: error: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8>                                                                                       
...

Before that, I see this:

...
[2024-01-13 11:18:48,049] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:19:01,303] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:19:45,088] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:22:12,427] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:22:25,680] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:22:38,934] WARNING: Job marked as running but logging file has not been updated: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train work/i6_core/returnn/training/ReturnnTrainingJob.LgJ1IEE3O1g8> assume it is running          
[2024-01-13 11:22:52,403] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
...
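
For context, that "logging file has not been updated" warning looks like a staleness check on the run log's modification time. A rough sketch of such a check, with an assumed threshold and made-up names (not the actual Sisyphus implementation):

import os
import time

STALE_AFTER_SECONDS = 300  # assumed threshold, not necessarily what Sisyphus uses

def log_looks_stale(log_path):
    # Treat the run log as stale if it has not been modified for a while.
    try:
        age = time.time() - os.path.getmtime(log_path)
    except OSError:
        # A missing log file also counts as "not updated".
        return True
    return age > STALE_AFTER_SECONDS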

Looking into the job's engine directory:

ls -la alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_558k-EBranchformer-dynGradAccumV2/train/engine
total 196018
drwxr-xr-x 2 zeyer assi        12 Jan 13 12:23 .
drwxr-xr-x 6 zeyer assi        15 Jan 14 00:05 ..
-rw-r--r-- 1 zeyer assi 104867035 Jan  6 04:53 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.3967412.1
-rw-r--r-- 1 zeyer assi 207302157 Jan  9 20:15 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4000241.1
-rw-r--r-- 1 zeyer assi  10852444 Jan  9 23:46 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4058657.1
-rw-r--r-- 1 zeyer assi  17455053 Jan 10 08:08 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4061471.1
-rw-r--r-- 1 zeyer assi  76535987 Jan 11 11:10 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4064307.1
-rw-r--r-- 1 zeyer assi   2580756 Jan 11 11:55 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4090298.1
-rw-r--r-- 1 zeyer assi  23976868 Jan 11 19:14 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4090675.1
-rw-r--r-- 1 zeyer assi    373200 Jan 12 15:10 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4103762.1
-rw-r--r-- 1 zeyer assi  62070882 Jan 13 12:22 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4110798.1
-rw-r--r-- 2 zeyer assi  36560112 Jan 14 00:14 i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run.4119424.1

The last job there:

--------------------- Slurm Task Prolog ------------------------
Job ID: 4119424
Job name: i6_core.returnn.training.ReturnnTrainingJob.LgJ1IEE3O1g8.run
Host: cn-230
Date: Sat Jan 13 12:23:13 PM CET 2024
User: zeyer 
Slurm account: hlt
Slurm partition: gpu_11gb 
Work dir: 
------------------ 
Node usage:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
4119424_1  gpu_11gb i6_core.    zeyer  R       0:00      1 cn-230
------------------ 
Show launch script with:
sacct -B -j  
------------------
--------------------- Slurm Task Prolog ------------------------ 
...

This is still running.

The previous job:

--------------------- Slurm Task Epilog ------------------------ 
Job ID: 4110798
Time: Sat Jan 13 12:22:47 PM CET 2024
Elapsed Time: 21:13:16
Billing per second for TRES: billing=376,cpu=16,gres/gpu=4,mem=60G,node=1
Show resource usage with e.g.: 
sacct -j 4110798 -o Elapsed,TotalCPU,UserCPU,SystemCPU,MaxRSS,ReqTRES%60,MaxDiskRead,MaxDiskWrite
--------------------- Slurm Task Epilog ------------------------
slurmstepd-cn-230: error: Detected 1 oom-kill event(s) in StepId=4110798.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I'm not really sure how the job got into the error state.

michelwi commented:

I think seeing an error.run.1 file on disk takes precedence over seeing the task running in Slurm (or SGE, for that matter).
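
To illustrate what I mean by precedence, here is a minimal sketch with placeholder names (job_dir, task, task_id, engine_says_running); this is not the actual Sisyphus code:

import os

def resolve_task_state(job_dir, task, task_id, engine_says_running):
    # An error marker on disk wins, even if the engine (Slurm/SGE) still
    # lists a task with this name as running.
    error_marker = os.path.join(job_dir, "error.%s.%i" % (task, task_id))
    if os.path.exists(error_marker):
        return "error"
    return "running" if engine_says_running else "waiting"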

Now, how did we get here? Was this maybe another case of a job being submitted to the queue while another instance of the same job was still running? I imagine that if the first instance errored while the second one was still pending, we would arrive at this situation.

Or was the second instance maybe submitted after the first instance crashed, but before the manager found the error file?


critias commented Jan 14, 2024

I can confirm that the manager reports an error as soon as the error file is found, and I also agree that the most likely scenario is that two jobs were submitted and one crashed.

That the manager restarted the job before it found the error file is unlikely. I would not rule it out, but there are steps in place to prevent this. Otherwise, this would probably happen frequently to finished jobs as well.
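
As one illustration of such a step (purely hypothetical, with assumed names, not the actual Sisyphus logic), re-checking for an error marker right before resubmitting an interrupted task would close most of that window:

import os

def safe_to_resubmit(job_dir, task, task_id):
    # Re-check for an error marker immediately before resubmitting, so a crash
    # that happened in the meantime is not masked by a fresh submission.
    error_marker = os.path.join(job_dir, "error.%s.%i" % (task, task_id))
    return not os.path.exists(error_marker)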
