
Add timeout to SSH queue scan command #191

Merged: 11 commits from too-many-open-files-fix merged into master on Jun 18, 2024

Conversation

@Icemole (Collaborator) commented May 31, 2024

If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a `Too many open files` error.

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense. (Edit: not anymore, see below.)

Fix #190.

@albertz (Member) commented May 31, 2024

If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password

So you add "-o", "ConnectTimeout" to the SSH command? That seems like a bad workaround for this problem to me.

What about -o BatchMode=yes?

Or -o PasswordAuthentication=no?
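
For illustration, a minimal sketch of how such options could be prepended to an SSH-wrapped queue scan command (the ssh_wrap helper, the gateway argument and the example squeue call are hypothetical, not sisyphus' actual code):

def ssh_wrap(scan_cmd, gateway):
    # Hypothetical helper: make ssh fail immediately instead of prompting for a password.
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt for a password/passphrase
        "-o", "PasswordAuthentication=no",  # fail fast if key-based auth is not set up
        "-o", "ConnectTimeout=10",          # optionally also bound the connection setup
        gateway,
    ] + scan_cmd

# e.g. ssh_wrap(["squeue", "--me"], "cluster-gateway")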

@albertz (Member) commented May 31, 2024

... sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a Too many open files error.

Also, I wonder about this. How can this happen? Are there uncleaned zombie procs? Did you check this? I think this is another separate thing you can and should fix.

Edit: I just checked. I think p.communicate(..., timeout=...) will actually not kill the proc when the timeout happens. And when p goes out of scope, the Python GC will also not kill the proc. Actually, in Popen.__del__, in case the proc is still alive, there is this logic:

        if self.returncode is None and _active is not None:
            # Child is still running, keep us alive until we can wait on it.
            _active.append(self)

So that means: if the proc never dies by itself, it will stay alive forever. To kill it, you must explicitly kill it somewhere, but it seems we never do that?

Btw, if we just used subprocess.run, it would properly clean up (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just use subprocess.run directly.
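
As a standalone illustration of the difference (a minimal sketch, not the sisyphus code itself):

import subprocess

# Old pattern: on timeout, communicate() raises TimeoutExpired, but the child
# keeps running; nothing here ever kills it.
p = subprocess.Popen(["sleep", "60"], stdout=subprocess.PIPE)
try:
    out, err = p.communicate(timeout=1)
except subprocess.TimeoutExpired:
    pass  # p is still alive and must be killed explicitly, e.g. p.kill(); p.wait()

# New pattern: subprocess.run() kills the child and waits for it before re-raising.
try:
    subprocess.run(["sleep", "60"], capture_output=True, timeout=1)
except subprocess.TimeoutExpired:
    pass  # the child has already been cleaned up by run()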

@Icemole (Collaborator, Author) commented May 31, 2024

Btw, if we just used subprocess.run, it would properly clean up (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just use subprocess.run directly.

I think that's a much more intuitive solution than what I proposed.

@albertz (Member) commented May 31, 2024

Btw, if we just used subprocess.run, it would properly clean up (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just use subprocess.run directly.

I think that's a much more intuitive solution than what I proposed.

Additionally, I would still add the -o BatchMode=yes or -o PasswordAuthentication=no though. There is no need to wait for the timeout.

@albertz (Member) commented May 31, 2024

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

@Icemole (Collaborator, Author) commented May 31, 2024

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

I can do that as well. I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.

@albertz (Member) commented May 31, 2024

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

I can do that as well.

I see the same wrong usage in these files:

  • aws_batch_engine.py
  • load_sharing_facility_engine.py
  • simple_linux_utility_for_resource_management_engine.py
  • son_of_grid_engine.py

I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.

I would not do that for now. It's basically one line of code (subprocess.run). There is not really much code duplication. Moving this one line of code elsewhere would just make it more complicated. Or do you also mean the gateway logic? But in any case, I would not do this here in this PR.

@critias (Contributor) commented May 31, 2024

I would not do that for now. It's basically one line of code (subprocess.run). There is not really much code duplication. Moving this one line of code elsewhere would just make it more complicated. Or do you also mean the gateway logic? But in any case, I would not do this here in this PR.

I agree with Albert here.
And thanks for figuring this out, I didn't realize that the subprocess.Popen process would stay open.

@Icemole (Collaborator, Author) commented May 31, 2024

Roger that :) will implement right away.

@Icemole requested a review from albertz on May 31, 2024, 16:21
@albertz (Member) commented May 31, 2024

Looks all fine now. Except maybe this aspect:

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense.

I wonder if it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e., for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.

@Icemole (Collaborator, Author) commented Jun 3, 2024

Looks all fine now. Except maybe this aspect:

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense.

I wonder if it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e., for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.

Okay, I'll change it back. I thought it made sense given the name.

For now I'll change it back to 30 seconds, but upon reviewing the global_settings.py file, I found the following potential candidates ("separate settings") already in there:

  • WAIT_PERIOD_SSH_TIMEOUT.
  • WAIT_PERIOD_QSTAT_PARSING.

@Icemole (Collaborator, Author) commented Jun 13, 2024

I've been testing this, and while my manager didn't crash anymore because of too many open files, there was abnormally high memory usage that kept growing over time whenever the subprocesses didn't finish successfully:

[2024-06-13 07:24:26,842] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]
[2024-06-13 07:24:26,953] ERROR: Error to submit job, return value: 255
[2024-06-13 07:24:26,953] ERROR: SBATCH command: sbatch -J i6_core.returnn.extract_prior.ReturnnComputePriorJobV2.yYMFtGVTZgnn.run -o work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn/engine/%x.%A.%a [...] --wrap=/usr/bin/python3 /home/nbeneitez/work/sisyphus/too-many-open-files-fix/recipe/sisyphus/sis worker --engine long work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run
[2024-06-13 07:24:26,953] ERROR: Error: [...] Permission denied (publickey).

When messages like this one were spammed, I noticed very high memory usage on the manager's side. I'll try to review the code a bit more.

@albertz (Member) commented Jun 13, 2024

The code looks fine to me though (sorry, forgot to review again, did that now).

You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?
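
A minimal sketch of how such a check could look with psutil (psutil is an assumption here, not something sisyphus requires; the PID is a placeholder):

import psutil

manager = psutil.Process(3442960)  # placeholder: PID of the manager's main proc

# RSS of the manager's main process, in MB
print("manager RSS MB:", manager.memory_info().rss / 1e6)

# List all child procs; zombies show up with status "zombie".
for child in manager.children(recursive=True):
    print(child.pid, child.name(), child.status())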

 if send_to_stdin:
     send_to_stdin = send_to_stdin.encode()
-out, err = p.communicate(input=send_to_stdin, timeout=30)
+p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
Member commented:

I wonder, don't you need to catch TimeoutExpired?

Collaborator Author commented:

Yes, you're right, I'll ignore any TimeoutExpired.

Member commented:

So what does it mean? What happens now? The whole manager crashes?

Collaborator Author commented:

I don't think I got any TimeoutExpired exception; my error happened well within the timeout because of a public key mismatch, and it was actually output within the first second after the command was run:

[2024-06-13 07:24:26,842] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]
[2024-06-13 07:24:26,953] ERROR: Error to submit job, return value: 255

Member commented:

Actually, I realize now, in the code before your change, we also did not catch TimeoutExpired (in the p.wait(timeout=30)). I wonder why we did not catch it, and what happened when this exception was thrown there. Do you know? (As I understood you, you have run into this exact problem.)

Maybe we should not change this and also not catch TimeoutExpired? @critias?

@albertz (Member) commented Jun 13, 2024

I'm testing with a simple script whether there are any problems w.r.t. memory consumption or so. The script:

import subprocess

while True:
    try:
        subprocess.run(["cat", "/dev/zero"], timeout=0.01)
    except subprocess.TimeoutExpired:
        print("TimeoutExpired")

Just run this and watch the memory consumption in the meantime. I don't see any increase in memory.

So, I don't think that this is the problem.

@albertz (Member) commented Jun 13, 2024

When messages like this one were spammed

What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?

@Icemole (Collaborator, Author) commented Jun 13, 2024

You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?

Agreed. I panicked because the manager was making our head node go super slow, so I canceled the program 😅 I'll make a mental note to scan /proc thoroughly the next time this happens.

What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?

It doesn't wait for the full timeout (30 seconds), but it's not spammy (as in < 1 second per print) either:

[2024-06-13 06:30:15,908] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]                                          
[2024-06-13 06:30:16,021] ERROR: Error to submit job, return value: 255     
...                                                                                               
[2024-06-13 06:30:28,284] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]                                          

The job immediately fails because of Permission denied (publickey), and the manager seems to wait another 10-20 seconds before running another job.

@Icemole (Collaborator, Author) commented Jun 14, 2024

I ran into the issue again. So I scanned the process, and it seems to open many submit_log.run file descriptors at some point, immediately close them, and do nothing else for 15-30 seconds. The ls commands below were run roughly 0.5 seconds apart:

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:06 17 -> 'pipe:[518288582]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:58 18 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:06 19 -> .../submit_log.run
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:43 20 -> .../submit_log.run
l-wx------ 1 nbeneitez domain_users 64 Jun 14 05:15 21 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:21 22 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:59 23 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:21 24 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 04:54 25 -> .../submit_log.run
l-wx------ 1 nbeneitez domain_users 64 Jun 14 05:52 26 -> 'pipe:[518288582]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 04:53 27 -> 'pipe:[518288583]'
l-wx------ 1 nbeneitez domain_users 64 Jun 14 06:16 28 -> 'pipe:[518288583]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:02 29 -> .../submit_log.run
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

All the submit_log.run descriptors were pointing to different jobs. I don't see anything suspicious about this; it's just writing the queueing log to these files and then closing them. In general there seem to be around 16 open files, many of which are sockets and pipes.

The memory usage is currently at 6.4% of 250 GB and keeps growing slowly but surely (update: 6.6% at the time of finishing this comment, so roughly 10 minutes later). ps aux | grep <manager-pid> only shows a single process:

USER         PID %CPU %MEM      VSZ   RSS     TTY      STAT  START    TIME   COMMAND
nbeneit+ 3442960  164  6.5 22736408 17258504 pts/8 Sl+ Jun13 2585:55 python3 sis m config/training.py

but htop does show many processes with the same name but different PIDs, some of which have been running for some hours already. (Two htop screenshots showing this were attached.)

In /proc/3442960/task I see 106 subdirectories, so the program seems to spawn many threads, some of which aren't being cleaned up, since according to man 5 proc:

Underneath each of the /proc/pid directories, a task subdirectory contains subdirectories of the form task/tid, which contain corresponding information about each of the threads in the process, where tid is the kernel thread ID of the thread.

subprocess.run waits for the program to complete and then returns. Is it possible that some of the processes never finish? But we have the timeout for that 🤔 so why are some subprocesses not being cleaned up?

Update: I watched ls /proc/3442960/task | wc for a minute or so but the number of active threads didn't increase.
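
For reference, a small stdlib-only sketch of how the thread count and memory of the manager could be watched over time (using the PID from the ls output above):

import os
import time

PID = 3442960  # the manager PID from the ls output above

def snapshot(pid):
    n_threads = len(os.listdir(f"/proc/{pid}/task"))
    with open(f"/proc/{pid}/status") as f:
        rss_kb = next(int(line.split()[1]) for line in f if line.startswith("VmRSS"))
    return n_threads, rss_kb

while True:
    n_threads, rss_kb = snapshot(PID)
    print(f"threads={n_threads} rss={rss_kb / 1024:.1f} MB")
    time.sleep(10)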

@albertz (Member) commented Jun 14, 2024

Why do you actually think the memory leak is related to the PR here? Does it only increase when you get those "Error to submit job" messages, and otherwise stay constant?

So are there alive subprocs or not? Are there any zombie procs? (What does ps ax say?)

What are the threads? (E.g. what does py-spy or pystack say?)

Where is the memory allocated? (What does a mem profiler say?)

@Icemole (Collaborator, Author) commented Jun 14, 2024

Yes, it could very well be the case that there's some other error going on that's not related to the PR, but I was able to observe it after making this change, so I've reported it here. If you want, we can move this discussion to another PR.

There are no zombie processes. ps ax | grep Z reports nothing.

Sadly I don't have sudo powers and ptrace is not enabled, so I can't dump stats with pystack. However, there are exactly 106 warnings with the message WARNING(process_remote): Failed to attach to thread <pid>: Operation not permitted, with the PIDs being exactly the same as the ones in /proc/3442960/task. Besides, I've just noticed that all thread IDs in /proc/3442960/task are in sequential order, which might indicate that they were created at the same time.

@albertz (Member) commented Jun 14, 2024

sudo powers and ptrace is not enabled

You don't need sudo for that. Some admin just has to enable ptrace; then you can use it without sudo. Please do that. It doesn't make sense that we waste time here guessing around instead of properly debugging the issue.

@Icemole (Collaborator, Author) commented Jun 17, 2024

Thanks for the suggestion, I've already asked to enable ptrace for everyone. In the meantime, one of our admins got me the trace we were looking for: pystack_threads_tracker.log.

From the log it would seem as if half of the threads (~50) were just starting up (see task = get()), and the other half were waiting for some job to finish. Moreover, the latter seem to be training-related since they go through i6_core/returnn/training.py. As you said, I indeed don't see any abnormal behavior here...

@albertz (Member) commented Jun 17, 2024

got me the trace we were looking for ...

Is this the trace from the time when the problem occurs, i.e. while messages like this one were being spammed, or during the memory increase you saw in htop or so? It's important to get the trace at exactly that time. Also, take not just one trace but several traces during that period, to be sure you don't miss some interesting bits.

From that trace, I can also only say that there don't seem to be any weird threads running. But as you said, since no more threads are coming up (specifically during the mem increase period), the threads are likely not the issue. Also, as there are no zombie procs, no excessive number of subprocs, and not too many open files, I don't see any indication that the mem leak is related to this PR here. I still wonder why you got the idea that it might be related. Or does this abnormally high memory usage which you observe only occur now with this PR, and has it never occurred before?

As the next step, I would do some actual memory profiling, to see where the memory increase occurs in the code.
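
For example, a hedged sketch with the standard-library tracemalloc module, which could be dropped into the manager's main loop to see which lines of code account for the growth:

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
baseline = tracemalloc.take_snapshot()

# ... let the manager run for a while, then later:
current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)  # top 10 source lines by memory growth since the baseline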

In any case, I think independently of this, we can go forward with this PR here. There are some outstanding issues which I commented on.

@Icemole (Collaborator, Author) commented Jun 18, 2024

Definitely agreed, this was off-topic.

@albertz I've addressed your feedback.

@Icemole merged commit 51db853 into master on Jun 18, 2024 (3 checks passed).
@Icemole deleted the too-many-open-files-fix branch on June 18, 2024, 07:15.
-out, err = p.communicate(input=send_to_stdin, timeout=30)
+try:
+    p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
+except subprocess.TimeoutExpired:
Contributor commented:

If we catch the subprocess.TimeoutExpired exception here and return, then the whole retry logic in submit_helper

while True:
    try:
        out, err, retval = self.system_call(sbatch_call)
    except subprocess.TimeoutExpired:
        logging.warning(self._system_call_timeout_warn_msg(command))
        time.sleep(gs.WAIT_PERIOD_SSH_TIMEOUT)
        continue
    break

is obsolete. If this was intentional, then maybe also remove that logic? gs.WAIT_PERIOD_SSH_TIMEOUT would then be ignored.

Member commented:

Ah, so it is actually handled; that was my earlier question about this. Then I would not catch it, unless there is a good reason to change this old behavior regarding TimeoutExpired.

Collaborator Author commented:

Yes... I hadn't noticed that it would be caught later in the call stack. See #196.
