Add timeout to SSH queue scan command #191
If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a `Too many open files` error.

I've also changed the hardcoded 30 second interval for an interval depending on `gs.WAIT_PERIOD_BETWEEN_CHECKS`, which I find makes more sense. (Edit: not anymore, see below.)

Fix #190.
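For orientation, here is a minimal sketch of the change discussed in this PR, based on the diff quoted further down in the thread. The real code is a method on the SLURM engine class and differs in details; treat this as a sketch, not the actual implementation.

```python
import subprocess

def system_call(system_command, send_to_stdin=None, timeout=30):
    # Sketch of the fixed call: if ssh falls back to an interactive password
    # prompt, subprocess.run() aborts after `timeout` seconds, kills the child,
    # and raises subprocess.TimeoutExpired instead of blocking forever.
    if send_to_stdin:
        send_to_stdin = send_to_stdin.encode()
    p = subprocess.run(system_command, input=send_to_stdin,
                       capture_output=True, timeout=timeout)
    return p.stdout, p.stderr, p.returncode
```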
So you add … What about …? Or …?
Also, I wonder about this. How can this happen? Are there uncleaned zombie procs? Did you check this? I think this is another separate thing you can and should fix.
Edit: I just checked. I think the relevant code in subprocess is this:

    if self.returncode is None and _active is not None:
        # Child is still running, keep us alive until we can wait on it.
        _active.append(self)

So, it means, if the proc never dies itself, it will always stay alive. To kill it, you must explicitly kill it somewhere, but it seems we never do that? Btw, if we would just use …
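For reference, the pattern the Python docs give for `Popen.communicate` with a timeout makes this explicit kill visible; in CPython, `subprocess.run` performs essentially the same cleanup internally when the timeout expires, which is why switching to it sidesteps the problem. The ssh command below is just an illustration, not sisyphus code.

```python
import subprocess

proc = subprocess.Popen(["ssh", "somehost", "squeue"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=30)
except subprocess.TimeoutExpired:
    # Without this explicit kill, a hung ssh (e.g. waiting for a password)
    # stays alive and keeps its pipes / file descriptors open.
    proc.kill()
    out, err = proc.communicate()
```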
I think that's a much more intuitive solution than what I proposed.
Additionally, I still would add the … This way password requests are disabled.
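The option referred to here is not visible in this export; an assumption is that it is ssh's `BatchMode`, which makes ssh fail immediately instead of prompting for a password. A hypothetical sketch (host and queue command are placeholders, not sisyphus code):

```python
import subprocess

ssh_host = "cluster-head-node"        # placeholder
queue_command = ["squeue", "-u", "someuser"]  # placeholder queue scan command
# "-o BatchMode=yes" makes ssh fail right away instead of prompting for a
# password, so the call cannot block on interactive input.
system_command = ["ssh", "-o", "BatchMode=yes", ssh_host] + queue_command
p = subprocess.run(system_command, capture_output=True, timeout=30)
```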
Btw, I wonder about similar wrong usages of …
Avoid redundant lines of code
I can do that as well. I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.
I see the same wrong usage in these files:
I would not do that for now. It's basically one line of code (…).
I agree with Albert here.
Roger that :) will implement right away.
Looks all fine now. Except maybe this aspect:
I wonder if it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e., for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.
Okay, I'll change it back. I thought it made sense given the name. For now I'll change it back to 30 seconds, but upon reviewing the …
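What a separate, dedicated setting could look like, as suggested above. This is a sketch; `SSH_COMMAND_TIMEOUT` is a hypothetical name, not an existing sisyphus setting.

```python
# Hypothetical separate setting (name made up for illustration); in sisyphus
# this kind of constant lives in the global settings module imported as "gs".
SSH_COMMAND_TIMEOUT = 30  # seconds before an ssh queue scan / submit command is aborted
```

The engine would then pass `timeout=gs.SSH_COMMAND_TIMEOUT` instead of the hardcoded 30.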
I've been testing this, and while my manager didn't crash anymore because of too many open files, there was abnormally high memory usage that kept growing over time whenever the subprocesses didn't finish successfully:
When messages like this one were spammed, I noticed a very high memory usage from the manager's side. I'll try to review the code a bit more.
The code looks fine to me though (sorry, forgot to review again, did that now). You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?
    if send_to_stdin:
        send_to_stdin = send_to_stdin.encode()
    -   out, err = p.communicate(input=send_to_stdin, timeout=30)
    +   p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
I wonder, don't you need to catch `TimeoutExpired`?
Yes, you're right, I'll ignore any `TimeoutExpired`.
So what does it mean? What happens now? The whole manager crashes?
I don't think I had any `TimeoutExpired` exception; my error was within the timeout because of a public key mismatch, and it was actually output within the first second after the command was run:
[2024-06-13 07:24:26,842] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]
[2024-06-13 07:24:26,953] ERROR: Error to submit job, return value: 255
Actually, I realize now, in the code before your change, we also did not catch `TimeoutExpired` (in the `p.wait(timeout=30)`). I wonder why we did not catch it, and what happened when this exception was thrown there. Do you know? (As I understood you, you have run into this exact problem.)
Maybe we should not change this and also not catch `TimeoutExpired`? @critias?
I'm testing with a simple script whether there are any problems w.r.t. memory consumption or so. Script here, or just directly here:

    import subprocess

    while True:
        try:
            subprocess.run(["cat", "/dev/zero"], timeout=0.01)
        except subprocess.TimeoutExpired:
            print("TimeoutExpired")

Just run this, and watch the memory consumption meanwhile. I don't see that there is any increase in memory. So, I don't think that this is the problem.
What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?
Agreed, I panicked because I was making our head node go super slow and canceled the program 😅 I'll make a mental note to scan …
It doesn't wait for the full timeout (30 seconds), but it's not spammy (as in < 1 second per print) either:
The job immediately fails because of …
Why do you actually think the memory leak is related to the PR here? It only increases when you get those "Error to submit job" messages and otherwise stays constant?

So are there alive subprocs or not? Are there any zombie procs? (What does … say?)

What are the threads? (E.g. what does … show?)

Where is the memory allocated? (What does a mem profiler say?)
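One way to gather this information from inside the manager process (e.g. from a debug hook), as a sketch; it assumes the third-party psutil package, which is not used anywhere in this thread:

```python
import threading
import psutil  # third-party, assumed available just for this diagnostic sketch

proc = psutil.Process()  # the manager's own process
print("RSS MiB:", proc.memory_info().rss / 2**20)
print("threads:", [t.name for t in threading.enumerate()])
for child in proc.children(recursive=True):
    # A defunct child would show up here with status "zombie".
    print(child.pid, child.name(), child.status())
```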
Yes, it could very well be the case that there's some other error going on that's not related to the PR, but I was able to observe it after doing this change, so I've reported it here. If you want, we can move this discussion to another PR. There are no zombie processes. Sadly I don't have sudo powers and ptrace is not enabled, so I can't dump stats for …
You don't need sudo for that. Just some admin has to enable ptrace. Then you can use it without sudo. Please do that. It doesn't make sense that we waste time here by just guessing around instead of properly debugging the issue.
Thanks for the suggestion, I've already asked to enable ptrace for everyone. In the meantime, one of our admins got me the trace we were looking for: pystack_threads_tracker.log. From the log it would seem as if half of the threads (~50) were just starting up (see …).
Is this the trace from the time when the problem occurs, i.e. while messages like this one were being spammed, or during the memory increase when watching memory usage in htop or so? It's important to get the trace at exactly this time. Also, take not just one trace but several traces during that period, to be sure you don't miss some interesting bits.

From that trace, I can only say that there don't seem to be any weird threads running. But if, as you said, there are no more threads coming up (specifically during the mem increase period), the threads are likely not the issue. Also, as there are no zombie procs, no excessive number of subprocs, nor too many open files, I don't see any indication that the mem leak is related to this PR here. I still wonder why you got the idea that it might be related. Or does this abnormally high memory usage which you observe only occur now with this PR, and has it never occurred before?

As the next step, I would do some actual memory profiling, to see where the memory increase occurs in the code. In any case, I think independently of this, we can go forward with this PR here. There are some outstanding issues which I commented on.
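For the suggested memory profiling, the standard library's `tracemalloc` is one option; where exactly to hook this into the manager is an assumption, not something discussed in the thread.

```python
import tracemalloc

tracemalloc.start()  # e.g. at manager startup

# ... let the manager run until the memory increase is clearly visible ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    # Shows the source lines responsible for the largest live allocations.
    print(stat)
```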
Definitely agreed, this was off-topic. @albertz I've fixed your feedback. |
    -   out, err = p.communicate(input=send_to_stdin, timeout=30)
    +   try:
    +       p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
    +   except subprocess.TimeoutExpired:
If we catch the `subprocess.TimeoutExpired` exception here and return, then the whole logic about retrying in `submit_helper` would never be triggered:

sisyphus/sisyphus/simple_linux_utility_for_resource_management_engine.py, lines 239 to 246 in 51db853:

    while True:
        try:
            out, err, retval = self.system_call(sbatch_call)
        except subprocess.TimeoutExpired:
            logging.warning(self._system_call_timeout_warn_msg(command))
            time.sleep(gs.WAIT_PERIOD_SSH_TIMEOUT)
            continue
        break
If this was intentional, then maybe also remove this logic? And `gs.WAIT_PERIOD_SSH_TIMEOUT` would be ignored then.
Ah, so it is actually handled. That was my earlier question about this. So then I would not catch it, unless there is good reason to change this old behavior about `TimeoutExpired`.
Yes... I hadn't noticed that it would be caught later in the call stack. See #196.