Add timeout to SSH queue scan command #191
If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a `Too many open files` error.

I've also changed the hardcoded 30 second interval for an interval depending on `gs.WAIT_PERIOD_BETWEEN_CHECKS`, which I find makes more sense. (Edit: not anymore, see below.)

Fix #190.
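For orientation, here is a minimal sketch of the change discussed in this PR, based on the diff quoted further down in the thread. The real code is a method on the SLURM engine class and differs in details; treat this as a sketch, not the actual implementation.

```python
import subprocess

def system_call(system_command, send_to_stdin=None, timeout=30):
    # Sketch of the fixed call: if ssh falls back to an interactive password
    # prompt, subprocess.run() aborts after `timeout` seconds, kills the child,
    # and raises subprocess.TimeoutExpired instead of blocking forever.
    if send_to_stdin:
        send_to_stdin = send_to_stdin.encode()
    p = subprocess.run(system_command, input=send_to_stdin,
                       capture_output=True, timeout=timeout)
    return p.stdout, p.stderr, p.returncode
```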
So you add … What about …? Or …?
Also, I wonder about this. How can this happen? Are there uncleaned zombie procs? Did you check this? I think this is another separate thing you can and should fix.
Edit: I just checked. I think the relevant code in subprocess is this:

    if self.returncode is None and _active is not None:
        # Child is still running, keep us alive until we can wait on it.
        _active.append(self)

So, it means, if the proc never dies itself, it will always stay alive. To kill it, you must explicitly kill it somewhere, but it seems we never do that? Btw, if we would just use …
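For reference, the pattern the Python docs give for `Popen.communicate` with a timeout makes this explicit kill visible; in CPython, `subprocess.run` performs essentially the same cleanup internally when the timeout expires, which is why switching to it sidesteps the problem. The ssh command below is just an illustration, not sisyphus code.

```python
import subprocess

proc = subprocess.Popen(["ssh", "somehost", "squeue"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=30)
except subprocess.TimeoutExpired:
    # Without this explicit kill, a hung ssh (e.g. waiting for a password)
    # stays alive and keeps its pipes / file descriptors open.
    proc.kill()
    out, err = proc.communicate()
```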
I think that's a much more intuitive solution than what I proposed.
Additionally, I still would add the … This way password requests are disabled.
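The option referred to here is not visible in this export; an assumption is that it is ssh's `BatchMode`, which makes ssh fail immediately instead of prompting for a password. A hypothetical sketch (host and queue command are placeholders, not sisyphus code):

```python
import subprocess

ssh_host = "cluster-head-node"        # placeholder
queue_command = ["squeue", "-u", "someuser"]  # placeholder queue scan command
# "-o BatchMode=yes" makes ssh fail right away instead of prompting for a
# password, so the call cannot block on interactive input.
system_command = ["ssh", "-o", "BatchMode=yes", ssh_host] + queue_command
p = subprocess.run(system_command, capture_output=True, timeout=30)
```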
Btw, I wonder about similar wrong usages of …
Avoid redundant lines of code
I can do that as well. I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.
I see the same wrong usage in these files:
I would not do that for now. It's basically one line of code (…).
I agree with Albert here.
Roger that :) will implement right away.
Looks all fine now. Except maybe this aspect:
I wonder if it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e., for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.
Okay, I'll change it back. I thought it made sense given the name. For now I'll change it back to 30 seconds, but upon reviewing the …
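What a separate, dedicated setting could look like, as suggested above. This is a sketch; `SSH_COMMAND_TIMEOUT` is a hypothetical name, not an existing sisyphus setting.

```python
# Hypothetical separate setting (name made up for illustration); in sisyphus
# this kind of constant lives in the global settings module imported as "gs".
SSH_COMMAND_TIMEOUT = 30  # seconds before an ssh queue scan / submit command is aborted
```

The engine would then pass `timeout=gs.SSH_COMMAND_TIMEOUT` instead of the hardcoded 30.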
I've been testing this, and while my manager didn't crash anymore because of too many open files, there was abnormally high memory usage that kept growing over time whenever the subprocesses didn't finish successfully:
When messages like this one were spammed, I noticed a very high memory usage from the manager's side. I'll try to review the code a bit more.
The code looks fine to me though (sorry, forgot to review again, did that now). You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?
    if send_to_stdin:
        send_to_stdin = send_to_stdin.encode()
    -   out, err = p.communicate(input=send_to_stdin, timeout=30)
    +   p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
I wonder, don't you need to catch `TimeoutExpired`?
Yes, you're right, I'll ignore any `TimeoutExpired`.
So what does it mean? What happens now? The whole manager crashes?
I don't think I had any `TimeoutExpired` exception; my error was within the timeout because of a public key mismatch, and it was actually output within the first second after the command was run:
[2024-06-13 07:24:26,842] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]
[2024-06-13 07:24:26,953] ERROR: Error to submit job, return value: 255
Actually, I realize now, in the code before your change, we also did not catch `TimeoutExpired` (in the `p.wait(timeout=30)`). I wonder why we did not catch it, and what happened when this exception was thrown there. Do you know? (As I understood you, you have run into this exact problem.)
Maybe we should not change this and also not catch `TimeoutExpired`? @critias?
I'm testing with a simple script whether there are any problems w.r.t. memory consumption or so. Script here, or just directly here:

    import subprocess

    while True:
        try:
            subprocess.run(["cat", "/dev/zero"], timeout=0.01)
        except subprocess.TimeoutExpired:
            print("TimeoutExpired")

Just run this, and watch the memory consumption meanwhile. I don't see that there is any increase in memory. So, I don't think that this is the problem.
What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?
Agreed, I panicked because I was making our head node go super slow and canceled the program 😅 I'll make a mental note to scan …
It doesn't wait for the full timeout (30 seconds), but it's not spammy (as in < 1 second per print) either:
The job immediately fails because of …
Why do you actually think the memory leak is related to the PR here? It only increases when you get those "Error to submit job" messages and otherwise stays constant?

So are there alive subprocs or not? Are there any zombie procs? (What does … say?)

What are the threads? (E.g. what does … show?)

Where is the memory allocated? (What does a mem profiler say?)
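One way to gather this information from inside the manager process (e.g. from a debug hook), as a sketch; it assumes the third-party psutil package, which is not used anywhere in this thread:

```python
import threading
import psutil  # third-party, assumed available just for this diagnostic sketch

proc = psutil.Process()  # the manager's own process
print("RSS MiB:", proc.memory_info().rss / 2**20)
print("threads:", [t.name for t in threading.enumerate()])
for child in proc.children(recursive=True):
    # A defunct child would show up here with status "zombie".
    print(child.pid, child.name(), child.status())
```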
Yes, it could very well be the case that there's some other error going on that's not related to the PR, but I was able to observe it after doing this change, so I've reported it here. If you want, we can move this discussion to another PR. There are no zombie processes. Sadly I don't have sudo powers and ptrace is not enabled, so I can't dump stats for …
You don't need sudo for that. Just some admin has to enable ptrace. Then you can use it without sudo. Please do that. It doesn't make sense that we waste time here by just guessing around instead of properly debugging the issue.
Thanks for the suggestion, I've already asked to enable ptrace for everyone. In the meantime, one of our admins got me the trace we were looking for: pystack_threads_tracker.log. From the log it would seem as if half of the threads (~50) were just starting up (see …).
Is this the trace from the time when the problem occurs, i.e. while messages like this one were being spammed, or during the memory increase when watching memory usage in htop or so? It's important to get the trace at exactly this time. Also, take not just one trace but several traces during that period, to be sure you don't miss some interesting bits.

From that trace, I can only say that there don't seem to be any weird threads running. But if, as you said, there are no more threads coming up (specifically during the mem increase period), the threads are likely not the issue. Also, as there are no zombie procs, no excessive number of subprocs, nor too many open files, I don't see any indication that the mem leak is related to this PR here. I still wonder why you got the idea that it might be related. Or does this abnormally high memory usage which you observe only occur now with this PR, and has it never occurred before?

As the next step, I would do some actual memory profiling, to see where the memory increase occurs in the code. In any case, I think independently of this, we can go forward with this PR here. There are some outstanding issues which I commented on.
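For the suggested memory profiling, the standard library's `tracemalloc` is one option; where exactly to hook this into the manager is an assumption, not something discussed in the thread.

```python
import tracemalloc

tracemalloc.start()  # e.g. at manager startup

# ... let the manager run until the memory increase is clearly visible ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    # Shows the source lines responsible for the largest live allocations.
    print(stat)
```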
Definitely agreed, this was off-topic. @albertz I've fixed your feedback. |
    -   out, err = p.communicate(input=send_to_stdin, timeout=30)
    +   try:
    +       p = subprocess.run(system_command, input=send_to_stdin, capture_output=True, timeout=30)
    +   except subprocess.TimeoutExpired:
If we catch the `subprocess.TimeoutExpired` exception here and return, then the whole logic about retrying in `submit_helper` would never be triggered:

sisyphus/sisyphus/simple_linux_utility_for_resource_management_engine.py, lines 239 to 246 in 51db853:

    while True:
        try:
            out, err, retval = self.system_call(sbatch_call)
        except subprocess.TimeoutExpired:
            logging.warning(self._system_call_timeout_warn_msg(command))
            time.sleep(gs.WAIT_PERIOD_SSH_TIMEOUT)
            continue
        break
If this was intentional, then maybe also remove this logic? And `gs.WAIT_PERIOD_SSH_TIMEOUT` would be ignored then.
Ah, so it is actually handled. That was my earlier question about this. So then I would not catch it, unless there is good reason to change this old behavior about `TimeoutExpired`.
Yes... I hadn't noticed that it would be caught later in the call stack. See #196.