Subprocess killed by OOM killer leaves the script hanging forever #230
Comments
I've been having a lot of similar issues with some long-running jobs, where I think a subprocess dies (it's hard to say this is always it; I have seen OOM errors, and I have also seen other errors). It's easy to detect when this happens, e.g. by feeding the pool using `apipe` and detecting when no futures have completed in a while. So perhaps detecting this situation is easier than you suspect. I've tried to code a watchdog based on that, and the watchdog code is triggered, but when I try to recover by calling

```python
pool.close()
pool.join()
pool.clear()
```

that also hangs. Curious if anyone has found a way to recover from this sort of scenario. FWIW, I tried to code up a minimal example that reflects how my code works and wasn't able to reproduce the hang by killing a subprocess manually as the OP could. If anyone has a way to reproduce this situation 100% of the time, I'd appreciate any pointers, as that could help develop workarounds. That code:

```python
import time
import multiprocessing as mp

from pathos.multiprocessing import ProcessPool

MAX_FUTURES = 8

try:
    # needed to avoid deadlocks on unix systems with some libraries, e.g. xgboost
    mp.set_start_method('forkserver')
except RuntimeError:
    pass

def do_something(i):
    print(i, 'entering')
    time.sleep(5)
    print(i, 'returning')
    return i

pool = ProcessPool(2)

i = 0           # next job id to submit
loop_cnt = 0    # watchdog loop iterations, one every 5 s
last_progress = loop_cnt
futures = {}
while True:
    # keep up to MAX_FUTURES jobs in flight
    while len(futures) < MAX_FUTURES:
        futures[i] = pool.apipe(do_something, i)
        i += 1
    # harvest finished futures
    completed = {}
    for _i, future in futures.items():
        if future.ready():
            completed[_i] = future.get()
            last_progress = loop_cnt  # record the last iteration that made progress
    for _i, result in completed.items():
        print(f'got result {result}')
        del futures[_i]
    # watchdog: nothing has completed for 60 s (12 iterations at 5 s each)
    if last_progress < (loop_cnt - 60 // 5):
        print('pool hung for at least 1 minute')
    time.sleep(5)
    loop_cnt += 1
```
I came up with a workaround that works great for my use case: whenever the pool dies, for whatever reason, I'm able to reliably resume from my last checkpoint. In case this is helpful to anyone else:
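The workaround code itself is not preserved in this thread. Purely as an illustration of the kill-and-resume-from-checkpoint pattern described, a sketch could look like the following; `run_batch`, `load_checkpoint`, `save_checkpoint`, the checkpoint file name, and the 600 s budget are all hypothetical, not taken from this thread:

```python
import time
import multiprocessing as mp

from pathos.multiprocessing import ProcessPool

# Hypothetical helpers: persist which job index is already done.
def load_checkpoint():
    try:
        with open('checkpoint.txt') as f:
            return int(f.read())
    except FileNotFoundError:
        return 0

def save_checkpoint(i):
    with open('checkpoint.txt', 'w') as f:
        f.write(str(i))

def do_something(i):
    time.sleep(5)
    return i

def run_batch(start, n_jobs):
    # Drive the pool from a wrapper process, so the parent can kill the
    # whole thing (pool included) if it stops making progress.
    pool = ProcessPool(2)
    for i in range(start, n_jobs):
        # serialized via get() for brevity; a real driver would pipeline jobs
        print('got result', pool.apipe(do_something, i).get())
        save_checkpoint(i + 1)

if __name__ == '__main__':
    N_JOBS = 100
    while load_checkpoint() < N_JOBS:
        worker = mp.Process(target=run_batch, args=(load_checkpoint(), N_JOBS))
        worker.start()
        worker.join(timeout=600)  # watchdog: a batch should finish well within this
        if worker.is_alive():
            # the pool hung (e.g. a worker was OOM-killed): kill the wrapper
            # and resume from the last checkpoint on the next loop iteration
            worker.terminate()
            worker.join()
```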
@ehoppmann would you please provide your code?
Issue
I recently faced an issue where my script was hanging forever after processing all jobs (which I know from the prints), with 0% CPU utilized.
I eventually understood that this was because the OOM killer had killed one of the subprocesses for using too much RAM. A replacement subprocess was immediately created, but it doesn't seem to be used, which leaves Pathos hanging forever.
Possible mitigation
I guess this is quite hard to fix, but would it at least be possible to detect this problem (a subprocess killed externally) and raise an exception in the main script, so that it does not hang forever?
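Until something like that exists, a caller-side guard can at least turn the hang into an error: pathos builds on `multiprocess`, so `apipe` behaves like `apply_async` and returns an async result whose `get(timeout=...)` raises a `TimeoutError` instead of blocking forever when the result never arrives. A minimal sketch, assuming a 60-second budget is reasonable for one job:

```python
import multiprocess
from pathos.multiprocessing import ProcessPool

def work(x):
    return x * x

if __name__ == '__main__':
    pool = ProcessPool(2)
    future = pool.apipe(work, 3)
    try:
        # raises instead of hanging if the worker was killed externally
        print(future.get(timeout=60))
    except multiprocess.TimeoutError:
        raise RuntimeError('no result within 60 s; a subprocess may have died')
```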
To reproduce the problem
If you want to try it yourself, you can reproduce the same behaviour with the following code snippet:
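(A minimal sketch of such a snippet; the original is not reproduced here, and the pool size and sleep duration are arbitrary stand-ins:)

```python
import time
from pathos.multiprocessing import ProcessPool

def work(i):
    print('worker', i, 'starting')
    time.sleep(60)  # long enough to find and kill a subprocess by hand
    return i

if __name__ == '__main__':
    pool = ProcessPool(4)
    # blocks until every job finishes, or forever if a worker is killed
    print(pool.map(work, range(16)))
```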
During the execution, you can use `htop` to kill one of the subprocesses (last 2 lines of my screenshot). If you do this, you'll end up in a hanging state: no CPU use, but the Python script never returns.

Note: type `t` in `htop` to enable tree view, then F4 to filter.