This has been a known issue for a while, but we're finally documenting it here.
The problem is that if even one worker on one node fails for a non-job-specific reason (e.g. running out of disk space), that failing worker will rapidly attempt, and fail, every remaining job on the queue. This quickly starves the workers on other nodes, which then either exit prematurely or just idle. The queue itself either eventually recovers as the failed jobs become visible again, or the jobs are dropped into the DLQ, from which they have to be manually reloaded onto the main queue.
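For reference, the "manual reloading" step can be scripted roughly as below, assuming the queue is SQS (which matches the DLQ/visibility terminology above); the queue URLs and the script itself are hypothetical and not part of cluster.py:

```python
# Hypothetical redrive script: moves messages from the DLQ back to the main
# queue. Assumes an SQS-style queue accessed via boto3; the queue URLs are
# placeholders, not the project's actual configuration.
import boto3

sqs = boto3.client("sqs")
MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"      # hypothetical
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-dlq"         # hypothetical

def redrive_dlq(batch_size=10):
    """Receive messages from the DLQ and re-send them to the main queue."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL,
            MaxNumberOfMessages=batch_size,
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained
        for msg in messages:
            # Re-enqueue the job body, then delete it from the DLQ so it
            # is not re-driven twice.
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    redrive_dlq()
```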
Initially, the parent worker would exit after all child workers exited (either cleanly or not).
Subsequent changes made for running on JHPCE (prompted by other failures) altered the behavior of cluster.py so that failed child worker processes are restarted, potentially endlessly.
In the stampede2 environment, with the current requirement (as of 3/24/2020) to write many of our small temporary files to the node's local /tmp, it's clear that we need a per-node limit on worker failures in the cluster.py code. The same is true in the MARCC environment, which is also constrained by local disk space (typically /dev/shm), and on AWS EC2, where the local NVMe drives of the c5d instances have limited space.
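A per-node failure limit could look roughly like the sketch below. This is only an illustration of the idea; the names (MAX_WORKER_FAILURES, run_worker, supervise) are hypothetical and not taken from the actual cluster.py:

```python
# Hypothetical sketch of a per-node cap on worker failures.
import multiprocessing
import sys

MAX_WORKER_FAILURES = 5   # hypothetical per-node cap

def supervise(num_workers, run_worker):
    """Restart failed child workers, but stop the whole node once the
    cumulative failure count exceeds MAX_WORKER_FAILURES."""
    failures = 0
    procs = [multiprocessing.Process(target=run_worker) for _ in range(num_workers)]
    for p in procs:
        p.start()
    while procs:
        for p in list(procs):
            p.join(timeout=1)
            if p.is_alive():
                continue
            procs.remove(p)
            if p.exitcode != 0:
                failures += 1
                if failures > MAX_WORKER_FAILURES:
                    # Likely a node-level problem (e.g. full local disk):
                    # stop restarting workers and let the node exit so it
                    # stops draining the queue.
                    for q in procs:
                        q.terminate()
                    sys.exit(1)
                # Below the cap: restart the failed worker.
                replacement = multiprocessing.Process(target=run_worker)
                replacement.start()
                procs.append(replacement)
```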
The current solution is to revert to the original behavior: quit the parent process after all child workers have exited, whether by error or cleanly. This appears to be working on AWS EC2, but not on Stampede2, which needs further investigation.
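The reverted behavior amounts to something like the following sketch (again hypothetical, with illustrative names, not the literal cluster.py code): the parent just joins all child workers and exits once every one of them has finished, without any restarts.

```python
# Hypothetical sketch of the reverted behavior: join all child workers,
# then exit; a failed worker stays dead instead of being restarted.
import multiprocessing
import sys

def run_all(num_workers, run_worker):
    procs = [multiprocessing.Process(target=run_worker) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Exit non-zero if any child failed, so the scheduler sees the node failed.
    sys.exit(0 if all(p.exitcode == 0 for p in procs) else 1)
```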
The current issue with Stampede2 may result from jobs failing quickly because they are actually bad jobs (e.g. accessions that can't be downloaded from SRA).