
Individual node failure causes whole workflow failures due to queue draining #2

Open
ChristopherWilks opened this issue Mar 24, 2020 · 1 comment


@ChristopherWilks
Collaborator

This has been a known issue for a while, but finally documenting it here.

The problem is that if even one worker on one node fails for a non-job-specific reason (e.g. running out of disk space), the failing worker will rapidly attempt every remaining job on the queue. This quickly starves the workers on other nodes until they prematurely exit or just idle. The queue itself will either eventually recover as the failed jobs become visible again, or the jobs will land in the dead-letter queue (DLQ), from which they must be manually reloaded onto the main queue.
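
If the queue is AWS SQS (which the visibility/DLQ behavior suggests), the manual reload back onto the main queue looks roughly like the sketch below. The queue names are placeholders, not the project's actual queues.

```python
# Hypothetical sketch of the manual DLQ reload, assuming the job queue is AWS SQS.
# Queue names below are placeholders.
import boto3

sqs = boto3.resource("sqs")
main_q = sqs.get_queue_by_name(QueueName="monorail-jobs")      # placeholder name
dlq = sqs.get_queue_by_name(QueueName="monorail-jobs-dlq")     # placeholder name

while True:
    messages = dlq.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=2)
    if not messages:
        break
    for msg in messages:
        # Re-enqueue the job body on the main queue, then delete it from the DLQ.
        main_q.send_message(MessageBody=msg.body)
        msg.delete()
```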

Initially, the parent worker would exit after all child workers exited (either cleanly or not).
Subsequent changes made to run on JHPCE (prompted by other failures) altered the behavior of cluster.py so that failed child worker processes are restarted, potentially endlessly.

In the Stampede2 environment, with the current requirement (3/24/2020) to write many of our small temporary files to the node's local /tmp, it's clear that we need a per-node limit on worker failures in the cluster.py code. This is also true in the MARCC environment, which is likewise constrained by local disk space (typically /dev/shm), and on AWS EC2, where the local NVMe drives of the c5d instances have limited space.
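
A minimal sketch of what such a per-node cap could look like, assuming cluster.py launches child workers as separate processes; the names (`supervise`, `run_worker`, `MAX_NODE_FAILURES`) are illustrative, not the actual cluster.py API.

```python
# Sketch: stop restarting workers once this node has seen too many failures,
# on the assumption that the node itself (e.g. its local /tmp) is the problem.
import multiprocessing as mp

MAX_NODE_FAILURES = 5  # illustrative threshold

def supervise(num_workers, run_worker):
    procs = [mp.Process(target=run_worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    failures = 0
    alive = set(procs)
    while alive:
        for p in list(alive):
            p.join(timeout=1)
            if p.exitcode is None:
                continue  # still running
            alive.discard(p)
            if p.exitcode != 0:
                failures += 1
                if failures >= MAX_NODE_FAILURES:
                    # Node looks unhealthy; stop restarting workers and let
                    # the node drop out instead of draining the queue.
                    for q in alive:
                        q.terminate()
                    return False
    return True
```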

The current solution is to revert to the original behavior: quit the parent process after all child workers have exited, whether by error or cleanly. This appears to be working on AWS EC2, but not on Stampede2, which needs further investigation.
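
For comparison, the reverted behavior amounts to something like the following: no restarts, just wait for every child to finish and then let the parent (and hence the node) exit. Again, the names are illustrative rather than the actual cluster.py code.

```python
# Sketch of the reverted behavior, assuming child workers are separate processes.
import multiprocessing as mp

def run_node(num_workers, run_worker):
    procs = [mp.Process(target=run_worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    # No restarts: wait for every child to exit, cleanly or not, then let the
    # parent process exit so the node stops pulling jobs off the queue.
    for p in procs:
        p.join()
```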

@ChristopherWilks
Collaborator Author

The current issue with Stampede2 may result from jobs failing quickly because they are actually bad jobs (e.g. these accessions can't be downloaded from SRA).
