
Individual node failure causes whole workflow failures due to queue draining #2

Open
ChristopherWilks opened this issue Mar 24, 2020 · 1 comment


@ChristopherWilks
Collaborator

This has been a known issue for a while, but finally documenting it here.

The problem is that if even one worker on one node fails for a non-job-specific reason (e.g. running out of disk space), the failing worker will rapidly attempt every remaining job on the queue. This quickly starves the workers on other nodes until they prematurely exit or just idle. The queue itself will either eventually recover as the failed jobs become visible again, or the jobs will land in the dead-letter queue (DLQ), from which they must be manually reloaded onto the main queue.
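
If the queue is AWS SQS (which the visibility/DLQ behavior suggests), the manual reload back onto the main queue looks roughly like the sketch below. The queue names are placeholders, not the project's actual queues.

```python
# Hypothetical sketch of the manual DLQ reload, assuming the job queue is AWS SQS.
# Queue names below are placeholders.
import boto3

sqs = boto3.resource("sqs")
main_q = sqs.get_queue_by_name(QueueName="monorail-jobs")      # placeholder name
dlq = sqs.get_queue_by_name(QueueName="monorail-jobs-dlq")     # placeholder name

while True:
    messages = dlq.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=2)
    if not messages:
        break
    for msg in messages:
        # Re-enqueue the job body on the main queue, then delete it from the DLQ.
        main_q.send_message(MessageBody=msg.body)
        msg.delete()
```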

Initially, the parent worker would exit after all child workers exited (either cleanly or not).
Subsequent changes made to run on JHPCE (prompted by other failures) altered the behavior of cluster.py so that failed child worker processes are restarted, potentially endlessly.

In the Stampede2 environment, with the current requirement (3/24/2020) to write many of our small temporary files to the node's local /tmp, it's clear that we need a per-node limit on worker failures in the cluster.py code. This is also true in the MARCC environment, which is likewise constrained by local disk space (typically /dev/shm), and on AWS EC2, where the local NVMe drives of the c5d instances have limited space.
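
A minimal sketch of what such a per-node cap could look like, assuming cluster.py launches child workers as separate processes; the names (`supervise`, `run_worker`, `MAX_NODE_FAILURES`) are illustrative, not the actual cluster.py API.

```python
# Sketch: stop restarting workers once this node has seen too many failures,
# on the assumption that the node itself (e.g. its local /tmp) is the problem.
import multiprocessing as mp

MAX_NODE_FAILURES = 5  # illustrative threshold

def supervise(num_workers, run_worker):
    procs = [mp.Process(target=run_worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    failures = 0
    alive = set(procs)
    while alive:
        for p in list(alive):
            p.join(timeout=1)
            if p.exitcode is None:
                continue  # still running
            alive.discard(p)
            if p.exitcode != 0:
                failures += 1
                if failures >= MAX_NODE_FAILURES:
                    # Node looks unhealthy; stop restarting workers and let
                    # the node drop out instead of draining the queue.
                    for q in alive:
                        q.terminate()
                    return False
    return True
```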

The current solution is to revert to the original behavior: quit the parent process after all child workers have exited, whether by error or cleanly. This appears to be working on AWS EC2, but not on Stampede2, which needs further investigation.
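
For comparison, the reverted behavior amounts to something like the following: no restarts, just wait for every child to finish and then let the parent (and hence the node) exit. Again, the names are illustrative rather than the actual cluster.py code.

```python
# Sketch of the reverted behavior, assuming child workers are separate processes.
import multiprocessing as mp

def run_node(num_workers, run_worker):
    procs = [mp.Process(target=run_worker, args=(i,)) for i in range(num_workers)]
    for p in procs:
        p.start()
    # No restarts: wait for every child to exit, cleanly or not, then let the
    # parent process exit so the node stops pulling jobs off the queue.
    for p in procs:
        p.join()
```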

@ChristopherWilks
Collaborator Author

The current issue with Stampede2 may result from jobs failing quickly because they are actually bad jobs (e.g. these accessions can't be downloaded from SRA).
