This is not so much a problem with the recount-pump workflow itself as with many of the compute environments it runs in.
MARCC (/dev/shm), AWS EC2 (using c5d instances with local NVMes), and now Stampede2 (/tmp) all require most of the temporary files to be written to space-constrained filesystems local to the node where the workflow is executing.
Since the temporary file size of any given run accession (job) varies widely, from hundreds of MBs to many GBs, and our job scheduling is primitive, we need to track the temporary disk space usage of a given node while the workflow is running on it.
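As a first step, the node's temp filesystem can be probed directly. Below is a minimal sketch in Python using the standard library's `shutil.disk_usage`; `TEMP_ROOT` is a placeholder for whichever node-local filesystem applies (`/dev/shm` on MARCC, the NVMe mount on EC2 c5d instances, `/tmp` on Stampede2).

```python
import shutil

TEMP_ROOT = "/tmp"  # placeholder; depends on the compute environment

def temp_space_report(path=TEMP_ROOT):
    """Return (used_gb, free_gb, pct_used) for the filesystem at path."""
    usage = shutil.disk_usage(path)
    used_gb = usage.used / 1e9
    free_gb = usage.free / 1e9
    pct_used = 100.0 * usage.used / usage.total
    return used_gb, free_gb, pct_used

if __name__ == "__main__":
    used, free, pct = temp_space_report()
    print(f"temp usage: {used:.1f} GB used, {free:.1f} GB free ({pct:.0f}%)")
```

A workflow wrapper could call this before claiming a new job and defer the job when free space drops below a threshold.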
The problem primarily stems from the following aspects of the workflow:
- The need to make efficient use of typical nodes by running multiple processing jobs on them concurrently, which increases the total temporary space needed at any one time
- Failed jobs leave their temporary files behind
- Temporary files are typically large (FASTQs/BAMs)
To at least partially address this, we therefore need a way to 1) track temporary disk space usage and 2) continuously monitor for and remove failed jobs' temporary files. A sketch of such a monitor follows.
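How the monitor identifies failed jobs is an open design question. The sketch below assumes a hypothetical convention where each job writes under a temp directory named `<accession>.<pid>`; a directory whose PID is no longer alive can then be treated as orphaned and swept. Nothing about this naming is currently part of recount-pump; it is one possible implementation.

```python
import os
import shutil
import time

TEMP_ROOT = "/tmp"  # placeholder; /dev/shm on MARCC, NVMe mount on EC2, etc.
POLL_SECS = 60

def pid_alive(pid):
    """True if a process with this PID still exists on this node."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without killing
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user

def sweep(temp_root=TEMP_ROOT):
    """Remove per-job temp dirs whose owning process has exited."""
    for entry in os.scandir(temp_root):
        if not entry.is_dir():
            continue
        # hypothetical convention: temp dirs are named "<accession>.<pid>"
        name, _, pid_str = entry.name.rpartition(".")
        if not (name and pid_str.isdigit()):
            continue  # not one of ours; leave it alone
        if not pid_alive(int(pid_str)):
            print(f"removing orphaned temp dir: {entry.path}")
            shutil.rmtree(entry.path, ignore_errors=True)

if __name__ == "__main__":
    while True:
        sweep()
        time.sleep(POLL_SECS)
```

Run alongside the workflow on each node, this reclaims space from failed jobs without touching directories whose jobs are still running.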