Handle multiple NFS instances/disks #27

Open
julianhess opened this issue Feb 13, 2020 · 3 comments
Labels
enhancement (New feature or request), Internal (Under-the-hood or Dev-ops enhancements)

Comments

julianhess (Collaborator) commented Feb 13, 2020

A single 4 TB persistent disk attached to a 4-core NFS server has a maximum throughput of ~400 MB/s, while the instance itself has a network throughput of 8 gigabits per second (1000 MB/s). Recall that the NFS server also runs jobs, so for a large batch submission where we are sure we will continuously max out a 16-core instance, its network throughput would be 32 Gb/s (4000 MB/s).
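For concreteness, a quick back-of-the-envelope check of those figures (just a sketch; the per-disk and per-core numbers are the estimates above, not benchmarks):

# Back-of-the-envelope throughput math, using the estimates above (not benchmarks).
DISK_MB_S = 400                         # single 4 TB persistent disk, ~400 MB/s
GBIT_TO_MB = 1000 / 8                   # 1 Gb/s = 125 MB/s

network_mb_s_4core = 8 * GBIT_TO_MB     # 8 Gb/s  -> 1000 MB/s
network_mb_s_16core = 32 * GBIT_TO_MB   # 32 Gb/s -> 4000 MB/s

# One disk feeds only a fraction of a busy 16-core instance's network bandwidth:
print(DISK_MB_S / network_mb_s_16core)  # 0.1, i.e. ~10 disks to saturate the NIC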

Thus, for a large number of concurrent workflows, we will need multiple NFS servers and/or multiple disks per NFS server. All disks would be mounted on the controller node and on each worker node, and each task would be randomly assigned to a disk. (Since Slurm allows us to preferentially assign tasks to specific nodes, it would make sense to put the most I/O-intensive jobs on the NFS nodes themselves.)

We will have to very carefully optimize how many jobs a single disk can accommodate. As currently implemented, our pipelines are CPU-bound, not I/O-bound. Localization will likely be the most I/O-intensive step; I’ve ballparked localization maxing out at 2-4 concurrent samples per 4 TB drive, assuming sustained bucket throughput of 100-200 MB/s per file. Perhaps we could do something clever with staggering workflow starts so that there aren’t too many concurrent localization steps.
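One way the staggering could work, as a hypothetical sketch (the semaphore limit is the 2-4 ballpark above; localize_fn stands in for whatever localization call we end up gating):

import threading

# Hypothetical sketch: cap concurrent localizations per disk so workflow starts
# are effectively staggered rather than all hammering the NFS at once.
MAX_CONCURRENT_LOCALIZATIONS = 3        # ~2-4 per 4 TB disk, per the ballpark above
_localize_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LOCALIZATIONS)

def localize_with_backpressure(localize_fn, *args, **kwargs):
    # Blocks until a slot frees up, then runs the (hypothetical) localization call.
    with _localize_slots:
        return localize_fn(*args, **kwargs)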

julianhess added the enhancement (New feature or request) and Internal (Under-the-hood or Dev-ops enhancements) labels on Feb 13, 2020
agraubert (Collaborator) commented

Just to make sure we're on the same page, I think we had discussed two options, right?

wolF Task Spreading

Let wolF be entirely responsible for distributing jobs across the available NFS instances: either rotate through them, queuing one task per NFS, or estimate the throughput needed by each task and use that to determine how tasks are distributed.
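Something like the following, as a rough sketch (nfs_mounts, tasks, and estimated_io are hypothetical placeholders, not existing wolF API):

import itertools

nfs_mounts = ["/mnt/nfs/a", "/mnt/nfs/b", "/mnt/nfs/c"]   # hypothetical mount list

def assign_round_robin(tasks):
    # Rotate through the mounts, queuing one task per NFS in turn.
    rotation = itertools.cycle(nfs_mounts)
    return {task: next(rotation) for task in tasks}

def assign_by_estimated_io(tasks, estimated_io):
    # Greedy alternative: place each task on the currently least-loaded mount.
    load = {mount: 0 for mount in nfs_mounts}
    assignment = {}
    for task in sorted(tasks, key=lambda t: -estimated_io[t]):
        mount = min(load, key=load.get)
        assignment[task] = mount
        load[mount] += estimated_io[task]
    return assignment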

Canine Orchestrator Spreading

The Orchestrator can spread a single task across multiple NFS servers by chunking the input job spec and creating one localizer instance per chunk.
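Roughly, the chunking could look like this (a sketch; make_localizer stands in for constructing whichever canine localizer class is in use, and the even-split policy is an assumption):

def chunk_job_spec(job_spec, n_chunks):
    # Split a {job_id: inputs} dict into up to n_chunks roughly equal shards.
    job_ids = sorted(job_spec)
    shards = [job_ids[i::n_chunks] for i in range(n_chunks)]
    return [{jid: job_spec[jid] for jid in shard} for shard in shards if shard]

def build_localizers(job_spec, staging_dirs, make_localizer):
    # One localizer per NFS mount, each handed its own shard of the job spec.
    shards = chunk_job_spec(job_spec, len(staging_dirs))
    return [(make_localizer(staging_dir), shard)
            for staging_dir, shard in zip(staging_dirs, shards)]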

julianhess (Collaborator, Author) commented

Yup, those were the two options. The former is by far the easier of the two, but it would preclude very large, I/O-intensive scatter jobs.

agraubert (Collaborator) commented

One of the things that has been bothering me is how we report/detect multiple NFS mounts so the orchestrator can shard a pipeline. I think the simplest solution is to allow the localization.staging_dir pipeline argument to also be an array. If it's an array, canine should distribute the job load across those NFS mounts (either by spreading jobs evenly or by fully loading one NFS before moving to the next).

I.e.:

localization:
  staging_dir:
    - /mnt/nfs/a/
    - /mnt/nfs/b/
    - /mnt/nfs/c/

The orchestrator can then create a separate localizer for each NFS it's using. Additionally, the --array argument to sbatch would allow us to modify the task IDs so the pipeline still exists over a contiguous block of job IDs.
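A sketch of that bookkeeping (the even-split policy and the helper name are assumptions, not existing canine behavior):

def split_jobs_across_mounts(n_jobs, staging_dirs):
    # Give each mount a contiguous block of job indices, so each block can be
    # submitted with a single sbatch --array=<start>-<end> and the pipeline's
    # job IDs stay contiguous overall.
    per_mount, remainder = divmod(n_jobs, len(staging_dirs))
    blocks, start = [], 0
    for i, mount in enumerate(staging_dirs):
        count = per_mount + (1 if i < remainder else 0)
        if count:
            end = start + count - 1
            blocks.append((mount, f"--array={start}-{end}"))
            start = end + 1
    return blocks

# e.g. 10 jobs over the three mounts above:
# [("/mnt/nfs/a/", "--array=0-3"), ("/mnt/nfs/b/", "--array=4-6"), ("/mnt/nfs/c/", "--array=7-9")]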
