Handle multiple NFS instances/disks #27

Open
julianhess opened this issue Feb 13, 2020 · 3 comments
Labels
enhancement (New feature or request), Internal (Under-the-hood or Dev-ops enhancements)

Comments

julianhess (Collaborator) commented Feb 13, 2020

A single 4 TB persistent disk attached to a 4-core NFS server has a maximum throughput of ~400 MB/s, while the instance itself has a network throughput of 8 gigabits per second (1000 MB/s). Recall that the NFS server also runs jobs, so for a large batch submission where we are sure we will continuously max out a 16-core instance, its network throughput would be 32 Gb/s (4000 MB/s).
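For concreteness, a quick back-of-the-envelope check of those figures (just a sketch; the per-disk and per-core numbers are the estimates above, not benchmarks):

# Back-of-the-envelope throughput math, using the estimates above (not benchmarks).
DISK_MB_S = 400                         # single 4 TB persistent disk, ~400 MB/s
GBIT_TO_MB = 1000 / 8                   # 1 Gb/s = 125 MB/s

network_mb_s_4core = 8 * GBIT_TO_MB     # 8 Gb/s  -> 1000 MB/s
network_mb_s_16core = 32 * GBIT_TO_MB   # 32 Gb/s -> 4000 MB/s

# One disk feeds only a fraction of a busy 16-core instance's network bandwidth:
print(DISK_MB_S / network_mb_s_16core)  # 0.1, i.e. ~10 disks to saturate the NIC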

Thus, for a large number of concurrent workflows, we will need multiple NFS servers and/or multiple disks per NFS server. All disks would be mounted on the controller node and on each worker node, and each task would be randomly assigned to a disk. (Since Slurm allows us to preferentially assign tasks to specific nodes, it would make sense to put the most I/O-intensive jobs on the NFS nodes themselves.)

We will have to very carefully optimize how many jobs a single disk can accommodate. As currently implemented, our pipelines are CPU-bound, not I/O-bound. Localization will likely be the most I/O-intensive step; I’ve ballparked localization maxing out at 2-4 concurrent samples per 4 TB drive, assuming sustained bucket throughput of 100-200 MB/s per file. Perhaps we could do something clever with staggering workflow starts so that there aren’t too many concurrent localization steps.
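One way the staggering could work, as a hypothetical sketch (the semaphore limit is the 2-4 ballpark above; localize_fn stands in for whatever localization call we end up gating):

import threading

# Hypothetical sketch: cap concurrent localizations per disk so workflow starts
# are effectively staggered rather than all hammering the NFS at once.
MAX_CONCURRENT_LOCALIZATIONS = 3        # ~2-4 per 4 TB disk, per the ballpark above
_localize_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LOCALIZATIONS)

def localize_with_backpressure(localize_fn, *args, **kwargs):
    # Blocks until a slot frees up, then runs the (hypothetical) localization call.
    with _localize_slots:
        return localize_fn(*args, **kwargs)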

julianhess added the enhancement (New feature or request) and Internal (Under-the-hood or Dev-ops enhancements) labels on Feb 13, 2020
agraubert (Collaborator) commented

Just to make sure we're on the same page, I think we had discussed two options, right?

wolF Task Spreading

Let wolF be entirely responsible for distributing jobs across the available NFS instances: either rotate through them, queuing one task per NFS, or estimate the throughput needed by each task and use that to determine how tasks are distributed.
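Something like the following, as a rough sketch (nfs_mounts, tasks, and estimated_io are hypothetical placeholders, not existing wolF API):

import itertools

nfs_mounts = ["/mnt/nfs/a", "/mnt/nfs/b", "/mnt/nfs/c"]   # hypothetical mount list

def assign_round_robin(tasks):
    # Rotate through the mounts, queuing one task per NFS in turn.
    rotation = itertools.cycle(nfs_mounts)
    return {task: next(rotation) for task in tasks}

def assign_by_estimated_io(tasks, estimated_io):
    # Greedy alternative: place each task on the currently least-loaded mount.
    load = {mount: 0 for mount in nfs_mounts}
    assignment = {}
    for task in sorted(tasks, key=lambda t: -estimated_io[t]):
        mount = min(load, key=load.get)
        assignment[task] = mount
        load[mount] += estimated_io[task]
    return assignment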

Canine Orchestrator Spreading

The Orchestrator can spread a single task across multiple NFS servers by chunking the input job spec and creating one localizer instance per chunk.
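Roughly, the chunking could look like this (a sketch; make_localizer stands in for constructing whichever canine localizer class is in use, and the even-split policy is an assumption):

def chunk_job_spec(job_spec, n_chunks):
    # Split a {job_id: inputs} dict into up to n_chunks roughly equal shards.
    job_ids = sorted(job_spec)
    shards = [job_ids[i::n_chunks] for i in range(n_chunks)]
    return [{jid: job_spec[jid] for jid in shard} for shard in shards if shard]

def build_localizers(job_spec, staging_dirs, make_localizer):
    # One localizer per NFS mount, each handed its own shard of the job spec.
    shards = chunk_job_spec(job_spec, len(staging_dirs))
    return [(make_localizer(staging_dir), shard)
            for staging_dir, shard in zip(staging_dirs, shards)]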

julianhess (Collaborator, Author) commented

Yup, those were the two options. The former is by far the easier of the two, but it would preclude very large, I/O-intensive scatter jobs.

agraubert (Collaborator) commented

One of the things that has been bothering me is how we report/detect multiple NFS mounts so the orchestrator can shard a pipeline. I think the simplest solution is to allow the localization.staging_dir pipeline argument to also be an array. If it's an array, canine should distribute the job load across those NFS mounts (either by spreading jobs evenly or by fully loading one NFS before moving to the next).

I.e.:

localization:
  staging_dir:
    - /mnt/nfs/a/
    - /mnt/nfs/b/
    - /mnt/nfs/c/

The orchestrator can then create a separate localizer for each NFS it's using. Additionally, the --array argument to sbatch would allow us to modify the task IDs so the pipeline still exists over a contiguous block of job IDs.
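A sketch of that bookkeeping (the even-split policy and the helper name are assumptions, not existing canine behavior):

def split_jobs_across_mounts(n_jobs, staging_dirs):
    # Give each mount a contiguous block of job indices, so each block can be
    # submitted with a single sbatch --array=<start>-<end> and the pipeline's
    # job IDs stay contiguous overall.
    per_mount, remainder = divmod(n_jobs, len(staging_dirs))
    blocks, start = [], 0
    for i, mount in enumerate(staging_dirs):
        count = per_mount + (1 if i < remainder else 0)
        if count:
            end = start + count - 1
            blocks.append((mount, f"--array={start}-{end}"))
            start = end + 1
    return blocks

# e.g. 10 jobs over the three mounts above:
# [("/mnt/nfs/a/", "--array=0-3"), ("/mnt/nfs/b/", "--array=4-6"), ("/mnt/nfs/c/", "--array=7-9")]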
