
rfc14: specifying a request for uniform distribution of total child resources across unbounded parent resources #222

Open
SteVwonder opened this issue Jan 29, 2020 · 0 comments


Imagine a scenario where you need X child resources in total. You don't care how many parent resources they are spread across, but you do need the same number of children allocated on each parent.

Two concrete request examples and valid allocations:

  • A user has an MPI application that uses OpenMP for on-node parallelism and distributes a uniform amount of work to each MPI rank. Thus, they need 100 cores (or GPUs) in total and don't care how many nodes are used, but they need the same number of cores allocated on every node.
    • Valid allocations: 10 nodes with 10 cores per node, 20 nodes with 5 cores per node, etc.
    • Invalid allocation: 2 nodes with 48 cores per node plus 1 node with 4 cores
  • A user has an in-memory database that requires a fixed amount of memory, and the database assumes each node has the same amount of memory. Thus, they need 100 TB of memory in total and don't care how many nodes are used, but they need the same amount of memory allocated on every node.
    • Valid allocations: 10 nodes with 10 TB of memory per node, 20 nodes with 5 TB of memory per node, etc.
    • Invalid allocation: 2 nodes with 48 TB of memory per node plus 1 node with 4 TB of memory
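
To make the valid/invalid distinction in the examples above concrete, here is a minimal Python sketch. The function names `is_uniform_allocation` and `uniform_splits` are made up for illustration and are not part of RFC 14 or any Flux API; the sketch only encodes the two conditions stated above (the children sum to the requested total, and every parent holds the same number).

```python
from typing import List, Tuple


def is_uniform_allocation(total: int, per_parent: List[int]) -> bool:
    """True if the children sum to `total` and every parent holds the same count.

    `per_parent` lists the number of children allocated on each parent,
    e.g. [48, 48, 4] for the invalid example above.
    """
    if not per_parent:
        return False
    return sum(per_parent) == total and len(set(per_parent)) == 1


def uniform_splits(total: int) -> List[Tuple[int, int]]:
    """All (parents, children_per_parent) pairs whose product is exactly `total`."""
    return [(n, total // n) for n in range(1, total + 1) if total % n == 0]


if __name__ == "__main__":
    print(is_uniform_allocation(100, [10] * 10))    # 10 nodes x 10 cores
    print(is_uniform_allocation(100, [48, 48, 4]))  # sums to 100, but not uniform
    print(uniform_splits(100))
```

Note that the invalid example does add up to 100, so a total-count check alone would accept it; the uniformity condition is what rejects it.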

There are really two problems here:

  1. We need some way to specify the total count of a particular child resource, independent of the multiplicative effects of the with key (this is currently only possible with tasks)
  2. We need some way to specify that the child resource should be allocated uniformly across the parent resources
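
As a rough illustration of the multiplicative effect mentioned in problem 1, the sketch below walks a request written as nested Python dicts and multiplies counts down the with chain. The helper `total_counts` is hypothetical, and the sketch assumes plain integer counts only (no min/max ranges); it is meant to show why the total number of children today depends on the parent count, not to mirror any real jobspec parser.

```python
from typing import Any, Dict, Optional


def total_counts(resource: Dict[str, Any],
                 multiplier: int = 1,
                 totals: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Return the effective total per resource type in a nested request.

    Each level's count is multiplied by the counts of all of its ancestors,
    which is the multiplicative behavior of `with` assumed here.
    """
    if totals is None:
        totals = {}
    effective = multiplier * resource["count"]
    totals[resource["type"]] = totals.get(resource["type"], 0) + effective
    for child in resource.get("with", []):
        total_counts(child, effective, totals)
    return totals


if __name__ == "__main__":
    # 4 nodes, each with 25 cores: the core total is implied, not stated.
    request = {"type": "node", "count": 4,
               "with": [{"type": "core", "count": 25}]}
    print(total_counts(request))
```

To get 100 cores this way, the user has to fix the node count up front, which is exactly what the request is trying to avoid.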

My best attempt to summarize a discussion with @trws:

  • One thought for solving problem 1 was to add a new label to the resource besides count, something like total-count
  • Building on that, to solve problem 2, we could add another label like total-count-spread (or something else) which could have values of uniform and non-uniform.

This seemed kinda gross and hacky. It also opens up all sorts of questions about sibling resources and how to actually implement this functionality in the scheduler.

The idea that we hated the least was to add an alternative to the with key: across. You would start by specifying the "child" resource that you want a total count of, then use the across key to describe the parent (or subtree) that you want the children uniformly spread across. Strawman example:

resources:
  - type: core
    count: 100
    across:
      type: slot
      count: {min: 1}
      with:
        - type: node
          count: 1
I am still trying to figure out whether sibling resources make sense in the subtree under the across key. I don't think they do. Maybe it makes sense to invert the entire request and keep using across all the way down. Anyway, one nice part about this description is that, from an implementation standpoint, you know from the start that this isn't a normal match/traversal: you are attempting to find a total number of child resources and spread them uniformly across the parent resources.
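
For what it's worth, here is a small Python sketch of what "find a total and spread it uniformly" could look like on a flat list of parents. This is purely illustrative and not how any existing scheduler matches resources; `uniform_spread` and the `free` mapping (parent name to free child count) are made-up names, and the fewest-parents-first policy is just one assumption among many possible ones.

```python
from typing import Dict, Optional


def uniform_spread(total: int, free: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Pick parents so each contributes the same number of children, summing to `total`.

    Tries the fewest parents first and returns None if no uniform split fits.
    """
    # Rank parents by free capacity, largest first.
    ranked = sorted(free.items(), key=lambda kv: kv[1], reverse=True)
    for n_parents in range(1, len(ranked) + 1):
        if total % n_parents:
            continue  # total does not divide evenly over this many parents
        per_parent = total // n_parents
        # The n-th largest parent is the bottleneck for this choice of n.
        if ranked[n_parents - 1][1] >= per_parent:
            return {name: per_parent for name, _ in ranked[:n_parents]}
    return None


if __name__ == "__main__":
    # Mirrors the invalid example: 48 + 48 + 4 free cores cannot give 100 uniformly.
    print(uniform_spread(100, {"node0": 48, "node1": 48, "node2": 4}))
    # Ten nodes with 10 free cores each can.
    print(uniform_spread(100, {f"node{i}": 10 for i in range(10)}))
```

A real implementation would also have to deal with deeper subtrees under across and with the sibling-resource question raised above, which this sketch ignores.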
