What happened:
Suppose a system has ClusterQueues (CQs) sharing a Cohort, and two of these CQs are each trying to admit a workload. These CQs lend capacity and are configured with reclaimWithinCohort=any.
Suppose WL1's requests fit within the nominal capacity of CQ1, but the sum of the running workloads' requests and the new workload's requests exceeds the nominal quota plus borrowing limit of CQ1.
Suppose WL2 fits within the nominal capacity of CQ2, even considering other running workloads in CQ2. Suppose that the excess capacity of CQ2 is being lent out to, and used by, other ClusterQueues in the Cohort, so that CQ2 needs to issue preemptions to reclaim its nominal quota.
WL1 will be considered not borrowing (code). If WL1 was created before WL2, and priority sorting and fair sharing are both disabled, WL1 will be processed first in a scheduling cycle (code).
It may end up reserving capacity in the Cohort (code), which WL2 is depending on to be able to schedule (code). WL2 is blocked indefinitely, unable to issue preemptions until WL1 successfully schedules.
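The ordering problem above can be illustrated with a small model. This is only a sketch of the behavior as described in this report, not Kueue's actual code: the `Entry` type, the field names, the numbers, and the assumption that the borrowing check looks only at the workload's own request against nominal quota are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    created: int        # creation order; lower means older
    request: int        # quota requested by the pending workload
    cq_nominal: int     # nominal quota of the workload's ClusterQueue
    cq_borrow_limit: int
    cq_usage: int       # quota used by workloads already running in that CQ


def considered_borrowing(e: Entry) -> bool:
    # Assumption: the request alone is compared against nominal quota,
    # which is why WL1 is classified as "not borrowing" in this model.
    return e.request > e.cq_nominal


def fits_with_borrowing(e: Entry) -> bool:
    # True if running usage plus the request stays within nominal + borrowing limit.
    return e.cq_usage + e.request <= e.cq_nominal + e.cq_borrow_limit


def scheduling_order(entries):
    # Non-borrowing entries first, then oldest first
    # (priority sorting and fair sharing disabled).
    return sorted(entries, key=lambda e: (considered_borrowing(e), e.created))


wl1 = Entry("WL1", created=1, request=4, cq_nominal=10, cq_borrow_limit=5, cq_usage=12)
wl2 = Entry("WL2", created=2, request=4, cq_nominal=10, cq_borrow_limit=5, cq_usage=2)

order = [e.name for e in scheduling_order([wl1, wl2])]
print(order)
```

In this model both workloads are classified as not borrowing, so the older WL1 is processed first even though it exceeds nominal plus borrowing limit in CQ1.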
What you expected to happen:
Even without FairSharing enabled, WL2 should be sorted before WL1 and be able to issue preemptions immediately, since it fits within nominal capacity without needing to borrow.
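A minimal sketch of the expected ordering, assuming the borrowing check accounted for the usage of workloads already running in the ClusterQueue. The `Pending` type, field names, and numbers are hypothetical and chosen only to match the scenario in this report; this is not a proposed patch.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    name: str
    created: int      # creation order; lower means older
    request: int      # quota requested by the pending workload
    cq_nominal: int   # nominal quota of the workload's ClusterQueue
    cq_usage: int     # quota used by workloads already running in that CQ


def needs_borrowing(p: Pending) -> bool:
    # Hypothetical check: include running usage, not just the request itself.
    return p.cq_usage + p.request > p.cq_nominal


def expected_order(pending):
    # Workloads that fit within nominal quota go first, then oldest first.
    return sorted(pending, key=lambda p: (needs_borrowing(p), p.created))


pending_wl1 = Pending("WL1", created=1, request=4, cq_nominal=10, cq_usage=12)
pending_wl2 = Pending("WL2", created=2, request=4, cq_nominal=10, cq_usage=2)

expected = [p.name for p in expected_order([pending_wl1, pending_wl2])]
print(expected)
```

With usage included, WL1 counts as borrowing and WL2 does not, so WL2 is sorted first despite being newer.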
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
FairSharing should solve this problem, as CQ1+WL1 will have a higher DominantResourceShare than CQ2+WL2.
Environment:
Kueue version: 0.8.1