
fix: affinity priority #1548

Open
wants to merge 1 commit into base: main

Conversation

helen-frank (Contributor)

Fixes #1418

Description
Prioritize the scheduling of pods with anti-affinity or topologySpreadConstraints.
How was this change tested?
I have 10 pending pods:

pod1: 1c1g requests, with anti-affinity; cannot be scheduled on the same node as pod10 and pod9.
pod2–pod8: 1c1g requests; no anti-affinity configured.
pod9: 1c1g requests, with anti-affinity; cannot be scheduled on the same node as pod1 and pod10.
pod10: 1c1g requests, with anti-affinity; cannot be scheduled on the same node as pod1 and pod9.

The goal is for the resources of the three nodes to be evenly distributed, for example:

node1: c7a.4xlarge, 8c16g (4 pods)
node2: c7a.xlarge, 4c8g (3 pods)
node3: c7a.xlarge, 4c8g (3 pods)
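
For reference, the kind of mutual anti-affinity described above for pod1, pod9, and pod10 can be expressed as in the sketch below. This is illustrative only: the label key/value (`app-group: exclusive`) and the hostname topology key are assumptions, not taken from this PR.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exclusivePodAffinity builds a required pod anti-affinity term that keeps
// pods labeled app-group=exclusive (a hypothetical label) off each other's
// nodes, which is how pod1, pod9, and pod10 would exclude one another.
func exclusivePodAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app-group": "exclusive"},
				},
				// Exclude at node granularity.
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```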

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

k8s-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Aug 11, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helen-frank
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Aug 11, 2024
k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Aug 11, 2024
github-actions bot

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

github-actions bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Aug 25, 2024
helen-frank changed the title from "[WIP] fix: affinity priority" to "fix: affinity priority" on Aug 30, 2024
k8s-ci-robot removed the do-not-merge/work-in-progress label on Aug 30, 2024
@helen-frank (Contributor, Author)

Current Test Results:

❯ kubectl get nodeclaims
NAME            TYPE               CAPACITY   ZONE          NODE                             READY   AGE
default-8wq87   c-8x-amd64-linux   spot       test-zone-d   blissful-goldwasser-3014441860   True    67s
default-chvld   c-4x-amd64-linux   spot       test-zone-b   exciting-wescoff-4170611030      True    67s
default-kbr7n   c-2x-amd64-linux   spot       test-zone-d   vibrant-aryabhata-969189106      True    67s
❯ kubectl get pod -owide
NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE                             NOMINATED NODE   READINESS GATES
nginx1-67877d4f4d-nbmj7    1/1     Running   0          77s   10.244.1.0   vibrant-aryabhata-969189106      <none>           <none>
nginx10-6685645984-sjftg   1/1     Running   0          76s   10.244.2.2   exciting-wescoff-4170611030      <none>           <none>
nginx2-5f45bfcb5b-flrlw    1/1     Running   0          77s   10.244.2.0   exciting-wescoff-4170611030      <none>           <none>
nginx3-6b5495bfff-xt7d9    1/1     Running   0          77s   10.244.2.1   exciting-wescoff-4170611030      <none>           <none>
nginx4-7bdd687bb6-nzc8f    1/1     Running   0          77s   10.244.3.5   blissful-goldwasser-3014441860   <none>           <none>
nginx5-6b5d886fc7-6m57l    1/1     Running   0          77s   10.244.3.0   blissful-goldwasser-3014441860   <none>           <none>
nginx6-bd5d6b9fb-x6lkq     1/1     Running   0          77s   10.244.3.2   blissful-goldwasser-3014441860   <none>           <none>
nginx7-5559545b9f-xs5sm    1/1     Running   0          77s   10.244.3.4   blissful-goldwasser-3014441860   <none>           <none>
nginx8-66bb679c4-zndwz     1/1     Running   0          76s   10.244.3.1   blissful-goldwasser-3014441860   <none>           <none>
nginx9-6c47b869dd-nfds6    1/1     Running   0          76s   10.244.3.3   blissful-goldwasser-3014441860   <none>           <none>

github-actions bot removed the lifecycle/stale label on Aug 30, 2024
@coveralls commented Aug 30, 2024

Pull Request Test Coverage Report for Build 11357525644

Details

  • 21 of 31 (67.74%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.04%) to 80.872%

Changes Missing Coverage:
File                                               Covered Lines   Changed/Added Lines   %
pkg/controllers/provisioning/scheduling/queue.go   4               6                     66.67%
pkg/utils/pod/scheduling.go                        17              25                    68.0%

Totals:
Change from base Build 11332670114: -0.04%
Covered Lines: 8511
Relevant Lines: 10524

💛 - Coveralls

@njtran (Contributor) left a comment:

This isn't necessarily as clear-cut a change to me. Is there data you've generated that gives you confidence this doesn't have any adverse effects?

@@ -96,6 +97,15 @@ func byCPUAndMemoryDescending(pods []*v1.Pod) func(i int, j int) bool {
return true
}

// anti-affinity pods should be sorted before normal pods
if affinityCmp := pod.PodAffinityCmp(lhsPod, rhsPod); affinityCmp != 0 {
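
For readers of this hunk, the sketch below shows what a comparator like `PodAffinityCmp` could look like. It is an assumption for illustration only, not the PR's implementation in pkg/utils/pod/scheduling.go: it orders pods that carry required anti-affinity or topologySpreadConstraints ahead of unconstrained pods and otherwise returns 0 so the existing CPU/memory ordering applies.

```go
package example

import corev1 "k8s.io/api/core/v1"

// hasSchedulingConstraints reports whether a pod declares required pod
// anti-affinity or any topology spread constraints.
func hasSchedulingConstraints(p *corev1.Pod) bool {
	if len(p.Spec.TopologySpreadConstraints) > 0 {
		return true
	}
	return p.Spec.Affinity != nil &&
		p.Spec.Affinity.PodAntiAffinity != nil &&
		len(p.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution) > 0
}

// podAffinityCmp returns a negative value if lhs should sort before rhs,
// a positive value if it should sort after, and 0 when the constraint check
// cannot distinguish them (letting the CPU/memory comparison decide).
func podAffinityCmp(lhs, rhs *corev1.Pod) int {
	l, r := hasSchedulingConstraints(lhs), hasSchedulingConstraints(rhs)
	switch {
	case l && !r:
		return -1
	case !l && r:
		return 1
	default:
		return 0
	}
}
```

In the sort closure above, a non-zero result would short-circuit before the CPU and memory comparison, which matches the ordering the hunk introduces.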
Contributor:

This seems like the right move, but I'm not sure how this breaks down in our bin-packing algorithm. From what I understand, this just sorts pods with affinity + tsc before others with the same exact pod requests.

Contributor (Author):

Yes. After testing this approach (there is a small test case earlier in this thread), scheduling the mutually exclusive pods earlier in the queue helps produce a more balanced scheduling result.

Contributor:

With this approach, the cluster will be more stable (e.g., draining one node will not cause most pods to be rescheduled). I observed that Karpenter attempts to distribute the pods across all nodes (see the linked scheduler code).

Contributor:

cc @njtran @jonathan-innis, please take a look.

pkg/utils/pod/scheduling.go (outdated review thread, resolved)

github-actions bot commented Oct 3, 2024

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

github-actions bot added the lifecycle/stale label on Oct 3, 2024
github-actions bot removed the lifecycle/stale label on Oct 4, 2024
@njtran (Contributor) commented Oct 23, 2024

> scheduling the mutually exclusive pods further ahead helps to get a more balanced scheduling result

Can you share the data that led you to this conclusion? Without testing it myself, it's not clear to me how you arrived at it.

@helen-frank (Contributor, Author) commented Oct 24, 2024

Current Test Results:

❯ kubectl get nodeclaims
NAME            TYPE               CAPACITY   ZONE          NODE                             READY   AGE
default-8wq87   c-8x-amd64-linux   spot       test-zone-d   blissful-goldwasser-3014441860   True    67s
default-chvld   c-4x-amd64-linux   spot       test-zone-b   exciting-wescoff-4170611030      True    67s
default-kbr7n   c-2x-amd64-linux   spot       test-zone-d   vibrant-aryabhata-969189106      True    67s
❯ kubectl get pod -owide
NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE                             NOMINATED NODE   READINESS GATES
nginx1-67877d4f4d-nbmj7    1/1     Running   0          77s   10.244.1.0   vibrant-aryabhata-969189106      <none>           <none>
nginx10-6685645984-sjftg   1/1     Running   0          76s   10.244.2.2   exciting-wescoff-4170611030      <none>           <none>
nginx2-5f45bfcb5b-flrlw    1/1     Running   0          77s   10.244.2.0   exciting-wescoff-4170611030      <none>           <none>
nginx3-6b5495bfff-xt7d9    1/1     Running   0          77s   10.244.2.1   exciting-wescoff-4170611030      <none>           <none>
nginx4-7bdd687bb6-nzc8f    1/1     Running   0          77s   10.244.3.5   blissful-goldwasser-3014441860   <none>           <none>
nginx5-6b5d886fc7-6m57l    1/1     Running   0          77s   10.244.3.0   blissful-goldwasser-3014441860   <none>           <none>
nginx6-bd5d6b9fb-x6lkq     1/1     Running   0          77s   10.244.3.2   blissful-goldwasser-3014441860   <none>           <none>
nginx7-5559545b9f-xs5sm    1/1     Running   0          77s   10.244.3.4   blissful-goldwasser-3014441860   <none>           <none>
nginx8-66bb679c4-zndwz     1/1     Running   0          76s   10.244.3.1   blissful-goldwasser-3014441860   <none>           <none>
nginx9-6c47b869dd-nfds6    1/1     Running   0          76s   10.244.3.3   blissful-goldwasser-3014441860   <none>           <none>

@njtran This is the real scheduling result I got using kwok as the provider and creating 10 deployments (where pod1, pod9, and pod10 are mutually exclusive). You can see that the provisioned instance sizes are now more balanced than before: 8, 4, 2 instead of 16, 2, 2.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

node selection: One super large node with many small size nodes
5 participants