Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add consolidation policies rfc #1433

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions designs/consolidation-policies.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Consolidation Policies

## Problem Statement

Karpenter provides 3 forms of consolidation: single-node, multi-node, and emptiness consolidation. Single-node replaces expensive nodes with cheaper nodes that satisfy the workloads requirements. Multi-node replaces multiple nodes with a single, larger node that can satisfy workload requirements. Emptiness removes nodes that have no workloads that need to be relocated to reduce costs.

Customers want to have control over the types of consolidation that can occur within a nodepool. Some nodepools may only be used for job type workloads (crons, jobs, spark, etc) where single-node consolidation is dangerous and wastes execution when it disrupts pods. Other nodepools may be used mostly by long-running daemons, where daemonset costs are more significant and multi-node consolidation to binpack into larger nodes would be preferred.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, multi and single node are just collapsed into the same type of consolidation, where users don't need to think about the fact that there are two concepts for consolidation.


Karpenter's consolidation considers the price of instances when deciding to consolidate, but does not consider the cost of disruption. This can harm cluster stability, if the price improvement of the node is small, or the workloads disrupted are very costly to rebuild to their previous running state (e.g. long-running job that must restart from scratch).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion I'd make is that we estimate the cost of disruption, and there are some (albeit lesser used) knobs that a user can use to tweak this in their favor.


Karpenter's consolidation is an all or nothing offering today. Nodepools can disable consolidation and let only empty nodes be removed, or customers can opt into all forms of consolidation.

## Design

### Option 1: Add price thresholds, merge consolidation controllers
The motivation for the additional controls is the high cost of disruption of certain workloads for small or negative cost savings. So instead of offering controls, the proposal is to add price improvement thresholds for consolidation, to avoid this scenario.

The proposal is to find a price improvement threshold which accurate represents the "cost" of disrupting the workloads. This threshold could be an arbitrary-fixed value (e.g. 20% price improvement), or heuristically computed by Karpenter based on the cluster shape (e.g. nodepool ttl's might be a heuristic input).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what do you think a reasonable default could be for this?


The separation of the single-node and multi-node consolidation is arbitrary. The single-node consolidation finds the "least costly" eligible node to disrupt, and uses that as a solution, even if it's a poor solution. However, the least-costly eligible node to disrupt is not necessarily the most cost-savings node to disrupt, but the single-node consolidation stops after it has any solution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the single-node consolidation stops after it has any solution.

This is the main difference from multi-node today. Multi-node doesn't (and can't) go through all the potential options that single-node can, so it finds edge cases of cost savings that multi-node doesn't


In contrast, the multi-node consolidation spends as much time as possible to find a solution. So while price thresholds could be implemented into each consolidation controller, it is simpler to consider merging their responsabilities into one shared controller based on multi-node consolidation. This combined consolidation controller could search as long as possible for the most effective consolidation outcome, from emptiness, multi-node or single-node replacements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-node has a timeout of 1 minute: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/multinodeconsolidation.go#L35, single node has 3 minutes.

Suggested change
In contrast, the multi-node consolidation spends as much time as possible to find a solution. So while price thresholds could be implemented into each consolidation controller, it is simpler to consider merging their responsabilities into one shared controller based on multi-node consolidation. This combined consolidation controller could search as long as possible for the most effective consolidation outcome, from emptiness, multi-node or single-node replacements.
So while price thresholds could be implemented into each consolidation controller, it is simpler to consider merging their responsabilities into one shared controller based on multi-node consolidation. This combined consolidation controller could search as long as possible for the most effective consolidation outcome, from emptiness, multi-node or single-node replacements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combined consolidation controller could search as long as possible for the most effective consolidation outcome

We've seen this break down at larger clusters, with complex scheduling constraints, for instance preferred anti affinity, where Karpenter has to try to schedule pods multiple times, once with the constraints treated as a hard-constraint, and once as a soft-constraint. We may be able to achieve a performance benefit here through hardening our story on preferences though.


#### Pros/Cons
* 👍👍 No new configuration for the `NodePool` crd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no config for the threshold, I imagine there's a default you're picking on behalf of the user?

* ~ More complex implementation for consolidation, but easier to understand (only 1 controller for consolidation).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this is ambiguous. Changing the algo in drastic ways can have unexpected adverse effects for a subset of users.

* 👎 It's possible a one-size fits all price threshold will satisfy some customers and not others
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we can take this approach as an alpha feature, and then include config if one-size truly doesn't work out

* 👎 High amount of engineering effort to implement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as the same as point 2


### Option 2: Refactor and Expand Consolidation Policies
Another approach would be to not make significant changes to consolidation, but expose controls to allow enabling or disabling various consolidation controllers. Karpenter already presents the binary consolidation options. The `NodePool` resource has a disruption field that exposes the `consolidationPolicy` field. This field is an enum with two supported values: `WhenEmpty` and `WhenUnderutilized`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly updated since

Suggested change
Another approach would be to not make significant changes to consolidation, but expose controls to allow enabling or disabling various consolidation controllers. Karpenter already presents the binary consolidation options. The `NodePool` resource has a disruption field that exposes the `consolidationPolicy` field. This field is an enum with two supported values: `WhenEmpty` and `WhenUnderutilized`.
Another approach would be to not make significant changes to consolidation, but expose controls to allow enabling or disabling various consolidation controllers. Karpenter already presents the binary consolidation options. The `NodePool` resource has a disruption field that exposes the `consolidationPolicy` field. This field is an enum with two supported values: `WhenEmpty` and `WhenEmptyOrUnderutilized`.


The proposal is to add two new consolidation policy enum values:

* `WhenCheaper`
* `WhenUnderutilizedOrCheaper`

And change the semantics of the existing consolidation policy enum value:

* `WhenUnderutilized`
Comment on lines +38 to +40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the semantic of the existing consolidation policy to resolve to something different is a breaking change, and could definitely surprise people who upgrade without changing their values to the right one. If we're going this route, I'd prefer we keep the existing enum val to mean the same thing, and add another one to portray the mode you're proposing


The semantics of each of the enum values matches their names, making it straightforward to explain to customers or users. `WhenUnderutilizedOrCheaper` becomes the new default. `WhenUnderutilizedOrCheaper` is the same as `WhenUnderutilized` previously. `WhenUnderutilized` semantics changes to only allow emptiness or multi-node consolidation to occur for a nodepool. `WhenCheaper` only allows emptiness or single-node consolidation to occur for a nodepool.

#### Pros/Cons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another con: more effort to understand the three different modes for users

* 👍 Simple configuration for the `NodePool` crd
* 👍 Simple implementation for the various consolidation controllers.
* 👎 If Karpenter adds consolidation modes with new purposes, more `consolidationPolicy` enum values will be needed to make them toggleable.

### Option 3: Add Consolidation Configuration to NodePools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if we can do this in a backwards compatible way that wouldn't require a v2 to the API.

Karpenter's `NodePool` resource could expose fine-grained consolidation controls to allow configuring each controller, as well as controller-specific subfields (e.g multi-node consolidation's `maxParallel` setting).

A possible `NodePool` partial resource might look like this:
```yaml
spec:
disruption:
consolidation:
singleNode:
enabled: true
multiNode:
enabled: true
maxParallel: 100
emptiness:
enabled: true
emptyThreshold: "30%"
```

#### Pros/Cons
* 👍 Fine-grained controls for the various consolidation controllers.
* 👎 Complex configuration for the `NodePool` crd
* 👎 Complex implementation for the various consolidation controllers.
* 👎 Limits Karpenter's ability to change / remove existing consolidation controllers, as config will be tightly coupled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty large downside for me


Loading