Consolidation ttl: spec.disruption.consolidateAfter #735

Comments
We've talked about this a fair bit -- I think it should be combined/collapsed with ttlSecondsAfterEmpty.
The challenge with this issue is more technical than anything. Computing ttlSecondsAfterEmpty is cheap, since empty nodes are cheap to identify. Determining whether a node is consolidatable, however, requires a scheduling simulation across the rest of the cluster, and computing that for every node is really expensive. We could potentially compute it once on the initial scan, and again once the TTL is about to expire. However, this can lead to weird scenarios like:
The only way to make the semantics technically correct is to recompute consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible approach (equivalent to the current calculations), but it has weird edge cases. Would you be willing to accept those tradeoffs?
I'm a little unclear on this, and I think it's in how I'm reading it, not in what you've said. What I think I'm reading is that running the consolidatability check on every single pod creation/deletion is too expensive. As an alternative, the algorithm above is acceptable, but in some cases it could result in node consolidation in less than TTLSecondsAfterConsolidatable due to fluctuation in cluster capacity between the initial check (t0) and the confirmation check (t0+30s in the example). Have I understood correctly?
Yeah, exactly. Essentially, the TTL wouldn't flip-flop perfectly. We'd be taking a rough sample (rather than a perfect sample) of the data.
Thanks for the clarity. For my usage I wouldn't be concerned about the roughness of the sample; as long as there was a configurable time frame and the confirmation check needed to pass both times, I'd be satisfied. What I thought I wanted, before being directed to this issue, was to be able to specify how the consolidator is configured, a bit like the descheduler project, because I'm not really sure the 'if it fits, it sits' approach to scheduling is what I need in all cases.
Specifically, what behavior of the descheduler did you want?
Generally I was looking for something like the
To give another example of this need: I have a cluster that runs around 1500 pods, with lots of pods coming and going at any given moment. It would be great to be able to specify a consolidation cooldown period so that we are not constantly adding and removing nodes. Cluster Autoscaler has the flag
Is this feature available yet?
We are facing the same issue with high node rotation due to overly aggressive consolidation. It would be nice to tune and control the behaviour: for example, a minimum node TTL/liveness, a TTL threshold since a node became empty or underutilised, or merging nodes.
cluster-autoscaler has other options too like:
I'm looking forward to something like
Another couple of situations that currently lead to high node churn are:
In both situations above, we end up with some workloads being restarted multiple times within a short time frame due to node churn, and if not enough replicas are configured with sufficient anti-affinity/skew, there is a chance of downtime while pods become ready again on new nodes. It would be nice to be able to control the consolidation period, say every 24 hours or every week as described by the OP, so it's less disruptive. Karpenter is doing the right thing, though! I suspect some workarounds could be:
Any other ideas or suggestions appreciated.
Adding here as another use case where we need better controls over consolidation, especially around utilization. For us, there's a trade-off between utilization efficiency and the disruption caused by pod evictions. For instance, say I have 3 nodes, each utilized at 60%; the current behavior is that Karpenter will consolidate down to 2 nodes at 90% capacity. But in some cases, evicting the pods on the node to be removed is more harmful than achieving optimal utilization. It's not that these pods can't be evicted (for that we have the do-not-drain annotation), it's just that it's not ideal. A good example would be Spark executor pods: while they can recover from a restart, it's better to let them finish their work at the expense of some temporary inefficiency in node utilization. CAS has the
@thelabdude can't your pods set
I'll have to think about whether the termination grace period could help us, but I wouldn't know what value to set, and it would probably vary by workload. My point was more that I'd like better control over the consolidation decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory or CPU) may be acceptable in the short term, even if the ideal case is to drain the pods off that node and reschedule them on other nodes. Cluster Autoscaler provides this threshold setting and doesn't require special termination settings on the pods. I'm not saying a utilization threshold is the right answer for Karpenter, but the current situation makes it hard to use in practice: we get too much pod churn due to consolidation, and our nodes are never empty, so turning consolidation off isn't a solution either.
Hey @thelabdude, this is a good callout of the core differences between CA's deprovisioning and Karpenter's deprovisioning. Karpenter has intentionally chosen not to use a threshold: for any threshold you pick, the heterogeneous nature of pod resource requests creates unwanted edge cases that constantly need to be fine-tuned. For more info, ConsolidationTTL here would simply act as a waiting mechanism between consolidation actions, which you can read more about here. Since this would essentially just be a wait, it would simply slow down the time Karpenter takes to get to the end state you've described. One idea that may help is if Karpenter allowed some configuration of the cost-benefit analysis that Consolidation does. This would need to be framed as either cost or utilization, both tough to get right. If you're able to in the meantime, you can set
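The specific setting being suggested isn't shown above, so the following is only an assumption about the kind of stop-gap meant: Karpenter exposes a pod-level karpenter.sh/do-not-disrupt annotation that blocks voluntary disruption of the node hosting that pod. A minimal sketch for a restart-expensive pod such as a Spark executor (pod name and image are illustrative):

```yaml
# Sketch only: shield a restart-expensive pod from voluntary disruption.
# Pod/image names are hypothetical; the annotation is the one Karpenter documents.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example              # hypothetical name
  annotations:
    karpenter.sh/do-not-disrupt: "true"     # Karpenter will not voluntarily disrupt the node hosting this pod
spec:
  containers:
    - name: executor
      image: registry.example.com/spark-executor:latest   # placeholder image
```

Note this is all-or-nothing protection for the node, not the utilization threshold being asked for.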
Are there any plans to implement or accept a feature that adds some sort of time delay between node provisioning and consolidation, perhaps based on the age of a node? The main advantage would be increased stability during surges in workload (scaling, scheduling, or rollouts).
Hey, could you just add a delay before starting consolidation after pod changes? You could add several delays:
This would help consolidation run during periods of low activity on the cluster.
Also see issue #696: Exponential decay for cluster desired size.
This comment suggests another approach we might consider.
(from #735) Elsewhere in Kubernetes, ReplicaSets can pay attention to a Pod deletion cost. For Karpenter, we could have a Machine- or Node-level deletion cost, and possibly a contrib controller that raises that cost based on what is running there. Imagine a controller that detects when Pods are bound to a Node and updates the node deletion cost based on some quality of the Pod. For example: if you have a Pod annotated as
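For context, the pod-level mechanism referenced here is the existing controller.kubernetes.io/pod-deletion-cost annotation; the node-level cost is only an idea, so the second object below is a purely hypothetical sketch of what such a contrib controller might write (the annotation key is invented for illustration):

```yaml
# Existing Kubernetes behavior: ReplicaSets prefer to scale down pods with a lower deletion cost.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                                   # hypothetical pod
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "1000" # real, documented annotation
---
# Hypothetical analog for Karpenter: this annotation does not exist today.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.example.internal                   # hypothetical node
  annotations:
    karpenter.sh/node-deletion-cost: "1000"            # invented key a contrib controller could maintain
```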
It seems the solution was already implemented in this PR, but it hasn't been merged yet 😞
Ah, I see! Thanks for linking me to it :) Glad this is now just pending a merge, which will hopefully come soon!
Hey all, after some more design discussion, we've decided not to go with the approach initially taken in
Alternatively, the approach we're now running with for v1 is using
Let us know what you think of this approach.
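The inline references in the comment above are missing, but based on the v1 API that eventually shipped, consolidateAfter lives under spec.disruption on the NodePool. A minimal sketch with illustrative values (other required NodePool fields such as template and nodeClassRef are omitted):

```yaml
# Sketch of the v1 disruption block being discussed; values are illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # consider non-empty nodes as well
    consolidateAfter: 1h                           # wait this long after pod churn on a node before consolidating it
```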
@njtran please clarify: does it mean that starting with
Is there any ETA for a Karpenter v1 API release with which the above config can actually be used? Thank you.
@vainkop I mean that when this is implemented, if you set
Are there any plans to support consolidation constraints that do not depend on pod scheduling at all, and instead depend on total node lifetime? I think relying solely on time since pod scheduling occurred could result in these issues, and may not meet our needs as cluster administrators.
I support this, and I think having a consolidation control based on pod scheduling is a good idea, but it may fall short of meeting all the use cases that are asking for better consolidation control. Perhaps additional fields that limit consolidation would cover all use cases better. Something like this is what we'd likely want to use if each of these fields were available.
Totally agree here. I think a minimum age would be better solved through the
On these two configurations, I wonder if this would be better prescribed with some sort of node headroom or static capacity. This is really to combat the case where a set of job pods goes down and you want to reserve the capacity for some period of time because you know the pods will come back soon?
We've discussed a
Are you suggesting that I write a controller to set the do-not-disrupt annotation based on node age myself, or that Karpenter would support this through use of the do-not-disrupt annotation under the hood as an implementation option?
Sure, but are we considering the case where I want to block consolidation disruption for nodes less than 6h old, but drift nodes immediately regardless of age? Implementing this as an isolated consolidation control may be the best way forward.
I was, of course, more interested in the new functionality solving the issue we're discussing rather than something that presents "the same behavior as before this was implemented"...
This would be a feature supported natively through Karpenter: I've linked the issue here, #752.
Makes sense here; we sort of follow this with our 15s validation period between finding and executing consolidation. We could consider making this configurable in the future too.
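Not the minimum-age control being asked for, but one related lever that does exist in the v1 API is reason-scoped disruption budgets, which can throttle consolidation separately from drift (they cannot express a per-node minimum age). A sketch with illustrative values:

```yaml
# Sketch of reason-scoped disruption budgets (illustrative values; other required NodePool fields omitted).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      - reasons: ["Underutilized", "Empty"]
        nodes: "10%"     # at most 10% of nodes may be disrupted at once by consolidation
      - reasons: ["Drifted"]
        nodes: "100%"    # drift replacements remain unthrottled
```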
It may be worth going forward with the proposed approach to see how it holds up in the real world. At first glance, it seems like consolidation based on pod scheduling may not yield the desired results, at least in our use case. The trigger to delete nodes appears to be more of a "recentlyUtilized" condition (as opposed to
Here's one scenario:
I would expect that at some point the cluster will reach some state of equilibrium. It's unclear whether there may be runaway situations where one could end up with several under-utilised nodes. Similar to what @Shadowssong mentioned, we've found that this sort of configuration may work as a tactical approach, especially in nonprod clusters:
It reduces node churn by allowing a short window for Karpenter to run its consolidation cycle. This, in turn, has reduced the amount of disruption over the past week or so, with (spot) nodes now showing significantly longer lifetimes/ages.
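The actual configuration referenced above isn't shown; as an assumption, such a tactical "short consolidation window" could be expressed in the v1 API with a schedule-scoped disruption budget that blocks voluntary disruption outside a daily window:

```yaml
# Assumed sketch only (the commenter's real configuration is not shown); illustrative values.
# The zero-node budget is active for 23h each day, leaving a 1h window (00:00-01:00 UTC)
# in which Karpenter may consolidate.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "0"              # block voluntary disruption...
        schedule: "0 1 * * *"   # ...starting at 01:00 UTC every day
        duration: 23h           # ...for the following 23 hours
```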
Totally makes sense. I imagine you would tune your
Another point worth noting for the approach mentioned in #735 (comment): Karpenter prioritizes consolidating the nodes that have the fewest pods scheduled. Compare this to the default plugin for kube-scheduler, which schedules pods onto the nodes that are least allocated. This means Karpenter's heuristic is generally at odds with the kube-scheduler default, which keeps scheduling pods onto exactly the nodes Karpenter wants to disrupt. If the kube-scheduler was configured to
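The sentence above is cut off; presumably it refers to a bin-packing scoring strategy. As an assumption about the intended suggestion, kube-scheduler can be configured to score more-allocated nodes higher via the NodeResourcesFit plugin, which lines up better with Karpenter's consolidation heuristic:

```yaml
# Sketch: configure kube-scheduler to bin-pack (MostAllocated) instead of the default LeastAllocated spreading.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```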
Implementing a maintenance window is undesirable for us because we do not want to create special periods of time where disruption can occur versus not. This makes on-call awkward, increases the complexity of the information we have to communicate to users of our clusters, and requires unique configuration of our NodePools for each region we operate in. I'd also like to note that we need to be able to configure consolidation controls separately per consolidation type in some cases. We've currently implemented a patch to disable single-node consolidation altogether after finding it wasn't providing much value in return for the large amount of disruption to our clusters. Given there's a tangible $$$ cost to restarting applications in some cases, it's entirely possible that single-node consolidation wastes more money than it saves with a naive implementation. Since single-node consolidation exists solely to replace one EC2 instance with a cheaper variant, having a way to control the threshold at which Karpenter will decide to replace the node would be wonderful (e.g., only replace if it reduces the cost of the node by >15%).
+1 for a price improvement threshold. I think this is orthogonal to consolidateAfter, though. Do you mind cutting a new issue for this and referencing https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md#2-price-improvement-factor?
Done.
Is there a way to have this implemented/released sooner than v1? I think it would give everyone a way to handle the aggressiveness of scale-down a little better.
@miadabrin I think the urgency of this issue is evident from the lively discussion. Development for the issue is ongoing at #1218 and the discussion seems semi-active, but I'm unsure how close this PR is to completion.
@samox73 The maintainers have stated that the PR you linked has been abandoned in favor of a new implementation of consolidateAfter.
I second that; it won't be a breaking change per se, because it was disallowed to set
With the new Karpenter version 1.0.1, from this blog post:
This gives a better solution.
I think this can be closed, as it's fixed in Karpenter 1.0.0.
@joewragg It was closed on the 31st of July.
Tell us about your request
We have a cluster with a lot of cron jobs that run every 5 minutes...
This means we have 5 nodes for our base workloads, and every 5 minutes we get additional nodes for 2-3 minutes, which are then scaled down or consolidated with existing nodes.
This leads to a constant flow of nodes joining and leaving the cluster. It looks like the Docker image pulls and node initialization create more network traffic fees than the cost savings from not having the instances running all the time.
It would be great if we could configure some consolidation period, maybe together with ttlSecondsAfterEmpty, which would only clean up or consolidate nodes if the capacity had been idle for x amount of time.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Creating a special provisioner is quite time-consuming because all app deployments have to be changed to leverage it...
Are you currently working around this issue?
We are thinking about putting the cron jobs into a special provisioner that would not use consolidation but instead the ttlSecondsAfterEmpty feature.
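A sketch of that workaround on the legacy v1alpha5 Provisioner API this issue was written against (the name, label, and TTL value are illustrative; the cron job pods would need a matching nodeSelector and toleration):

```yaml
# Sketch of the described workaround: a dedicated Provisioner with consolidation
# disabled, relying on ttlSecondsAfterEmpty instead. Names/values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: cronjobs                 # hypothetical name
spec:
  consolidation:
    enabled: false               # do not consolidate partially used nodes
  ttlSecondsAfterEmpty: 600      # remove nodes only after 10 minutes of being empty
  labels:
    workload-type: cronjobs      # hypothetical label for the cron job pods to select
  taints:
    - key: workload-type
      value: cronjobs
      effect: NoSchedule         # keep other workloads off these nodes
```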