Decouple MaxPods from CNI Logic #1721
In the short term, we are planning to add these new instance types to our existing list so that the correct max-pods value is used.
This definitely helps and is something we support, but because the node cannot determine the CNI type/version, we haven't been able to figure this out dynamically. Instead, when using this feature, customers will need to specify the max-pods value themselves.
They do, but again, we don't have a way to figure that out on the node when determining max-pods.
You may be right here. Once we know we're no longer limited by ENI attachment, changing this behavior to key off of core count or memory size could make sense, but I'm wary of the effect it may have on existing customers and their expectations.
This new setting is an interesting option; however, it seems to me that it would need to be calculated by the customer based on their choice of CNI plug-in, delegated prefix size, and instance type, which leaves us in a similar state to where we are today but with a more abstracted setting to configure. I'd like to keep this issue open and updated while we continue to explore our options for making this work more cleanly for our customers. |
The main challenge of specifying max-pods manually remains. What are your thoughts about building this as an option? You could even support multiple options for max pods like you do today and just add another option. |
@ellistarn how come you want to use more than 110 pods on a node, when the official K8s guidance is a maximum of 110 pods per node? I was going to open a new issue to suggest that the default logic should be to use 110, with a pattern for running the AWS VPC CNI in legacy non-prefix (IPv4 or IPv6) mode to set the limit for instances which can't support 110 pods, but I thought it'd be best to continue the discussion here. We also need to talk about setting the kube-reserved values; these should be directly linked to the node resources and so need to be dynamically generated. The GKE docs are the best reference for this, and the AL2 AMI already uses this for its CPU calculation. I also opened awslabs/amazon-eks-ami#782 for the AL2 AMI but haven't had any responses yet. |
I'm open minded to this idea. @RiskyAdventure has been discussing with me the perf impact of context switching between so many processes, at least when under load. Historically, k8s-on-aws has offered much higher densities, so this would be backwards incompatible. We'd need to figure out a glide path, and probably want the same solution for EKSOptimizedAMI and Bottlerocket. |
I agree; this should be a consistent default across all of the EKS ecosystem and should align with K8s best practice (the current AL2 EKS defaults will cause actual issues in relatively common scenarios). |
Following up on this thread as I'm picking up work for Karpenter to enable a |
There's no specific reason; it's only that the implementation hasn't been picked up, not that there's been an intentional choice not to go that route. It sounds like the work here is:
I'm not clear on what should be done with the An updated approach for calculating
That would be straightforward and compatible with existing deployments, unless there's a strong case to be made for a different approach. @ellistarn, @stevehipwell - what do you think? |
@bcressey I think it's worth splitting out the actual behaviour requirements and then working back up from there to get the settings spec; I would list these as max-pods, kube-reserved memory and kube-reserved CPU. I think @bwagner5 might have some thoughts on this and has been testing some of the options.

Max Pods

For the basic max pods requirements, I would argue that a default of 110, in line with the official K8s guidance, is the right choice.
For the implementation, pods per core and ENI modes could use that number as a constraint if it was set, meaning that my suggested default of 110 would still act as an upper bound.

Kube Reserved Memory

The legacy ENI behaviour for kube-reserved memory is based on the ENI IP logic and is a function of pods rather than a function of system memory as found in other implementations; this wasn't updated for ENI prefix pod counts for AL2 as it appears to have been here. The GKE implementation (also used by AKS amongst others) would be my preferred logic, and it is what I manually specify when running EKS nodes. Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.
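For reference, here is a minimal bash sketch of the GKE-style tiered memory reservation mentioned above. The tier percentages (25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% of anything above 128 GiB) are taken from the public GKE/AKS documentation; everything else is illustrative and not Bottlerocket's implementation.

#!/usr/bin/env bash
# Illustrative GKE-style kube-reserved memory calculation, in MiB.
# Note: GKE also reserves a flat 255 MiB on machines with less than 1 GiB,
# which this sketch skips for brevity.
total_mib=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)

reserved=0
remaining=$total_mib
for spec in "4096:25" "4096:20" "8192:10" "114688:6" "999999999:2"; do
  tier=${spec%:*}; pct=${spec#*:}
  in_tier=$(( remaining < tier ? remaining : tier ))
  (( in_tier < 0 )) && in_tier=0
  reserved=$(( reserved + in_tier * pct / 100 ))
  remaining=$(( remaining - tier ))
done

echo "kube-reserved memory: ${reserved}Mi"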
Kube Reserved CPU

Everyone already uses the GKE logic for this, so it's also the legacy ENI logic. As above, Pod Overhead can be used in combination with this. This value needs to be configurable as a dynamic value to be able to use a single configuration across different node sizes with Karpenter.
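And a matching sketch of the GKE-style CPU reservation, expressed in millicores; the percentages (6% of the first core, 1% of the second, 0.5% of cores three and four, 0.25% of anything above four) are taken from the public GKE documentation, and the rest is illustrative.

#!/usr/bin/env bash
# Illustrative GKE-style kube-reserved CPU calculation, in millicores.
cores=$(nproc)

reserved_mcpu=60                                              # 6% of the first core
if (( cores >= 2 )); then reserved_mcpu=$(( reserved_mcpu + 10 )); fi  # 1% of the second core
if (( cores > 2 )); then
  extra=$(( cores > 4 ? 2 : cores - 2 ))
  reserved_mcpu=$(( reserved_mcpu + extra * 5 ))              # 0.5% of cores three and four
fi
if (( cores > 4 )); then
  reserved_mcpu=$(( reserved_mcpu + (cores - 4) * 25 / 10 ))  # 0.25% of each core above four
fi

echo "kube-reserved cpu: ${reserved_mcpu}m"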
Implementation

I think the following inputs and values would cover all of the use cases above.
My end-user requirements for this would be to be able to set my nodes to have a max-pods of 110 with kube-reserved calculated dynamically from the node's resources, for example:

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"
[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi" |
I think in general this all makes sense based on the use cases. In terms of the API surface, is there a reason that the calculation modes couldn't be nested under kube-reserved? Something like:

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
[settings.kubernetes.kube-reserved]
ephemeral-storage = "1Gi"
memory-calc-mode = "system-memory"
cpu-calc-mode = "system-cpu" |
@jonathan-innis the API design was to illustrate my points with the expectation that the actual design would be different. I agree that if possible nesting the values would be better. |
Sure, with that in mind, what you put sounds reasonable to me |
Missed this thread, but this all sounds pretty reasonable to me as well. There is one more case for the VPC CNI "Custom Networking" configuration, set up via the ENIConfig CRD, where you want nodes to have a different subnet/SG than pods; this alters the ENI-limited calcs since you lose a usable ENI for pods. I'd also hope that over time the ENI-limited max pods definition would become less relevant and the maximum number of pods a node can handle would be based on resource limits, in which case 110 is probably a reasonable upper bound. I have done some preliminary data gathering on the overhead difference for kube-reserved memory that is not based on ENI-limited max-pods here. I haven't run tests on these yet, and I think we'd need to do quite a bit of testing to make sure this wouldn't surprise users (or allow an escape hatch). |
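As a reference for the custom networking point above, here is a small sketch of the ENI-limited formula the thread keeps referring to (max pods = ENIs × (IPv4 addresses per ENI − 1) + 2, with one ENI lost to the node when custom networking is enabled); the m5.large numbers are taken from the EC2 instance type documentation and are only an example.

#!/usr/bin/env bash
# ENI-limited max-pods, with and without VPC CNI custom networking.
enis=3            # m5.large
ips_per_eni=10    # m5.large

default_max_pods=$(( enis * (ips_per_eni - 1) + 2 ))                 # 29
custom_networking_max_pods=$(( (enis - 1) * (ips_per_eni - 1) + 2 )) # 20

echo "default: ${default_max_pods}, custom networking: ${custom_networking_max_pods}"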
I think we can definitely decouple these two features:
In the case of On top of that, I don't see a reason to add a It feels like we can punt both the GKE calculation decision and the
|
@jonathan-innis , do you still think https://kubernetes.io/docs/concepts/containers/runtime-class/#pod-overhead may be a fruitful route? |
I think from a The other consideration is that AL2 and Ubuntu still will not support the GKE calculation for I don't want to lose the focus of the discussion, though, which is arguing that if there is configuration that is surfaced in Kubelet Configuration, Bottlerocket should also support it to make the experience as seamless as possible for users. |
@jonathan-innis
I think @bwagner5 highlighted the reason why you need this with the VPC CNI custom networking mode.
I don't think these two features are linked in this way, they have interactions but aren't mutually exclusive as I described in #1721 (comment). I see |
I don't think this is a reason to have a separate setting for this.
Can you explain how they aren't linked in this way? We are currently doing the kube-reserved calculation based on pods, so we should continue to base it on the effective max pods, i.e. the minimum of max-pods and pods-per-core × cores, as described in this equation (sketched below).
Completely agree with this |
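To make the calculation discussed above concrete, here is a sketch assuming the referenced equation is the pod-based rule used by the EKS AMIs (255 MiB + 11 MiB per pod) applied to min(max-pods, pods-per-core × cores); the input values are illustrative assumptions.

#!/usr/bin/env bash
# Effective pod capacity and a pod-based kube-reserved memory value.
max_pods=110
pods_per_core=10
cores=$(nproc)

per_core_limit=$(( pods_per_core * cores ))
effective_max_pods=$(( max_pods < per_core_limit ? max_pods : per_core_limit ))

kube_reserved_memory_mib=$(( 255 + 11 * effective_max_pods ))
echo "effective max pods: ${effective_max_pods}"
echo "kube-reserved memory: ${kube_reserved_memory_mib}Mi"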
@jonathan-innis I'm pretty sure I've heard this reason given for why the way AL2 currently works cannot be changed; if you can automate the discovery of the VPC CNI and its specific mode (legacy, custom networking, IPv4 prefix, or IPv6), then I agree that this isn't needed.
The GKE calculation doesn't use max pods in its algorithm, so adding the capability to set a max pods per core isn't related to using the GKE calculation. In my opinion the GKE algorithm should be the highest priority here, as it removes the requirement to couple max pods to kube reserved. I am biased, as it is the GKE kube reserved functionality that is currently blocking me: I don't want to have to figure out creating a custom configuration container to set kube reserved and then figure out how to get it to work with Karpenter. |
Has there been any more progress on this? Specifically towards an API design similar to the one in my comment above, copied here:

[settings.kubernetes]
"max-pods-mode" = "default"
"max-pods" = 110
"kube-reserved-memory-calc" = "system-memory"
"kube-reserved-cpu-calc" = "system-cpu"
[settings.kubernetes.kube-reserved]
"ephemeral-storage" = "1Gi" |
I see this has been removed from the backlog; is there a reason why? @jonathan-innis I don't suppose you could point me in the right direction to override kube-reserved at runtime? |
Sorry about the churn. We aren't using the |
@dmitmasy another one for your awareness and prioritization. |
@stmcginnis is there an example of modifying the settings from within a bootstrap container which I could use as the basis for a temporary solution? |
Not that I know of. I think |
@stmcginnis I hardcode max pods to 110, and then from a bootstrap container I set kube-reserved with something like:

apiclient set \
kubernetes.kube-reserved.cpu="${kube_reserved_cpu}m" \
kubernetes.kube-reserved.memory="${kube_reserved_memory}Mi"

As an aside, and I might not be looking hard enough, but I expected to find a container in the public ECR which would allow me to bootstrap Bottlerocket directly from user data. |
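A minimal sketch of how the two variables in the snippet above might be derived inside the bootstrap container; the formulas are assumptions (memory uses the pod-based 255 MiB + 11 MiB per pod rule with max pods pinned to 110, and CPU uses a flat 80 millicores as a stand-in for a proper tiered calculation), not an official recommendation.

#!/usr/bin/env bash
# Derive kube-reserved values and apply them with apiclient.
max_pods=110
kube_reserved_memory=$(( 255 + 11 * max_pods ))   # MiB
kube_reserved_cpu=80                              # millicores

apiclient set \
  kubernetes.kube-reserved.cpu="${kube_reserved_cpu}m" \
  kubernetes.kube-reserved.memory="${kube_reserved_memory}Mi"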
You could set those values as well. Typically these settings are provided at boot time in the user data, so there isn't a need to have a separate bootstrap container to make these changes. |
@stmcginnis that's what I do for ASGs as the defaults are wrong enough to brick nodes. But with Karpenter the userdata needs to be generic as the instance can be any size. This issue tracks the actual details, but I'm currently trying to POC dynamically setting these as I can do for AL2. |
Yeah, in that case unfortunately you will have to go with the container route to call apiclient. |
I think #2010 is relevant here. |
@stevehipwell did you find a way to dynamically set cpu / memory for kube-reserved? |
@olfway I've been waiting for Karpenter to support alternative kube-reserved calculations. But AFAIK there is still no official ECR image with the correct entrypoint for this. |
Sorry for the silly question, but just to make sure I understand correctly: does this issue mean that Bottlerocket with prefix delegation is not handling max-pods correctly? I'm using the following EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: Bottlerocket
amiSelectorTerms:
- alias: bottlerocket@latest
kubelet:
maxPods: 110
blockDeviceMappings:
# Root device
- deviceName: /dev/xvda
ebs:
volumeSize: 4Gi
volumeType: gp3
encrypted: true
# Data device: Container resources such as images and logs
- deviceName: /dev/xvdb
ebs:
volumeSize: 20Gi
volumeType: gp3
encrypted: true
role: eks-management-karpenter-node-role
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: management
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: management
tags:
karpenter.sh/discovery: management

This way nodes are created with a huge amount of system-reserved memory (40%) and they get packed with Pods, which leads to Pod evictions. |
@irizzant unless something has changed, the reserved values are tied to the arbitrary number of pods AWS says can run on a specific node type, and this uses the pre-prefix logic where the instance type ENI count limits pod density. This also ignores the actual user-defined max pods if set.

This logic is incorrect for a number of reasons:
- Not respecting the user-defined max pods makes zero sense.
- Prefix mode makes the pod density limitations disappear.
- K8s officially recommends a maximum of 110 pods per node, and as all nodes under prefix mode can support 110 pods dynamically, this should be the default.
- The per-pod calculation was run on AL2 using Docker as the CRI a number of years ago, so it should be recalculated for Bottlerocket.

My recommendation would be to always explicitly set max pods and kube reserved. I make the reserved values a function of max pods and have run tests using pods with various container counts to come up with a value. |
@stevehipwell thanks for the reply, but I'm confused. The EKS AMI also has its own logic: when I set max pods = 110 I expect reserved memory to be around 1.2 Gi following the EKS AMI logic, which is a lot for small nodes. Are you saying that when I set max pods = 110 the Karpenter logic is overridden by the EKS AMI? |
@irizzant I was specifically referring to the AMI logic, not the Karpenter logic. Last time I dug into the details Karpenter was mirroring this behaviour, but I've switched approach and haven't been following closely for a few months. As Karpenter uses the fleet API I'm not sure it can set dynamic reserved values by instance type as it doesn't know the actual types in advance. I think any calculations it's doing are to mirror the AMI defaults. When I'm back on a reasonable size screen I'll take a look at the logic in detail. |
@stevehipwell thanks, I've checked myself and it looks like you're right about the AMI logic being the same as Karpenter's; the formula it uses is again too much for small nodes. Also, I may be wrong, but I don't remember experiencing issues with Pod density before upgrading to Karpenter v1. Maybe this issue is related? aws/karpenter-provider-aws#6890 I'm currently trying to test my above-mentioned configuration. |
@stevehipwell I confirm that this is not related to Karpenter v1; the previous version, v0.37, gives the same result. Given the absence of a way to set dynamic reserved values by instance type, I suppose the only way to proceed is to set the kube-reserved values manually and adjust the NodeClass to be big enough to tolerate the chosen values, right? |
@stevehipwell thanks for the updates. As I wrote in awslabs/amazon-eks-ami#1145 (comment), I still don't understand how to pass dynamic values in Karpenter NodeClasses with bootstrap commands. |
@irizzant I'm not sure if I fully understand your question, but to implement this via Karpenter you'd need a command that can do the dynamic work and then pass that in via user data. |
@stevehipwell let me try to explain. The bootstrap commands doc shows you can use apiclient commands. Is it possible to query the instance data (CPU, memory, ...) using apiclient? If yes, then I'd pass the calculation to the Karpenter NodeClass using user data and we're good. Otherwise can you please detail how to proceed? |
@irizzant that is something that I think you'd need to speak to the maintainers about. I'm not sure what data the |
What I'd like:
I ran into an issue when using https://github.com/awslabs/karpenter and Bottlerocket with the new c6i instance types (see: #1720). Karpenter simulates pod scheduling and provisions instances by discovering instance types from EC2 and binpacking pods onto the node. It uses a formula (14 * AttachableENICount) to compute the max pods value. It also binds the pods before the node comes online as an optimization. If Bottlerocket is unaware of a new instance type, it will default to MaxPods of 110, which is woefully short of the actual number of pods that can be scheduled using the AWS VPC CNI. This causes Karpenter's scheduling logic to disagree with the Node's MaxPods, and the Kubelet reports OutOfPods errors for the bound pods.
It's challenging for Bottlerocket to come up with a good default here (i.e. 110), since you have no idea whether the unknown instance type is large or small. Calling EC2 Describe Instances on OS boot would introduce significant call volume during large scale-ups, as well as increase startup latency.
I see a decent path forward:
I'd love to see an option in Bottlerocket to assume that the CNI is not the limiting factor for max pods, and instead compute max pods based off of cpu/memory of the instance (e.g.
max-pods-per-core: 10
), which can cheaply be discovered at boot.

Any alternatives you've considered:
Looking forward to discussing this path or any others you folks identify.
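A tiny sketch of the fallback suggested above: when the instance type isn't in the built-in list, derive max pods from resources discovered on the node instead of falling back to a hardcoded 110. The pods-per-core value of 10 comes from the example in this issue; the rest is illustrative.

#!/usr/bin/env bash
# Compute max pods from the core count discovered at boot.
pods_per_core=10
cores=$(nproc)
max_pods=$(( pods_per_core * cores ))
echo "computed max-pods: ${max_pods}"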