Support handling custom kube reserved values #1803
Comments
CC @bwagner5
We'll be setting ephemeral storage accurately soon. But agreed on Karpenter respecting a custom kube-reserved and, in general, overhead values for nodes.
Would you take a PR to use the GKE logic when AWS_ENI_LIMITED_POD_DENSITY=false?
We're going to respect custom kube-reserved values, and we might just do it as part of a related change. Changing the memoryReserved logic might surprise our other users, so I'm hesitant to make that change.
@stevehipwell I guess respecting a static value wouldn't help you much since you still need it to be a dynamic calc?
@suket22 having it available as part of
I'm yet to see any documentation as to how the value was calculated, and the goalposts have moved since then, so I'm not sure why you're sticking with it as a calculation? Karpenter is pre-v1, so each release is technically a breaking change. But new logic could be added under a new env variable to act as a pair to AWS_ENI_LIMITED_POD_DENSITY.

@bwagner5 doesn't Karpenter know the instance type when it calculates the reserved values? I think it currently uses the number of CPUs.
Yes, Karpenter knows the instance type and all specs of it, but it would need to be a calculation within the controller.
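For context, a rough sketch of the pods-based memory reservation applied by the EKS optimised AMI's bootstrap script (and, as described in this thread, mirrored by Karpenter's current overhead logic), assuming maxPods is derived from the instance type's ENI limits; the function name and example pod counts are illustrative.

```go
package main

import "fmt"

// eksMemoryToReserveMiB follows the EKS optimised AMI's pods-based formula:
// 255 MiB plus 11 MiB per pod, with the pod count derived from ENI limits.
func eksMemoryToReserveMiB(maxPods int) int {
	return 11*maxPods + 255
}

func main() {
	// ENI-limited pod density on an m5.large allows 29 pods.
	fmt.Println(eksMemoryToReserveMiB(29)) // 574 MiB
	// With prefix delegation and the 110-pod large-cluster limit,
	// the same formula reserves considerably more.
	fmt.Println(eksMemoryToReserveMiB(110)) // 1465 MiB
}
```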
@bwagner5 you're absolutely right! So I need the alternative algorithm and to make it match my user data.
Good point on a static value in the provisioner not helping here. But does a calculation inside Karpenter help either? It sounds to me that the right place for this alternative calculation is within the EKS AMI itself. In the short term, I think the right approach for Karpenter is probably letting you set these values within a UserData script.
@suket22 if Karpenter can calculate the CPU reservation from the instance's CPUs, then it can calculate the memory reservation too, as it's a function of instance memory. No matter what calculation is used, you're going to need either a deterministic calculation based on CPUs & memory that can be run on both sides, or a way to pass the calculated values back from the instance.
@bwagner5 where have we got to with this? Not being able to configure our nodes to use the GKE calculation for kube-reserved memory is blocking our GA adoption of Karpenter.
After seeing the new spec for AKS reservations I've realised that this implementation should be flexible. I've got a proposed API for handling kube-reserved configuration as part of the provisioner spec.

Current

Only supports a single node size per pool.

kubelet:
  kubeReserved:
    cpu: 200m
    memory: 100Mi

CPU Core Based

Supports the current EKS algorithm.

kubelet:
  kubeReserved:
    cpuCalc:
      reserved: ["600m", "100m", "50m", "50m", "25m"]

Memory Pod Based

Supports the current EKS algorithm.

kubelet:
  kubeReserved:
    memoryCalc:
      mode: pods
      multiple: 11m
      constant: 255m
      max:
      min:

New AKS pattern.

kubelet:
  kubeReserved:
    memoryCalc:
      mode: pods
      multiple: 20m
      constant: 50m
      max: 25%
      min:

Memory System Memory Based

Supports the GKE algorithm (example shows up to 16GB memory).

kubelet:
  kubeReserved:
    memoryCalc:
      mode: memory
      reserved: ["256Mi", "256Mi", "256Mi", "256Mi", "205Mi", "205Mi", "205Mi", "205Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi", "102Mi"]
      max:
      min: 255Mi

NOTE - I updated this comment to fix the API to avoid ambiguous types.
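For illustration only, one possible reading of the proposed memoryCalc fields, assuming reserved is a per-GiB array, pods mode means multiple * maxPods + constant, and min/max clamp the result in MiB (the percentage max in the AKS example isn't modelled); the Go types and names below are hypothetical and not part of the proposal.

```go
package main

import "fmt"

// MemoryCalc is a hypothetical Go shape for the proposed kubeReserved.memoryCalc
// block; the field semantics below are assumptions, not the author's definition.
type MemoryCalc struct {
	Mode     string    // "pods" or "memory"
	Multiple float64   // MiB reserved per pod (pods mode)
	Constant float64   // MiB base reservation (pods mode)
	Reserved []float64 // MiB reserved per GiB of instance memory (memory mode)
	Min, Max float64   // clamp bounds in MiB; zero means unset
}

// ReserveMiB computes kube-reserved memory for an instance with the given
// max pod count and memory size.
func (c MemoryCalc) ReserveMiB(maxPods, memoryGiB int) float64 {
	var reserved float64
	switch c.Mode {
	case "pods":
		reserved = c.Multiple*float64(maxPods) + c.Constant
	case "memory":
		for gib := 0; gib < memoryGiB && len(c.Reserved) > 0; gib++ {
			idx := gib
			if idx >= len(c.Reserved) {
				idx = len(c.Reserved) - 1 // assume the last entry repeats
			}
			reserved += c.Reserved[idx]
		}
	}
	if c.Min > 0 && reserved < c.Min {
		reserved = c.Min
	}
	if c.Max > 0 && reserved > c.Max {
		reserved = c.Max
	}
	return reserved
}

func main() {
	// The GKE-style example above: 256Mi/GiB for the first 4 GiB,
	// 205Mi/GiB for the next 4, 102Mi/GiB for the next 8.
	gke := MemoryCalc{
		Mode: "memory",
		Reserved: []float64{256, 256, 256, 256, 205, 205, 205, 205,
			102, 102, 102, 102, 102, 102, 102, 102},
		Min: 255,
	}
	fmt.Println(gke.ReserveMiB(110, 8)) // 1844 MiB on an 8 GiB node
}
```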
Just doing a cursory look over the proposed API, this strikes me as really difficult for someone to reason about how Karpenter is actually going to calculate these values. I still find myself leaning towards a static value approach (something similar to, but maybe not looking exactly like, the instance type overrides approach) where some other component is responsible for reconciling those static values in a way that allows scenarios like the ones that you have called out above.
@jonathan-innis the problem here is that we need to model a complex system, so the API will be complex; if you need this functionality then I think you'd be able to understand it.
Yes, that could work; are you thinking of named algorithms? In a dynamic reserved scenario it'd be pretty easy to get this working in AL2, but for Bottlerocket we need a bootstrap container (bottlerocket-os/bottlerocket#2010). Ideally this would be published by Bottlerocket, but Karpenter could create its own; within AWS this would have trivial cost, as it should just be the public ECR image with an alternate entrypoint. All of that said, I've had some more thoughts on this whole area and I think the following logic describes the whole system.
I think the current EKS memory calculation is probably the worst of all worlds, as it's based on ENI limits for pod density and on Docker as the container runtime. I think the GKE memory calculation covers most of the above logic but isn't perfect; it's based on Docker, all dynamic reservations are in kube-reserved, and small nodes may still not have enough memory reserved. I think the current CPU calculation (they're all the same) could probably be simplified, as I don't think it's really a function of available cores.

TL;DR - Maybe static values would work on nodes over a certain size, but we'd still need a pattern for small nodes unless using pod overhead.
We're currently experimenting with running the following static values:

systemReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
kubeReserved:
  cpu: 100m
  memory: 1465Mi
  ephemeral-storage: 1Gi
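For reference (assuming these values are passed straight through to the kubelet), they correspond to flags along the lines of --system-reserved=cpu=100m,memory=100Mi,ephemeral-storage=1Gi and --kube-reserved=cpu=100m,memory=1465Mi,ephemeral-storage=1Gi.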
Tell us about your request
I'd like Karpenter to work correctly when not using the EKS optimised AMI calculations for --kube-reserved (as already discussed in #1490). The current logic in Karpenter is incorrect, as it assumes that pod count is directly related to the number of ENIs on an instance and that the kube-reserved memory calculation is a function of pods rather than memory.

I'd like Karpenter to use the GKE memory calculation when AWS_ENI_LIMITED_POD_DENSITY=false, as the CPU limit calculation already comes from this source.

Karpenter should probably also set the value for ephemeral storage too.
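For reference, a sketch of the GKE memory calculation referred to above, based on GKE's published reservation brackets; this Go function is illustrative and is not Karpenter or GKE code.

```go
package main

import "fmt"

// gkeKubeReservedMiB sketches GKE's published memory reservation formula:
// 255 MiB for machines with less than 1 GiB of memory, otherwise 25% of the
// first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next
// 112 GiB, and 2% of anything above 128 GiB.
func gkeKubeReservedMiB(memoryGiB float64) float64 {
	if memoryGiB < 1 {
		return 255
	}
	brackets := []struct {
		sizeGiB  float64
		fraction float64
	}{
		{4, 0.25},
		{4, 0.20},
		{8, 0.10},
		{112, 0.06},
	}
	reserved, remaining := 0.0, memoryGiB
	for _, b := range brackets {
		use := b.sizeGiB
		if remaining < use {
			use = remaining
		}
		reserved += use * 1024 * b.fraction
		remaining -= use
		if remaining <= 0 {
			return reserved
		}
	}
	return reserved + remaining*1024*0.02 // memory above 128 GiB
}

func main() {
	fmt.Printf("%.0f MiB\n", gkeKubeReservedMiB(8))  // ~1843 MiB
	fmt.Printf("%.0f MiB\n", gkeKubeReservedMiB(16)) // ~2662 MiB
}
```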
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We run EKS nodes using either custom networking with IPv4 IP prefixes, or IPv6 IP prefixes, which means that IP addresses aren't scarce. We follow the K8s large clusters guide, so we have a maximum of 110 pods with custom kube-reserved values based on the GKE calculations, and we need Karpenter to respect these settings.
Are you currently working around this issue?
I'm hoping that Karpenter doesn't break with custom values set, and if it does we'll move back to Cluster Autoscaler.
Additional context
I've been using this configuration in production ever since the AWS VPC CNI supported IP prefixes and it's solved a number of issues which we had with nodes before making this change.
Attachments
n/a