
Add support for limiting max pods per node to 110 #1490

Closed
stevehipwell opened this issue Mar 9, 2022 · 17 comments
Labels
documentation (Improvements or additions to documentation), feature (New feature or request)

Comments

@stevehipwell
Contributor

Tell us about your request
I'd like Karpenter to support setting a max pods per node limit, specifically 110 based on the K8s large clusters guide.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The current EKS logic for max pods contradicts the official Kubernetes guidance, and with ENI prefixes a node can theoretically support a huge number of pods. This would also make the kube-reserved calculation much simpler, as all nodes would use the same value.

Are you currently working around this issue?
n/a

Additional context
Anything else we should know?

Attachments
n/a

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stevehipwell stevehipwell added the feature New feature or request label Mar 9, 2022
@stevehipwell
Contributor Author

FYI @bwagner5.

@bwagner5
Contributor

bwagner5 commented Mar 9, 2022

Setting --AWSENILimitedPodDensity=false (or env var AWS_ENI_LIMITED_POD_DENSITY=false) should do what you want.

https://github.com/aws/karpenter/blob/e37909b223fd7625b57b59d28681561895643f09/pkg/utils/options/options.go#L45
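
For example, a minimal sketch of what that could look like as an environment variable on the Karpenter controller Deployment (the surrounding manifest and the exact way your install exposes env vars are left out, so treat the shape below as illustrative rather than the chart's exact interface):

```yaml
# Illustrative excerpt of the controller container spec; field placement
# depends on how Karpenter was installed.
env:
  - name: AWS_ENI_LIMITED_POD_DENSITY
    value: "false"  # disable the ENI-limited pod density calculation
```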

@bwagner5
Contributor

bwagner5 commented Mar 9, 2022

We should definitely document that...

@bwagner5 bwagner5 added the documentation Improvements or additions to documentation label Mar 9, 2022
@bwagner5 bwagner5 changed the title Add support for limiting max pods per node to 110 Document AWSENILimitedPodDensity to limit max-pods to 110 Mar 9, 2022
@stevehipwell
Contributor Author

So, jumping the gun on the docs, does this switch to using 110 pods? Also jumping the gun on some other comments I made, does Karpenter use custom kube-reserved calculations?

@bwagner5
Contributor

bwagner5 commented Mar 9, 2022

Apologies, was jumping the gun a bit :)

Yes, it does use 110 pods when setting AWS_ENI_LIMITED_POD_DENSITY to false (https://github.com/aws/karpenter/blob/e37909b223fd7625b57b59d28681561895643f09/pkg/cloudprovider/aws/launchtemplate.go#L364-L368)

You are correct that the kube-reserved calc still uses the eni limit:
https://github.com/aws/karpenter/blob/e37909b223fd7625b57b59d28681561895643f09/pkg/cloudprovider/aws/instancetype.go#L136

Let me change this issue back to track that, and I'll create a new one for the AWS_ENI_LIMITED_POD_DENSITY docs.
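
To make the difference concrete, here's a rough sketch of the two calculations (not the exact Karpenter code; the ENI formula and the 11 MiB/pod + 255 MiB kube-reserved heuristic are the commonly cited EKS bootstrap values, so treat the constants as assumptions):

```go
package main

import "fmt"

// eniLimitedMaxPods is the classic EKS formula: each ENI contributes its
// IPv4 addresses minus one (the primary address), plus 2 for host-network pods.
func eniLimitedMaxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

// kubeReservedMemoryMiB mirrors the EKS-optimized AMI heuristic of
// 11 MiB per pod plus a 255 MiB base.
func kubeReservedMemoryMiB(maxPods int) int {
	return 11*maxPods + 255
}

func main() {
	// m5.large: 3 ENIs x 10 IPv4 addresses -> 29 pods under the ENI limit.
	eniPods := eniLimitedMaxPods(3, 10)
	fmt.Println(eniPods, kubeReservedMemoryMiB(eniPods)) // 29 574

	// With AWS_ENI_LIMITED_POD_DENSITY=false, max-pods becomes a flat 110,
	// but per the link above the kube-reserved calc is still fed the
	// ENI-limited number.
	fmt.Println(110, kubeReservedMemoryMiB(110)) // 110 1465
}
```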

@bwagner5 bwagner5 changed the title Document AWSENILimitedPodDensity to limit max-pods to 110 Add support for limiting max pods per node to 110 Mar 9, 2022
@bwagner5 bwagner5 removed the documentation Improvements or additions to documentation label Mar 9, 2022
@stevehipwell
Contributor Author

@bwagner5 I think you were right to rename this issue if the flag sets max pods to 110; there should probably be a new issue for changing the overhead calculation and setting the kubelet flags with the correct overheads?

On a related topic, is it documented somewhere what node overrides you provide to bootstrap.sh or the BottleRocket TOML? I've got a list of functionality that we currently have to use a LaunchTemplate for that should probably be covered by native Karpenter config. I'd be happy to contribute towards implementing some of these if they're not already done.

@bwagner5
Contributor

We would love to work towards that list of needed LT parameters! I have a PR out that hopefully can get merged today that makes adding parameters to AMI families much easier (and also supports block device mappings). We'll probably go full steam ahead on getting LT parameters within Karpenter once that is merged. #1420

@suket22
Contributor

suket22 commented Mar 10, 2022

I think you were right to rename this issue if the flag sets max pods to 110; there should probably be a new issue for changing the overhead calculation and setting the kubelet flags with the correct overheads?

I don't think we're settled on what approach we want to take, but I'd want us to try and fix this in the EKS-optimized AMI and not have Karpenter repeat the calculation and set it on the kubelet. The issue you've cut in the EKS-optimized AMI repo is what needs to be addressed first IMO. It needs extensive testing, and the blast radius of making changes there is pretty large - new overheads being too large will lead to less dense nodes and higher EC2 costs, and overheads being too low (which I think is what you're running into) will lead to other node stability issues.

The BottleRocket AMI has just copied over the numbers that the EKS-optimized AMIs have, so whatever new calculations we do can probably be ported there too.

@ellistarn and I have been discussing this a bunch, and we're honestly just looking for resourcing to come up with better values. We absolutely intend on redoing our calculation for kube-reserved cpu and mem.

@stevehipwell
Contributor Author

We would love to work towards that list of needed LT parameters!

@bwagner5 I added the high value ones to #1327 and opened #1495 for the AMI version. Is that enough?

@suket22 have you seen how old the AMI issue is? If you look around on other issues and PRs you'll see that I've been asking this question for even longer. Furthermore, if you look at the source of the AL2 calculations you'll see that they copied the GKE values for some, but not all, of it (due to the ENI pod limit). My issue has the actual K8s docs linked and explains why this should all have been re-worked when the CNI started supporting prefixes (if not before). Finally, I'm suggesting this as an optional parameter, but I can say that the current logic is actually incorrect and can result in stuck nodes; I'd expect most users to be overriding them by default.

@suket22
Contributor

suket22 commented Mar 10, 2022

My issue has the actual K8s docs linked and explains why this should all have been re-worked when the CNI started supporting prefixes (if not before).

I think I called this out in the other issue, but while I agree with your overall sentiment, keep in mind the bootstrap script has no notion of what version of the CNI is running, so we can't make a change that benefits just those that have PrefixDelegation enabled. If we move to the GKE numbers for mem reserved, we will break a ton of customers (like before).

I agree with you that it needs re-work, and it's been a long time; there's just been a lack of resourcing on the EKS-optimized AMI front.

@stevehipwell
Contributor Author

@suket22 as an optional variable, the consumer knows which mode to use. The Karpenter RBAC could also be extended to automate this by looking up the CNI and its settings.

RE the reserved memory, it wouldn't break customers; it would actually stop potential issues (we've hit this a number of times and always specify the GKE values manually). What it would do is lower pod density, which would be comparable to GKE but still higher than AKS with its large eviction threshold.

I'm pretty sure the 110 number is based on the underlying OS kernel as well as capacity for per-node components such as the kubelet and DaemonSets. I get why the ENI limits were lower than this, but I can't see how a greater number can be justified.
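
For reference, the GKE values I'm referring to are a tiered percentage of machine memory rather than a per-pod amount; a sketch of the publicly documented tiers (ignoring the small-machine special case), purely for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// gkeStyleReservedMemoryGiB applies the tiered percentages GKE documents for
// memory reservation: 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of
// the next 8 GiB, 6% of the next 112 GiB and 2% of anything above 128 GiB.
func gkeStyleReservedMemoryGiB(machineGiB float64) float64 {
	tiers := []struct{ upTo, rate float64 }{
		{4, 0.25}, {8, 0.20}, {16, 0.10}, {128, 0.06},
	}
	reserved, prev := 0.0, 0.0
	for _, t := range tiers {
		if machineGiB <= prev {
			break
		}
		reserved += (math.Min(machineGiB, t.upTo) - prev) * t.rate
		prev = t.upTo
	}
	if machineGiB > 128 {
		reserved += (machineGiB - 128) * 0.02
	}
	return reserved
}

func main() {
	// An 8 GiB node reserves ~1.8 GiB under this scheme, versus the flat
	// 11 MiB/pod + 255 MiB heuristic used by the EKS-optimized AMI.
	fmt.Printf("%.2f GiB\n", gkeStyleReservedMemoryGiB(8))
}
```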

@bwagner5
Contributor

@bwagner5 I added the high value ones to #1327 and opened #1495 for the AMI version. Is that enough?

Yep! Thanks!

@felix-zhe-huang felix-zhe-huang added the documentation Improvements or additions to documentation label Mar 14, 2022
@stevehipwell
Contributor Author

@bwagner5 where are we with this? Is it just the documentation that's missing?

@bwagner5
Contributor

This is mostly a docs issue, but I'd still like to move away from a boolean flag to switch max-pods. @suket22 has a PR out (#1720) for Bottlerocket config support and is working on AL2 as well (I realize you're not using either of these, but it should carry over to generic bootstraps as well). We'd like to get to the point where the user is in full control over these params and Karpenter respects them in packing decisions when passed via the KubeletConfiguration in the provisioner spec.

@stevehipwell
Contributor Author

@bwagner5 I was literally just looking at the docs around this functionality, as I've been trying to see what's currently missing from Karpenter that we can't live without.

@jonathan-innis
Contributor

Provisioner-level --max-pods can now be set with the addition of #2261
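
For anyone landing here later, a sketch of what that looks like in the provisioner spec (field names per the v1alpha5 KubeletConfiguration; double-check against the docs for your Karpenter version):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    maxPods: 110  # flat per-node pod limit, independent of the ENI calculation
```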

@bwagner5
Contributor

@stevehipwell Let us know if there's more here, but should be good to go!
