Add support for limiting max pods per node to 110 #1490
FYI @bwagner5.
Setting
We should definitely document that...
So jumping the gun on the docs, does this switch to using 110 pods? Also jumping the gun on some other comments I made, does Karpenter use custom kube-reserved calculations?
Apologies, I was jumping the gun a bit :) Yes, it does use 110 pods when AWS_ENI_LIMITED_POD_DENSITY is set to false (https://github.com/aws/karpenter/blob/e37909b223fd7625b57b59d28681561895643f09/pkg/cloudprovider/aws/launchtemplate.go#L364-L368). You are correct that the kube-reserved calculation still uses the ENI limit. Let me change this issue back to track that and I'll create a new one for the AWS_ENI_LIMITED_POD_DENSITY docs.
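For anyone following along, here is a minimal sketch of what that toggle amounts to. This is not Karpenter's actual code; the function names are illustrative, and the ENI formula is the commonly documented AWS one (ENIs × (IPs per ENI − 1) + 2):

```go
package main

import "fmt"

// eniLimitedMaxPods is the classic AWS formula:
// maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
// (one IP per ENI is reserved for the primary address; +2 covers host-networking pods).
func eniLimitedMaxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

// maxPodsFor mirrors the behaviour discussed above: when ENI-limited pod
// density is disabled, fall back to the Kubernetes large-cluster default of 110.
func maxPodsFor(eniLimited bool, enis, ipsPerENI int) int {
	if !eniLimited {
		return 110
	}
	return eniLimitedMaxPods(enis, ipsPerENI)
}

func main() {
	// m5.large: 3 ENIs x 10 IPs -> 29 pods when ENI-limited, 110 otherwise.
	fmt.Println(maxPodsFor(true, 3, 10))  // 29
	fmt.Println(maxPodsFor(false, 3, 10)) // 110
}
```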
@bwagner5 I think you were right to rename this issue. If the flag sets max pods to 110, there should probably be a new issue for changing the overhead calculation and setting the kubelet flags with the correct overheads? On a related topic, is it documented somewhere which node overrides you provide to bootstrap.sh or the BottleRocket TOML? I've got a list of functionality that we currently have to use a LaunchTemplate for that should probably be covered by native Karpenter config. I'd be happy to contribute towards implementing some of these if they're not already done.
We would love to work towards that list of needed LT parameters! I have a PR out that hopefully can get merged today that makes adding parameters to AMI families much easier (and also supports block device mappings). We'll probably go full steam ahead on getting LT parameters within Karpenter once that is merged. #1420
I don't think we're settled on what approach we want to take, but I'd want us to try and fix this in the EKS-optimized AMI and not have Karpenter repeat the calculation and set it on the kubelet. The issue you've cut in the EKS-optimized AMI repo is what needs to be addressed first, IMO. It needs extensive testing, and the blast radius of making changes there is pretty large: new overheads being too large will lead to less dense nodes and higher EC2 costs, and overheads being too low (which I think is what you're running into) will lead to other node stability issues. The BottleRocket AMI has just copied over the numbers that the EKS-optimized AMIs use, so whatever new calculations we do can probably be ported there too. @ellistarn and I have been discussing this a bunch, and we're honestly just looking for resourcing to come up with better values. We absolutely intend on redoing our calculation for kube-reserved CPU and memory.
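For reference, here is a sketch of the GKE-style memory-reservation tiers that keep coming up in this thread. The percentages are the ones in the public GKE documentation; whether Karpenter or the EKS AMI should adopt them is exactly what's being debated, so treat this as an illustration rather than a proposal:

```go
package main

import "fmt"

// gkeReservedMemoryMiB applies the publicly documented GKE memory-reservation
// tiers to total node memory (in MiB): 255 MiB for sub-1GiB machines, otherwise
// 25% of the first 4 GiB, 20% of the next 4, 10% of the next 8, 6% of the next
// 112, and 2% of anything above 128 GiB.
func gkeReservedMemoryMiB(totalMiB float64) float64 {
	if totalMiB < 1024 {
		return 255
	}
	tiers := []struct {
		upToMiB float64 // cumulative upper bound of the tier
		rate    float64 // fraction reserved within the tier
	}{
		{4 * 1024, 0.25},
		{8 * 1024, 0.20},
		{16 * 1024, 0.10},
		{128 * 1024, 0.06},
		{1 << 62, 0.02}, // everything above 128 GiB
	}
	reserved, prev := 0.0, 0.0
	for _, t := range tiers {
		if totalMiB <= prev {
			break
		}
		span := totalMiB
		if span > t.upToMiB {
			span = t.upToMiB
		}
		reserved += (span - prev) * t.rate
		prev = t.upToMiB
	}
	return reserved
}

func main() {
	// A 16 GiB node: 25% of 4 GiB + 20% of 4 GiB + 10% of 8 GiB ≈ 2662 MiB reserved.
	fmt.Printf("%.0f MiB\n", gkeReservedMemoryMiB(16*1024))
}
```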
@bwagner5 I added the high-value ones to #1327 and opened #1495 for the AMI version. Is that enough? @suket22 have you seen how old the AMI issue is? If you look around on other issues and PRs you'll see that I've been asking this question for even longer. Furthermore, if you look at the source of the AL2 calculations you'll see that they copied the GKE values for some but not all of it (due to the ENI pod limit). My issue has the actual K8s docs linked and explains why this should all have been re-worked when the CNI started supporting prefixes (if not before). Finally, I'm suggesting it as an optional parameter, but I can say that the current logic is actually incorrect and can result in stuck nodes; I'd expect most users to be overriding these values by default.
I think I called this out in the other issue, but while I agree with your overall sentiment, keep in mind the bootstrap script has no notion of which version of the CNI is running, so we can't make a change that benefits just those that have PrefixDelegation enabled. If we move to the GKE numbers for reserved memory, we will break a ton of customers (like before). I agree with you that it needs re-work, and it's been a long time; there's just been a lack of resourcing on the EKS-optimized AMI front.
@suket22 as an optional variable, the consumer knows which mode to use. The Karpenter RBAC could also be extended to automate this by looking up the CNI and its settings. Re the reserved memory, it wouldn't break customers; it would actually prevent potential issues (we've hit this a number of times and always specify the GKE values manually). What it would do is lower pod density, which would be comparable to GKE but still higher than AKS with its large eviction threshold. I'm pretty sure the 110 number is based on the underlying OS kernel as well as capacity for single components such as the kubelet and DaemonSets. I get why the ENI limits were lower than this, but I can't see how a greater number can be justified.
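To put rough numbers on the prefix-delegation point: each prefix attached to an ENI address slot is a /28 (16 addresses), so the theoretical pod capacity quickly dwarfs anything the node can realistically run, which is where a 110-style cap comes in. A sketch under those assumptions (the formula follows the public VPC CNI prefix-delegation docs, not Karpenter's code):

```go
package main

import "fmt"

// prefixDelegatedCapacity: with prefix delegation, every secondary IP slot on
// an ENI holds a /28 prefix (16 addresses), so the raw capacity is
//   ENIs * (slots per ENI - 1) * 16 + 2
func prefixDelegatedCapacity(enis, slotsPerENI int) int {
	return enis*(slotsPerENI-1)*16 + 2
}

func main() {
	// m5.large again: 3 ENIs x 10 slots -> 434 theoretical pods,
	// which is why capping at the upstream-recommended 110 is attractive.
	raw := prefixDelegatedCapacity(3, 10)
	capped := raw
	if capped > 110 {
		capped = 110
	}
	fmt.Println(raw, capped) // 434 110
}
```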
@bwagner5 where are we with this? Is it just the documentation that's missing?
This is mostly a docs issue, but I'd still like to move away from a boolean flag to switch max-pods. @suket22 has a PR out (#1720) for bottlerocket config support and is working on AL2 as well (I realize you're not using either of these, but it should carry over to generic bootstraps as well). We'd like to get to the point where the user is in full control over these params and Karpenter respects them in packing decisions when passed via the
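For anyone wondering what "respects them in packing decisions" means in practice: bin-packing has to treat pod count as a constraint alongside CPU and memory. A toy sketch, nothing here is Karpenter's actual packing code:

```go
package main

import "fmt"

// podsThatFit returns how many identical pods fit on a node once both the
// resource headroom and the kubelet max-pods limit are taken into account.
func podsThatFit(allocatableCPU, allocatableMemMiB, podCPU, podMemMiB float64, maxPods int) int {
	byCPU := int(allocatableCPU / podCPU)
	byMem := int(allocatableMemMiB / podMemMiB)
	n := byCPU
	if byMem < n {
		n = byMem
	}
	if maxPods < n {
		n = maxPods
	}
	return n
}

func main() {
	// 16 vCPU / 64 GiB node with 100m / 128Mi pods: resources alone would allow
	// ~160 pods, but a max-pods of 110 becomes the binding constraint.
	fmt.Println(podsThatFit(16, 64*1024, 0.1, 128, 110)) // 110
}
```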
@bwagner5 I was literally just looking at the docs around this functionality, as I've been trying to see what's currently missing from Karpenter that we can't live without.
Provisioner-level
@stevehipwell Let us know if there's more here, but this should be good to go!
Tell us about your request
I'd like Karpenter to support setting a max pods per node limit, specifically 110 based on the K8s large clusters guide.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The current EKS logic for max pods contradicts the official Kubernetes guidance, and with ENI prefixes a node can theoretically support a huge number of pods. This would also make the kube-reserved calculation much simpler, as all nodes would use the same value.
Are you currently working around this issue?
n/a
Additional context
Anything else we should know?
Attachments
n/a