Karpenter provisioned nodes become "NotReady" #4200

Closed
maximveksler opened this issue Jul 4, 2023 · 7 comments
Labels: bug (Something isn't working), triage/needs-investigation (Issues that need to be investigated before triaging)

@maximveksler

Description

Observed Behavior:

We are running an EKS 1.24 dev cluster consisting of 6 nodes:

  • 3x m5.4xlarge aimed for dev deployment workloads
  • 3x r6i.large tainted for sts

The environment is moderately volatile, consisting of CI jobs and other CronJobs as well as ~20-30 dev environments (one namespace each), each consisting of ~10 Deployment (scale=1) and 1 DaemonSet k8s objects, totaling ~15 pods per environment.

We've introduced Karpenter 0.28.1 to assist with dynamic load, so that we can provision additional capacity.

The problem we are seeing is that while the "EKS Node Group" based nodes remain stable, the Karpenter-provisioned nodes tend to end up with:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

Which then turns into node.kubernetes.io/unschedulable:NoSchedule,

after which the node gets killed by the Karpenter controller.
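
For anyone trying to observe the same cycle, a minimal sketch of watching the node status and taints as they change (this assumes the v1alpha5 karpenter.sh/provisioner-name node label; <node-name> is a placeholder):

# watch readiness of the Karpenter-provisioned nodes
kubectl get nodes -l karpenter.sh/provisioner-name=dev-capacity -o wide --watch
# inspect the taints on a specific node
kubectl describe node <node-name> | grep -A5 Taints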

Expected Behavior:

Karpenter provisioned nodes should not become Unschedulable.

Reproduction Steps (Please include YAML):

I have managed to catch a snapshot of a node going through this cycle. Please see a gist of kubectl describe node output with 2 snapshots: the 1st when the node is marked NoSchedule and the 2nd when it's already in the process of being terminated by Karpenter: https://gist.github.com/maximveksler/48e303dc5782c90d7c6d4b5b167351f2

Provider spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-capacity
spec:
  requirements:
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
        - t
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: Gt
      values:
        - '10'
  consolidation:
    enabled: true
  providerRef:
    name: default
  kubeletConfiguration:
    systemReserved:
      cpu: 100m
      memory: 100Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 100Mi
      ephemeral-storage: 5Gi
    evictionHard:
      memory.available: 2%
      nodefs.available: 2%
      nodefs.inodesFree: 2%
    evictionSoft:
      memory.available: 200Mi
      nodefs.available: 5%
      nodefs.inodesFree: 5%
    evictionSoftGracePeriod:
      memory.available: 1m
      nodefs.available: 1m30s
      nodefs.inodesFree: 2m
    podsPerCore: 8
  limits:
    resources:
      cpu: 100
      memory: 3000Gi

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 384Gi    

For the complete installation steps, please see https://gist.github.com/maximveksler/38ec0cefa0ca2acccab748e71e5aebc0

Versions:

  • Chart Version: 0.28.1
  • Kubernetes Version (kubectl version):
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:38Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.14-eks-c12679a", GitCommit:"05d192f0de17608d98e17761ad3cffa9a6407f2f", GitTreeState:"clean", BuildDate:"2023-05-22T23:41:27Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.24) exceeds the supported minor version skew of +/-1
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@maximveksler maximveksler added the bug Something isn't working label Jul 4, 2023
@maximveksler (Author)

Discussion context (cross-referencing): https://kubernetes.slack.com/archives/C02SFFZSA2K/p1688048847670199

@tzneal (Contributor) commented Jul 4, 2023

Is the node going NotReady by itself, with pods being evicted and then Karpenter removing the node?

-or-

Does Karpenter start to deprovision the node and then cordon/drain it?

Trying to determine if it's the node going bad, or just standard consolidation.

@maximveksler (Author)

@tzneal I'm not sure.

We're currently not collecting kubelet / node-level logs, so let me know if those are relevant and I'll make sure to capture them the next time, as this issue is a recurring problem for us. Or is there a different source for retrieving this information? In that case I'd appreciate guidance on how to fetch it.

@tzneal (Contributor) commented Jul 10, 2023

Yes, when the node goes NotReady, can you capture the logs with this log collector tool and supply them? They may contain sensitive data, so you can submit them via a support ticket.

https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux
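
A minimal sketch of running it, assuming SSH or SSM access to the affected node (the script path is taken from that repo; the exact output location may vary by version):

# on the affected node
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh
# the resulting eks_*.tar.gz bundle under /var/log is what to attach to the ticket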

@maximveksler (Author)

We've analyzed this issue internally. I don't think it's Karpenter related directly.

The nodes have been terminating due to resource over-utilization, both memory and CPU. In both cases the kubelet process would become non-responsive, which would eventually lead to the node being marked unhealthy and then recycled by Karpenter.

The problem is that this is a dev/R&D environment with various experiments running side by side, so it's difficult to properly size pod resources. It's a matter of tolerance for how much you over-provision the requests & limits, as sometimes the kernel cgroups allow a process to temporarily break its ceiling when there are enough resources on the node. The problem, for us at least, is that in various cases several environments decide to break their budget at the same time, which in turn generates a compounding effect.

To explain why this only started to appear with the introduction of Karpenter: the 3x m5.4xlarge node-group-based nodes provided a healthy balance between the number of pods and the amount of resources available. Thus, even if pods over-utilized their limits, or in some cases didn't define requests and limits at all, the environment would still be able to cope and continue running normally. This is obviously not the case for Karpenter ATM, which mandates a strict node selection based on the defined requests of the k8s objects.

This made it practically unusable for us in default mode; to work around it we've resorted to hard-coding the lower bound of the node selection in the provisioner, which works for us (for now):

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-capacity
spec:
  requirements:
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: "karpenter.k8s.aws/instance-hypervisor"
      operator: In
      values: ["nitro"]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: Gt
      values:
        - '16'
    - key: "karpenter.k8s.aws/instance-memory"
      operator: Gt
      values:
        - '50000'
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["eu-west-3c"]
  consolidation:
    enabled: true
  providerRef:
    name: default
  kubeletConfiguration:
    systemReserved:
      cpu: 100m
      memory: 100Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 100Mi
      ephemeral-storage: 5Gi
    evictionHard:
      memory.available: 2%
      nodefs.available: 2%
      nodefs.inodesFree: 2%
    evictionSoft:
      memory.available: 200Mi
      nodefs.available: 5%
      nodefs.inodesFree: 5%
    evictionSoftGracePeriod:
      memory.available: 1m
      nodefs.available: 1m30s
      nodefs.inodesFree: 2m
  limits:
    resources:
      cpu: 100
      memory: 3000Gi

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 384Gi
EOF

The long term solution, I believe, should be the ability to affect the provisioner's resource calculation via a "profile" (a purely hypothetical sketch follows the list below), so that you would have, for example:

  1. Economic profile (attempts to pick the most economical node given the resource.requests values; this should be the default)
  2. Sparse profile (takes as a value how much "extra" resource to allocate per pod during the node sizing calculation, thus allowing it to respond dynamically to cluster load)
  3. Fixed capacity profile (allows defining "hard-coded" memory, CPU & co. overhead, which will be appended to the total resource sizing performed by Karpenter during node sizing)
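
Purely as a hypothetical sketch of what option 2 might look like on the v1alpha5 Provisioner (none of these fields exist in Karpenter today; the names are made up for illustration only):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-capacity
spec:
  # hypothetical field: pad each pod's requests during node sizing
  sizingProfile:
    type: Sparse
    perPodOverhead:
      cpu: 250m
      memory: 256Mi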

@tzneal what's your take on this? (btw, cool script. ty!)

@billrayburn billrayburn added the triage/needs-investigation Issues that need to be investigated before triaging label Aug 2, 2023
@jigisha620 (Contributor)

I think the best way to get around this is to set requests and limits on pods to larger values. You can refer to this issue if you want to dynamically size the kubeReserved resources based on the instance type.
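
For illustration, a minimal sketch of what that could look like on one of the dev Deployments, using the same kubectl apply pattern as above (the name, image, and values are placeholders, not taken from the cluster in question):

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-dev-service   # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-dev-service
  template:
    metadata:
      labels:
        app: example-dev-service
    spec:
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest   # placeholder image
          resources:
            requests:        # what Karpenter sizes the node against
              cpu: 500m
              memory: 512Mi
            limits:          # headroom for bursting; keep close to observed peak usage
              cpu: "1"
              memory: 1Gi
EOF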

@billrayburn billrayburn assigned engedaam and unassigned jigisha620 May 22, 2024
@engedaam (Contributor) commented Jun 5, 2024

Generally, you should be setting requests and limits, as the kubelet will allow workloads to burst up to the limits value that is configured. In the case where users don't define limits, workloads will consume as much resource as they need. The kubeReserved and systemReserved values are accounted for when calculating how much can fit on a node, not the maximum resource that can be used by a pod: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits. This can result in some workloads consuming more resources than the scheduler is expecting.

The requests and limits on the node would suggest that the pods could have been bursting beyond the capacity of what is available. When provisioning nodes, Karpenter considers the resource requests, not the resource limits. Setting systemReserved may help; however, the customer should consider adjusting the resource requests and limits for their pods. If a container exceeds its memory request and the node it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted. A container might or might not be allowed to exceed its CPU limit for extended periods of time; however, container runtimes don't terminate Pods or containers for excessive CPU usage. If these pods were bursting and using more memory than requested, we can see the behavior described by the customer.
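
One way to check for the over-commitment described above is to compare a node's allocatable resources with the total requests and limits of the pods scheduled onto it (a minimal sketch; <node-name> is a placeholder):

kubectl describe node <node-name> | grep -A 10 "Allocated resources"
# limits totaling well above 100% of allocatable suggest bursting pods can exhaust the node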

@engedaam engedaam closed this as completed Jun 5, 2024