diff --git a/website/content/en/preview/concepts/disruption.md b/website/content/en/preview/concepts/disruption.md
index b4f309ed30b2..1c0693925053 100644
--- a/website/content/en/preview/concepts/disruption.md
+++ b/website/content/en/preview/concepts/disruption.md
@@ -95,7 +95,7 @@ When there are multiple nodes that could be potentially deleted or replaced, Kar
 
 If consolidation is enabled, Karpenter periodically reports events against nodes that indicate why the node can't be consolidated. These events can be used to investigate nodes that you expect to have been consolidated, but still remain in your cluster.
 
-```console
+```bash
 Events:
   Type     Reason                   Age   From       Message
   ----     ------                   ----  ----       -------
diff --git a/website/content/en/preview/concepts/nodeclasses.md b/website/content/en/preview/concepts/nodeclasses.md
index ca668aef11f0..718e7eea3253 100644
--- a/website/content/en/preview/concepts/nodeclasses.md
+++ b/website/content/en/preview/concepts/nodeclasses.md
@@ -131,7 +131,7 @@ AMIFamily is a required field, dictating both the default bootstrapping logic fo
 
 ### AL2
 
-```console
+```bash
 MIME-Version: 1.0
 Content-Type: multipart/mixed; boundary="//"
 
@@ -166,7 +166,7 @@ max-pods = 110
 
 ### Ubuntu
 
-```console
+```bash
 MIME-Version: 1.0
 Content-Type: multipart/mixed; boundary="//"
 
diff --git a/website/content/en/preview/concepts/nodepools.md b/website/content/en/preview/concepts/nodepools.md
index 73a0704d227a..68c2eb5726cf 100644
--- a/website/content/en/preview/concepts/nodepools.md
+++ b/website/content/en/preview/concepts/nodepools.md
@@ -6,116 +6,140 @@ description: >
   Configure Karpenter with NodePools
 ---
 
-When you first installed Karpenter, you set up a default Provisioner.
-The Provisioner sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes.
-The Provisioner can be set to do things like:
+When you first installed Karpenter, you set up a default NodePool.
+The NodePool sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes.
+The NodePool can be set to do things like:
 
 * Define taints to limit the pods that can run on nodes Karpenter creates
 * Define any startup taints to inform Karpenter that it should taint the node initially, but that the taint is temporary.
 * Limit node creation to certain zones, instance types, and computer architectures
 * Set defaults for node expiration
 
-You can change your Provisioner or add other Provisioners to Karpenter.
-Here are things you should know about Provisioners:
+You can change your NodePool or add other NodePools to Karpenter.
+Here are things you should know about NodePools:
 
-* Karpenter won't do anything if there is not at least one Provisioner configured.
-* Each Provisioner that is configured is looped through by Karpenter.
-* If Karpenter encounters a taint in the Provisioner that is not tolerated by a Pod, Karpenter won't use that Provisioner to provision the pod.
-* If Karpenter encounters a startup taint in the Provisioner it will be applied to nodes that are provisioned, but pods do not need to tolerate the taint. Karpenter assumes that the taint is temporary and some other system will remove the taint.
-* It is recommended to create Provisioners that are mutually exclusive. So no Pod should match multiple Provisioners. If multiple Provisioners are matched, Karpenter will use the Provisioner with the highest [weight](#specweight).
+* Karpenter won't do anything if there is not at least one NodePool configured.
+* Each NodePool that is configured is looped through by Karpenter.
+* If Karpenter encounters a taint in the NodePool that is not tolerated by a Pod, Karpenter won't use that NodePool to provision the pod.
+* If Karpenter encounters a startup taint in the NodePool it will be applied to nodes that are provisioned, but pods do not need to tolerate the taint. Karpenter assumes that the taint is temporary and some other system will remove the taint.
+* It is recommended to create NodePools that are mutually exclusive, so that no Pod matches multiple NodePools. If multiple NodePools are matched, Karpenter will use the NodePool with the highest [weight](#specweight).
 
-For some example `Provisioner` configurations, see the [examples in the Karpenter GitHub repository](https://github.com/aws/karpenter/blob/main/examples/provisioner/).
+For some example `NodePool` configurations, see the [examples in the Karpenter GitHub repository](https://github.com/aws/karpenter/blob/main/examples/nodepool/).
 
 ```yaml
-apiVersion: karpenter.sh/v1alpha5
-kind: Provisioner
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
 metadata:
   name: default
 spec:
-  # References cloud provider-specific custom resource, see your cloud provider specific documentation
-  providerRef:
-    name: default
-
-  # Provisioned nodes will have these taints
-  # Taints may prevent pods from scheduling if they are not tolerated by the pod.
-  taints:
-    - key: example.com/special-taint
-      effect: NoSchedule
-
-  # Provisioned nodes will have these taints, but pods do not need to tolerate these taints to be provisioned by this
-  # provisioner. These taints are expected to be temporary and some other entity (e.g. a DaemonSet) is responsible for
-  # removing the taint after it has finished initializing the node.
-  startupTaints:
-    - key: example.com/another-taint
-      effect: NoSchedule
-
-  # Labels are arbitrary key-values that are applied to all nodes
-  labels:
-    billing-team: my-team
-
-  # Annotations are arbitrary key-values that are applied to all nodes
-  annotations:
-    example.com/owner: "my-team"
-
-  # Requirements that constrain the parameters of provisioned nodes.
-  # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
-  # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
-  # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
-  requirements:
-    - key: "karpenter.k8s.aws/instance-category"
-      operator: In
-      values: ["c", "m", "r"]
-    - key: "karpenter.k8s.aws/instance-cpu"
-      operator: In
-      values: ["4", "8", "16", "32"]
-    - key: "karpenter.k8s.aws/instance-hypervisor"
-      operator: In
-      values: ["nitro"]
-    - key: "karpenter.k8s.aws/instance-generation"
-      operator: Gt
-      values: ["2"]
-    - key: "topology.kubernetes.io/zone"
-      operator: In
-      values: ["us-west-2a", "us-west-2b"]
-    - key: "kubernetes.io/arch"
-      operator: In
-      values: ["arm64", "amd64"]
-    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
-      operator: In
-      values: ["spot", "on-demand"]
-
-  # Karpenter provides the ability to specify a few additional Kubelet args.
-  # These are all optional and provide support for additional customization and use cases.
-  kubeletConfiguration:
-    clusterDNS: ["10.0.1.100"]
-    containerRuntime: containerd
-    systemReserved:
-      cpu: 100m
-      memory: 100Mi
-      ephemeral-storage: 1Gi
-    kubeReserved:
-      cpu: 200m
-      memory: 100Mi
-      ephemeral-storage: 3Gi
-    evictionHard:
-      memory.available: 5%
-      nodefs.available: 10%
-      nodefs.inodesFree: 10%
-    evictionSoft:
-      memory.available: 500Mi
-      nodefs.available: 15%
-      nodefs.inodesFree: 15%
-    evictionSoftGracePeriod:
-      memory.available: 1m
-      nodefs.available: 1m30s
-      nodefs.inodesFree: 2m
-    evictionMaxPodGracePeriod: 60
-    imageGCHighThresholdPercent: 85
-    imageGCLowThresholdPercent: 80
-    cpuCFSQuota: true
-    podsPerCore: 2
-    maxPods: 20
-
+  # Template section that describes how to template out NodeClaim resources that Karpenter will provision
+  # Karpenter will consider this template to be the minimum requirements needed to provision a Node using this NodePool
+  # It will overlay this NodePool with Pods that need to schedule to further constrain the NodeClaims
+  # Karpenter will provision to launch new Nodes for the cluster
+  template:
+    metadata:
+      # Labels are arbitrary key-values that are applied to all nodes
+      labels:
+        billing-team: my-team
+
+      # Annotations are arbitrary key-values that are applied to all nodes
+      annotations:
+        example.com/owner: "my-team"
+    spec:
+      # References the Cloud Provider's NodeClass resource, see your cloud provider specific documentation
+      nodeClassRef:
+        name: default
+
+      # Provisioned nodes will have these taints
+      # Taints may prevent pods from scheduling if they are not tolerated by the pod.
+      taints:
+        - key: example.com/special-taint
+          effect: NoSchedule
+
+      # Provisioned nodes will have these taints, but pods do not need to tolerate these taints to be provisioned by this
+      # NodePool. These taints are expected to be temporary and some other entity (e.g. a DaemonSet) is responsible for
+      # removing the taint after it has finished initializing the node.
+      startupTaints:
+        - key: example.com/another-taint
+          effect: NoSchedule
+
+      # Requirements that constrain the parameters of provisioned nodes.
+      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
+      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
+      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
+      requirements:
+        - key: "karpenter.k8s.aws/instance-category"
+          operator: In
+          values: ["c", "m", "r"]
+        - key: "karpenter.k8s.aws/instance-cpu"
+          operator: In
+          values: ["4", "8", "16", "32"]
+        - key: "karpenter.k8s.aws/instance-hypervisor"
+          operator: In
+          values: ["nitro"]
+        - key: "karpenter.k8s.aws/instance-generation"
+          operator: Gt
+          values: ["2"]
+        - key: "topology.kubernetes.io/zone"
+          operator: In
+          values: ["us-west-2a", "us-west-2b"]
+        - key: "kubernetes.io/arch"
+          operator: In
+          values: ["arm64", "amd64"]
+        - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
+          operator: In
+          values: ["spot", "on-demand"]
+
+      # Karpenter provides the ability to specify a few additional Kubelet args.
+      # These are all optional and provide support for additional customization and use cases.
+      kubelet:
+        clusterDNS: ["10.0.1.100"]
+        containerRuntime: containerd
+        systemReserved:
+          cpu: 100m
+          memory: 100Mi
+          ephemeral-storage: 1Gi
+        kubeReserved:
+          cpu: 200m
+          memory: 100Mi
+          ephemeral-storage: 3Gi
+        evictionHard:
+          memory.available: 5%
+          nodefs.available: 10%
+          nodefs.inodesFree: 10%
+        evictionSoft:
+          memory.available: 500Mi
+          nodefs.available: 15%
+          nodefs.inodesFree: 15%
+        evictionSoftGracePeriod:
+          memory.available: 1m
+          nodefs.available: 1m30s
+          nodefs.inodesFree: 2m
+        evictionMaxPodGracePeriod: 60
+        imageGCHighThresholdPercent: 85
+        imageGCLowThresholdPercent: 80
+        cpuCFSQuota: true
+        podsPerCore: 2
+        maxPods: 20
+
+  # Disruption section which describes the ways in which Karpenter can disrupt and replace Nodes
+  # Configuration in this section constrains how aggressive Karpenter can be with performing operations
+  # like rolling Nodes due to them hitting their maximum lifetime (expiry) or scaling down nodes to reduce cluster cost
+  disruption:
+    # Describes which types of Nodes Karpenter should consider for consolidation
+    # If using 'WhenUnderutilized', Karpenter will consider all nodes for consolidation and attempt to remove or replace Nodes when it discovers that the Node is underutilized and could be changed to reduce cost
+    # If using `WhenEmpty`, Karpenter will only consider nodes for consolidation that contain no workload pods
+    consolidationPolicy: WhenUnderutilized | WhenEmpty
+
+    # The amount of time Karpenter should wait after discovering a consolidation decision
+    # This value can currently only be set when the consolidationPolicy is 'WhenEmpty'
+    # You can choose to disable consolidation entirely by setting the string value 'Never' here
+    consolidateAfter: 30s
+
+    # The amount of time a Node can live on the cluster before being removed
+    # Avoiding long-running Nodes helps to reduce security vulnerabilities as well as to reduce the chance of issues that can plague Nodes with long uptimes such as file fragmentation or memory leaks from system processes
+    # You can choose to disable expiration entirely by setting the string value 'Never' here
+    expireAfter: 720h
 
   # Resource limits constrain the total size of the cluster.
   # Limits prevent Karpenter from creating new instances once the limit is exceeded.
@@ -124,35 +148,21 @@ spec:
     cpu: "1000"
     memory: 1000Gi
 
-  # Enables consolidation which attempts to reduce cluster cost by both removing un-needed nodes and down-sizing those
-  # that can't be removed. Mutually exclusive with the ttlSecondsAfterEmpty parameter.
-  consolidation:
-    enabled: true
-
-  # If omitted, the feature is disabled and nodes will never expire. If set to less time than it requires for a node
-  # to become ready, the node may expire before any pods successfully start.
-  ttlSecondsUntilExpired: 2592000 # 30 Days = 60 * 60 * 24 * 30 Seconds;
-
-  # If omitted, the feature is disabled, nodes will never scale down due to low utilization
-  ttlSecondsAfterEmpty: 30
-
-  # Priority given to the provisioner when the scheduler considers which provisioner
-  # to select. Higher weights indicate higher priority when comparing provisioners.
+  # Priority given to the NodePool when the scheduler considers which NodePool
+  # to select. Higher weights indicate higher priority when comparing NodePools.
   # Specifying no weight is equivalent to specifying a weight of 0.
   weight: 10
 ```
 
-## spec.requirements
+## spec.template.spec.requirements
 
-Kubernetes defines the following [Well-Known Labels](https://kubernetes.io/docs/reference/labels-annotations-taints/), and cloud providers (e.g., AWS) implement them. They are defined at the "spec.requirements" section of the Provisioner API.
+Kubernetes defines the following [Well-Known Labels](https://kubernetes.io/docs/reference/labels-annotations-taints/), and cloud providers (e.g., AWS) implement them. They are defined in the "spec.template.spec.requirements" section of the NodePool API.
 
 In addition to the well-known labels from Kubernetes, Karpenter supports AWS-specific labels for more advanced scheduling. See the full list [here](../scheduling/#well-known-labels).
 
-These well-known labels may be specified at the provisioner level, or in a workload definition (e.g., nodeSelector on a pod.spec). Nodes are chosen using both the provisioner's and pod's requirements. If there is no overlap, nodes will not be launched. In other words, a pod's requirements must be within the provisioner's requirements. If a requirement is not defined for a well known label, any value available to the cloud provider may be chosen.
-
-For example, an instance type may be specified using a nodeSelector in a pod spec. If the instance type requested is not included in the provisioner list and the provisioner has instance type requirements, Karpenter will not create a node or schedule the pod.
+These well-known labels may be specified at the NodePool level, or in a workload definition (e.g., nodeSelector on a pod.spec). Nodes are chosen using both the NodePool's and pod's requirements. If there is no overlap, nodes will not be launched. In other words, a pod's requirements must be within the NodePool's requirements. If a requirement is not defined for a well known label, any value available to the cloud provider may be chosen.
 
-📝 None of these values are required.
+For example, an instance type may be specified using a nodeSelector in a pod spec. If the instance type requested is not included in the NodePool list and the NodePool has instance type requirements, Karpenter will not create a node or schedule the pod.
 
 ### Instance Types
 
@@ -165,20 +175,6 @@ Generally, instance types should be a list and not a single value. Leaving these
 
 Review [AWS instance types](../instance-types). Most instance types are supported with the exclusion of [non-HVM](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html).
 
-{{% alert title="Defaults" color="secondary" %}}
-If no instance type constraints are defined, Karpenter will set default instance type constraints on your Provisioner that supports most common user workloads:
-
-```yaml
-requirements:
-  - key: karpenter.k8s.aws/instance-category
-    operator: In
-    values: ["c", "m", "r"]
-  - key: karpenter.k8s.aws/instance-generation
-    operator: Gt
-    values: ["2"]
-```
-{{% /alert %}}
-
 ### Availability Zones
 
 - key: `topology.kubernetes.io/zone`
@@ -199,17 +195,6 @@ IDs.](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html)
 
 Karpenter supports `amd64` nodes, and `arm64` nodes.
 
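+For example, a pod can narrow these architecture choices with a `nodeSelector`; this is a minimal sketch (the pod name and image are illustrative), and the pod only schedules if `arm64` is within the NodePool's requirements:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: arm64-example
+spec:
+  # Constrains this pod to arm64 nodes; must overlap with the NodePool's kubernetes.io/arch requirement
+  nodeSelector:
+    kubernetes.io/arch: arm64
+  containers:
+    - name: app
+      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
+```
+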
-{{% alert title="Defaults" color="secondary" %}}
-If no architecture constraint is defined, Karpenter will set the default architecture constraint on your Provisioner that supports most common user workloads:
-
-```yaml
-requirements:
-  - key: kubernetes.io/arch
-    operator: In
-    values: ["amd64"]
-```
-{{% /alert %}}
-
 ### Operating System
 - key: `kubernetes.io/os`
 - values
@@ -218,17 +203,6 @@ requirements:
 
 Karpenter supports `linux` and `windows` operating systems.
 
-{{% alert title="Defaults" color="secondary" %}}
-If no operating system constraint is defined, Karpenter will set the default operating system constraint on your Provisioner that supports most common user workloads:
-
-```yaml
-requirements:
-  - key: kubernetes.io/os
-    operator: In
-    values: ["linux"]
-```
-{{% /alert %}}
-
 ### Capacity Type
 
 - key: `karpenter.sh/capacity-type`
@@ -238,64 +212,78 @@ requirements:
 
 Karpenter supports specifying capacity type, which is analogous to [EC2 purchase options](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-purchasing-options.html).
 
-Karpenter prioritizes Spot offerings if the provisioner allows Spot and on-demand instances. If the provider API (e.g. EC2 Fleet's API) indicates Spot capacity is unavailable, Karpenter caches that result across all attempts to provision EC2 capacity for that instance type and zone for the next 45 seconds. If there are no other possible offerings available for Spot, Karpenter will attempt to provision on-demand instances, generally within milliseconds.
+Karpenter prioritizes Spot offerings if the NodePool allows Spot and on-demand instances. If the provider API (e.g. EC2 Fleet's API) indicates Spot capacity is unavailable, Karpenter caches that result across all attempts to provision EC2 capacity for that instance type and zone for the next 45 seconds. If there are no other possible offerings available for Spot, Karpenter will attempt to provision on-demand instances, generally within milliseconds.
 
 Karpenter also allows `karpenter.sh/capacity-type` to be used as a topology key for enforcing topology-spread.
 
-{{% alert title="Defaults" color="secondary" %}}
-If no capacity type constraint is defined, Karpenter will set the default capacity type constraint on your Provisioner that supports most common user workloads:
+{{% alert title="Recommended" color="primary" %}}
+Karpenter allows you to be extremely flexible with your NodePools by only constraining your instance types in ways that are absolutely necessary for your cluster. By default, Karpenter will enforce that you specify the `spec.template.spec.requirements` field, but will not enforce that you specify any requirements within the field. If you choose to specify `requirements: []`, this means that you will be completely flexible to _all_ instance types that your cloud provider supports.
+
+Though Karpenter doesn't enforce these defaults, for most use-cases, we recommend that you specify _some_ requirements to avoid odd behavior or exotic instance types. Below is a high-level recommendation for requirements that should fit the majority of use-cases for generic workloads:
 
 ```yaml
-requirements:
-  - key: karpenter.sh/capacity-type
-    operator: In
-    values: ["on-demand"]
+spec:
+  template:
+    spec:
+      requirements:
+        - key: kubernetes.io/arch
+          operator: In
+          values: ["amd64"]
+        - key: kubernetes.io/os
+          operator: In
+          values: ["linux"]
+        - key: karpenter.sh/capacity-type
+          operator: In
+          values: ["on-demand"]
+        - key: karpenter.k8s.aws/instance-category
+          operator: In
+          values: ["c", "m", "r"]
+        - key: karpenter.k8s.aws/instance-generation
+          operator: Gt
+          values: ["2"]
 ```
-{{% /alert %}}
 
-## spec.weight
+{{% /alert %}}
 
-Karpenter allows you to describe provisioner preferences through a `weight` mechanism similar to how weight is described with [pod and node affinities](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity).
+## spec.template.spec.nodeClassRef
 
-For more information on weighting Provisioners, see the [Weighting Provisioners section](../scheduling#weighting-provisioners) in the scheduling details.
+This field points to the Cloud Provider NodeClass resource. Learn more about [EC2NodeClasses]({{}}).
 
-## spec.kubeletConfiguration
+## spec.template.spec.kubelet
 
 Karpenter provides the ability to specify a few additional Kubelet args. These are all optional and provide support for additional customization and use cases. Adjust these only if you know you need to do so. For more details on kubelet configuration arguments, [see the KubeletConfiguration API specification docs](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/). The implemented fields are a subset of the full list of upstream kubelet configuration arguments. Please cut an issue if you'd like to see another field implemented.
 
 ```yaml
-spec:
-  ...
-  kubeletConfiguration:
-    clusterDNS: ["10.0.1.100"]
-    containerRuntime: containerd
-    systemReserved:
-      cpu: 100m
-      memory: 100Mi
-      ephemeral-storage: 1Gi
-    kubeReserved:
-      cpu: 200m
-      memory: 100Mi
-      ephemeral-storage: 3Gi
-    evictionHard:
-      memory.available: 5%
-      nodefs.available: 10%
-      nodefs.inodesFree: 10%
-    evictionSoft:
-      memory.available: 500Mi
-      nodefs.available: 15%
-      nodefs.inodesFree: 15%
-    evictionSoftGracePeriod:
-      memory.available: 1m
-      nodefs.available: 1m30s
-      nodefs.inodesFree: 2m
-    evictionMaxPodGracePeriod: 60
-    imageGCHighThresholdPercent: 85
-    imageGCLowThresholdPercent: 80
-    cpuCFSQuota: true
-    podsPerCore: 2
-    maxPods: 20
+kubelet:
+  clusterDNS: ["10.0.1.100"]
+  containerRuntime: containerd
+  systemReserved:
+    cpu: 100m
+    memory: 100Mi
+    ephemeral-storage: 1Gi
+  kubeReserved:
+    cpu: 200m
+    memory: 100Mi
+    ephemeral-storage: 3Gi
+  evictionHard:
+    memory.available: 5%
+    nodefs.available: 10%
+    nodefs.inodesFree: 10%
+  evictionSoft:
+    memory.available: 500Mi
+    nodefs.available: 15%
+    nodefs.inodesFree: 15%
+  evictionSoftGracePeriod:
+    memory.available: 1m
+    nodefs.available: 1m30s
+    nodefs.inodesFree: 2m
+  evictionMaxPodGracePeriod: 60
+  imageGCHighThresholdPercent: 85
+  imageGCLowThresholdPercent: 80
+  cpuCFSQuota: true
+  podsPerCore: 2
+  maxPods: 20
 ```
 
 You can specify the container runtime to be either `dockerd` or `containerd`. By default, `containerd` is used.
 
@@ -304,7 +292,7 @@ You can specify the container runtime to be either `dockerd` or `containerd`. By
 
 ### Reserved Resources
 
-Karpenter will automatically configure the system and kube reserved resource requests on the fly on your behalf. These requests are used to configure your node and to make scheduling decisions for your pods. If you have specific requirements or know that you will have additional capacity requirements, you can optionally override the `--system-reserved` configuration defaults with the `.spec.kubeletConfiguration.systemReserved` values and the `--kube-reserved` configuration defaults with the `.spec.kubeletConfiguration.kubeReserved` values.
+Karpenter will automatically configure the system and kube reserved resource requests on the fly on your behalf. These requests are used to configure your node and to make scheduling decisions for your pods. If you have specific requirements or know that you will have additional capacity requirements, you can optionally override the `--system-reserved` configuration defaults with the `.spec.template.spec.kubelet.systemReserved` values and the `--kube-reserved` configuration defaults with the `.spec.template.spec.kubelet.kubeReserved` values.
 
 For more information on the default `--system-reserved` and `--kube-reserved` configuration refer to the [Kubelet Docs](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved)
 
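+For example, a minimal sketch overriding both sets of reserved resources (the values are illustrative, not recommendations):
+
+```yaml
+kubelet:
+  # Reserves capacity for system daemons (passed through as --system-reserved)
+  systemReserved:
+    cpu: 100m
+    memory: 100Mi
+  # Reserves capacity for Kubernetes components (passed through as --kube-reserved)
+  kubeReserved:
+    cpu: 200m
+    memory: 200Mi
+```
+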
@@ -314,38 +302,36 @@ The kubelet supports eviction thresholds by default. When enough memory or file
 
 Kubelet has the notion of [hard evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#hard-eviction-thresholds) and [soft evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#soft-eviction-thresholds). In hard evictions, pods are evicted as soon as a threshold is met, with no grace period to terminate. Soft evictions, on the other hand, provide an opportunity for pods to be terminated gracefully. They do so by sending a termination signal to pods that are planning to be evicted and allowing those pods to terminate up to their grace period.
 
-Karpenter supports [hard evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#hard-eviction-thresholds) through the `.spec.kubeletConfiguration.evictionHard` field and [soft evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#soft-eviction-thresholds) through the `.spec.kubeletConfiguration.evictionSoft` field. `evictionHard` and `evictionSoft` are configured by listing [signal names](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#eviction-signals) with either percentage values or resource values.
+Karpenter supports [hard evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#hard-eviction-thresholds) through the `.spec.template.spec.kubelet.evictionHard` field and [soft evictions](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#soft-eviction-thresholds) through the `.spec.template.spec.kubelet.evictionSoft` field. `evictionHard` and `evictionSoft` are configured by listing [signal names](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#eviction-signals) with either percentage values or resource values.
 
 ```yaml
-spec:
-  ...
-  kubeletConfiguration:
-    evictionHard:
-      memory.available: 500Mi
-      nodefs.available: 10%
-      nodefs.inodesFree: 10%
-      imagefs.available: 5%
-      imagefs.inodesFree: 5%
-      pid.available: 7%
-    evictionSoft:
-      memory.available: 1Gi
-      nodefs.available: 15%
-      nodefs.inodesFree: 15%
-      imagefs.available: 10%
-      imagefs.inodesFree: 10%
-      pid.available: 10%
+kubelet:
+  evictionHard:
+    memory.available: 500Mi
+    nodefs.available: 10%
+    nodefs.inodesFree: 10%
+    imagefs.available: 5%
+    imagefs.inodesFree: 5%
+    pid.available: 7%
+  evictionSoft:
+    memory.available: 1Gi
+    nodefs.available: 15%
+    nodefs.inodesFree: 15%
+    imagefs.available: 10%
+    imagefs.inodesFree: 10%
+    pid.available: 10%
 ```
 
 #### Supported Eviction Signals
 
-| Eviction Signal | Description |
-| --------------- | ----------- |
-| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
-| nodefs.available | nodefs.available := node.stats.fs.available |
-| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
-| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
-| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |
-| pid.available | pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc |
+| Eviction Signal    | Description                                                                      |
+|--------------------|----------------------------------------------------------------------------------|
+| memory.available   | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet  |
+| nodefs.available   | nodefs.available := node.stats.fs.available                                      |
+| nodefs.inodesFree  | nodefs.inodesFree := node.stats.fs.inodesFree                                    |
+| imagefs.available  | imagefs.available := node.stats.runtime.imagefs.available                        |
+| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree                      |
+| pid.available      | pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc            |
 
 For more information on eviction thresholds, view the [Node-pressure Eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction) section of the official Kubernetes docs.
 
@@ -356,64 +342,83 @@ Soft eviction pairs an eviction threshold with a specified grace period. With so
 
 Optionally, you can specify an `evictionMaxPodGracePeriod` which defines the administrator-specified maximum pod termination grace period to use during soft eviction. If a namespace-owner had specified a pod `terminationGracePeriodInSeconds` on pods in their namespace, the minimum of `evictionPodGracePeriod` and `terminationGracePeriodInSeconds` would be used.
 
 ```yaml
-spec:
-  ...
-  kubeletConfiguration:
-    evictionSoftGracePeriod:
-      memory.available: 1m
-      nodefs.available: 1m30s
-      nodefs.inodesFree: 2m
-      imagefs.available: 1m30s
-      imagefs.inodesFree: 2m
-      pid.available: 2m
-    evictionMaxPodGracePeriod: 60
+kubelet:
+  evictionSoftGracePeriod:
+    memory.available: 1m
+    nodefs.available: 1m30s
+    nodefs.inodesFree: 2m
+    imagefs.available: 1m30s
+    imagefs.inodesFree: 2m
+    pid.available: 2m
+  evictionMaxPodGracePeriod: 60
 ```
 
 ### Pod Density
 
+By default, the number of pods on a node is limited by both the number of networking interfaces (ENIs) that may be attached to an instance type and the number of IP addresses that can be assigned to each ENI. See [IP addresses per network interface per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI) for more detailed information on these instance types' limits.
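+For reference, a commonly used formula for this ENI-based ceiling on EKS is `ENIs * (IPv4 addresses per ENI - 1) + 2`; an instance type with 3 ENIs and 10 addresses per ENI, for example, yields 3 * (10 - 1) + 2 = 29 pods.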
+
+{{% alert title="Note" color="primary" %}}
+By default, the VPC CNI allocates IPs for a node and pods from the same subnet. With [VPC CNI Custom Networking](https://aws.github.io/aws-eks-best-practices/networking/custom-networking), the pods will receive IP addresses from another subnet dedicated to pod IPs. This approach makes it easier to manage IP addresses and allows for separate Network Access Control Lists (NACLs) applied to your pods. VPC CNI Custom Networking reduces the pod density of a node since one of the ENI attachments will be used for the node and cannot share the allocated IPs on the interface to pods. Karpenter supports VPC CNI Custom Networking and similar CNI setups where the primary node interface is separated from the pod interfaces through a global [setting](./settings.md#configmap) within the karpenter-global-settings configmap: `aws.reservedENIs`. In the common case, `aws.reservedENIs` should be set to "1" if using Custom Networking.
+{{% /alert %}}
+
+{{% alert title="Windows Support Notice" color="warning" %}}
+It's currently not possible to specify custom networking with Windows nodes.
+{{% /alert %}}
+
 #### Max Pods
 
-By default, AWS will configure the maximum density of pods on a node [based on the node instance type](https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt). For small instances that require an increased pod density or large instances that require a reduced pod density, you can override this default value with `.spec.kubeletConfiguration.maxPods`. This value will be used during Karpenter pod scheduling and passed through to `--max-pods` on kubelet startup.
+For small instances that require an increased pod density or large instances that require a reduced pod density, you can override this ENI-based default value with `.spec.template.spec.kubelet.maxPods`. This value will be used during Karpenter pod scheduling and passed through to `--max-pods` on kubelet startup.
 
 {{% alert title="Note" color="primary" %}}
 When using small instance types, it may be necessary to enable [prefix assignment mode](https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/) in the AWS VPC CNI plugin to support a higher pod density per node. Prefix assignment mode was introduced in AWS VPC CNI v1.9 and allows ENIs to manage a broader set of IP addresses. Much higher pod densities are supported as a result.
{{% /alert %}}
 
+{{% alert title="Windows Support Notice" color="warning" %}}
+Presently, Windows worker nodes do not support using more than one ENI.
+As a consequence, the number of IP addresses, and subsequently, the number of pods that a Windows worker node can support is limited by the number of IPv4 addresses available on the primary ENI.
+Currently, Karpenter will only consider individual secondary IP addresses when calculating the pod density limit.
+{{% /alert %}}
+
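+A minimal sketch overriding `maxPods` (the value `110` is illustrative):
+
+```yaml
+kubelet:
+  maxPods: 110
+```
+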
 #### Pods Per Core
 
-An alternative way to dynamically set the maximum density of pods on a node is to use the `.spec.kubeletConfiguration.podsPerCore` value. Karpenter will calculate the pod density during scheduling by multiplying this value by the number of logical cores (vCPUs) on an instance type. This value will also be passed through to the `--pods-per-core` value on kubelet startup to configure the number of allocatable pods the kubelet can assign to the node instance.
+An alternative way to dynamically set the maximum density of pods on a node is to use the `.spec.template.spec.kubelet.podsPerCore` value. Karpenter will calculate the pod density during scheduling by multiplying this value by the number of logical cores (vCPUs) on an instance type. This value will also be passed through to the `--pods-per-core` value on kubelet startup to configure the number of allocatable pods the kubelet can assign to the node instance.
 
 The value generated from `podsPerCore` cannot exceed `maxPods`, meaning, if both are set, the minimum of the `podsPerCore` dynamic pod density and the static `maxPods` value will be used for scheduling.
 
 {{% alert title="Note" color="primary" %}}
-`maxPods` may not be set in the `kubeletConfiguration` of a Provisioner, but may still be restricted by the `ENI_LIMITED_POD_DENSITY` value. You may want to ensure that the `podsPerCore` value that will be used for instance families associated with the Provisioner will not cause unexpected behavior by exceeding the `maxPods` value.
+`maxPods` may not be set in the `kubelet` of a NodePool, but may still be restricted by the `ENI_LIMITED_POD_DENSITY` value. You may want to ensure that the `podsPerCore` value that will be used for instance families associated with the NodePool will not cause unexpected behavior by exceeding the `maxPods` value.
 {{% /alert %}}
 
 {{% alert title="Pods Per Core on Bottlerocket" color="warning" %}}
-Bottlerocket AMIFamily currently does not support `podsPerCore` configuration. If a Provisioner contains a `provider` or `providerRef` to a node template that will launch a Bottlerocket instance, the `podsPerCore` value will be ignored for scheduling and for configuring the kubelet.
+Bottlerocket AMIFamily currently does not support `podsPerCore` configuration. If a NodePool references a NodeClass that will launch a Bottlerocket instance, the `podsPerCore` value will be ignored for scheduling and for configuring the kubelet.
 {{% /alert %}}
 
-## spec.limits.resources
+## spec.disruption
+
+You can configure Karpenter to disrupt Nodes through your NodePool in multiple ways. You can use `spec.disruption.consolidationPolicy`, `spec.disruption.consolidateAfter` or `spec.disruption.expireAfter`. Read [Disruption]({{}}) for more.
 
-The provisioner spec includes a limits section (`spec.limits.resources`), which constrains the maximum amount of resources that the provisioner will manage.
+## spec.limits
+
+The NodePool spec includes a limits section (`spec.limits`), which constrains the maximum amount of resources that the NodePool will manage.
 
 Karpenter supports limits of any resource type reported by your cloud provider. It limits instance types when scheduling to those that will not exceed the specified limits. If a limit has been exceeded, node provisioning is prevented until some nodes have been terminated.
 
 ```yaml
-apiVersion: karpenter.sh/v1alpha5
-kind: Provisioner
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
 metadata:
   name: default
 spec:
-  requirements:
-    - key: karpenter.sh/capacity-type
-      operator: In
-      values: ["spot"]
+  template:
+    spec:
+      requirements:
+        - key: karpenter.sh/capacity-type
+          operator: In
+          values: ["spot"]
   limits:
-    resources:
-      cpu: 1000
-      memory: 1000Gi
-      nvidia.com/gpu: 2
+    cpu: 1000
+    memory: 1000Gi
+    nvidia.com/gpu: 2
 ```
 
 {{% alert title="Note" color="primary" %}}
@@ -426,59 +431,61 @@ Memory limits are described with a [`BinarySI` value, such as 1000Gi.](https://k
 
 You can view the current consumption of cpu and memory on your cluster by running:
 
 ```
-kubectl get provisioner -o=jsonpath='{.items[0].status}'
+kubectl get nodepool -o=jsonpath='{.items[0].status}'
 ```
 
 Review the [Kubernetes core API](https://github.com/kubernetes/api/blob/37748cca582229600a3599b40e9a82a951d8bbbf/core/v1/resource.go#L23) (`k8s.io/api/core/v1`) for more information on `resources`.
 
-## spec.providerRef
-
-This field points to the cloud provider-specific custom resource. Learn more about [AWSNodeTemplates](../node-templates/).
+## spec.weight
 
-## spec.consolidation
+Karpenter allows you to describe NodePool preferences through a `weight` mechanism similar to how weight is described with [pod and node affinities](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity).
 
-You can configure Karpenter to deprovision instances through your Provisioner in multiple ways. You can use `spec.ttlSecondsAfterEmpty`, `spec.ttlSecondsUntilExpired` or `spec.consolidation.enabled`. Read [Deprovisioning]({{}}) for more.
+For more information on weighting NodePools, see the [Weighting NodePools section]({{}}) in the scheduling details.
 
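+For example, a minimal sketch of a preferred NodePool (the name, weight, and requirements are illustrative); when a pod fits multiple NodePools, the one with the higher weight wins:
+
+```yaml
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
+metadata:
+  name: preferred
+spec:
+  # Higher weights are chosen first when multiple NodePools match a pod
+  weight: 100
+  template:
+    spec:
+      requirements:
+        - key: karpenter.sh/capacity-type
+          operator: In
+          values: ["spot"]
+```
+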
-## Example Use-Cases
+## Examples
 
 ### Isolating Expensive Hardware
 
-A provisioner can be set up to only provision nodes on particular processor types.
+A NodePool can be set up to only provision nodes on particular processor types.
 The following example sets a taint that only allows pods with tolerations for Nvidia GPUs to be scheduled:
 
 ```yaml
-apiVersion: karpenter.sh/v1alpha5
-kind: Provisioner
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
 metadata:
   name: gpu
 spec:
-  consolidation:
-    enabled: true
-  requirements:
-    - key: node.kubernetes.io/instance-type
-      operator: In
-      values: ["p3.8xlarge", "p3.16xlarge"]
-  taints:
-    - key: nvidia.com/gpu
-      value: "true"
-      effect: NoSchedule
+  disruption:
+    consolidationPolicy: WhenUnderutilized
+  template:
+    spec:
+      requirements:
+        - key: node.kubernetes.io/instance-type
+          operator: In
+          values: ["p3.8xlarge", "p3.16xlarge"]
+      taints:
+        - key: nvidia.com/gpu
+          value: "true"
+          effect: NoSchedule
 ```
-In order for a pod to run on a node defined in this provisioner, it must tolerate `nvidia.com/gpu` in its pod spec.
+In order for a pod to run on a node defined in this NodePool, it must tolerate `nvidia.com/gpu` in its pod spec.
 
 ### Cilium Startup Taint
 
 Per the Cilium [docs](https://docs.cilium.io/en/stable/installation/taints/#taint-effects), it's recommended to place a taint of `node.cilium.io/agent-not-ready=true:NoExecute` on nodes to allow Cilium to configure networking prior to other pods starting. This can be accomplished via the use of Karpenter `startupTaints`. These taints are placed on the node, but pods aren't required to tolerate these taints to be considered for provisioning.
 
 ```yaml
-apiVersion: karpenter.sh/v1alpha5
-kind: Provisioner
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
 metadata:
   name: cilium-startup
 spec:
-  consolidation:
-    enabled: true
-  startupTaints:
-    - key: node.cilium.io/agent-not-ready
-      value: "true"
-      effect: NoExecute
+  disruption:
+    consolidationPolicy: WhenUnderutilized
+  template:
+    spec:
+      startupTaints:
+        - key: node.cilium.io/agent-not-ready
+          value: "true"
+          effect: NoExecute
 ```
diff --git a/website/content/en/preview/concepts/pod-density.md b/website/content/en/preview/concepts/pod-density.md
deleted file mode 100644
index 7d1c12f5d326..000000000000
--- a/website/content/en/preview/concepts/pod-density.md
+++ /dev/null
@@ -1,91 +0,0 @@
----
-title: "Control Pod Density"
-linkTitle: "Control Pod Density"
-weight: 6
-description: >
-  Learn ways to specify pod density with Karpenter
----
-
-Pod density is the number of pods per node.
-
-Kubernetes has a default limit of 110 pods per node. If you are using the EKS Optimized AMI on AWS, the [number of pods is limited by instance type](https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt) in the default configuration.
-
-## Increase Pod Density
-
-### Networking Limitations
-
-*☁️ AWS Specific*
-
-By default, the number of pods on a node is limited by both the number of networking interfaces (ENIs) that may be attached to an instance type and the number of IP addresses that can be assigned to each ENI. See [IP addresses per network interface per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI) for a more detailed information on these instance types' limits.
-
-Karpenter can be configured to disable nodes' ENI-based pod density. This is especially useful for small to medium instance types which have a lower ENI-based pod density.
-
-{{% alert title="Note" color="primary" %}}
-When using small instance types, it may be necessary to enable [prefix assignment mode](https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/) in the AWS VPC CNI plugin to more pods per node. Prefix assignment mode was introduced in AWS VPC CNI v1.9 and allows ENIs to manage a broader set of IP addresses. Much higher pod densities are supported as a result.
-{{% /alert %}}
-
-{{% alert title="Windows Support Notice" color="warning" %}}
-Presently, Windows worker nodes do not support using more than one ENI.
-As a consequence, the number of IP addresses, and subsequently, the number of pods that a Windows worker node can support is limited by the number of IPv4 addresses available on the primary ENI.
-At the moment, Karpenter will only consider individual secondary IP addresses when calculating the pod density limit.
-{{% /alert %}}
-
-### Provisioner-Specific Pod Density
-
-#### Static Pod Density
-
-Static pod density can be configured at the provisioner level by specifying `maxPods` within the `.spec.kubeletConfiguration`. All nodes spawned by this provisioner will set this `maxPods` value on their kubelet and will account for this value during scheduling.
-
-See [Provisioner API Kubelet Configuration](../provisioners/#max-pods) for more details.
-
-#### Dynamic Pod Density
-
-Dynamic pod density (density that scales with the instance size) can be configured at the provisioner level by specifying `podsPerCore` within the `.spec.kubeletConfiguration`. Karpenter will calculate the expected pod density for each instance based on the instance's number of logical cores (vCPUs) and will account for this during scheduling.
-
-See [Provisioner API Kubelet Configuration](../provisioners/#pod-density) for more details.
-
-### Controller-Wide Pod Density
-
-{{% alert title="Deprecation Warning" color="warning" %}}
-`AWS_ENI_LIMITED_POD_DENSITY` is deprecated in favor of the `.spec.kubeletConfiguration.maxPods` set at the Provisioner-level
-{{% /alert %}}
-
-Set the environment variable `AWS_ENI_LIMITED_POD_DENSITY: "false"` (or the argument `--aws-eni-limited-pod-density=false`) in the Karpenter controller to allow nodes to host up to 110 pods by default.
-
-Environment variables for the Karpenter controller may be specified as [helm chart values](https://github.com/aws/karpenter/blob/c73f425e924bb64c3f898f30ca5035a1d8591183/charts/karpenter/values.yaml#L15).
-
-### VPC CNI Custom Networking
-
-By default, the VPC CNI allocates IPs for a node and pods from the same subnet. With [VPC CNI Custom Networking](https://aws.github.io/aws-eks-best-practices/networking/custom-networking), the pods will receive IP addresses from another subnet dedicated to pod IPs. This approach makes it easier to manage IP addresses and allows for separate Network Access Control Lists (NACLs) applied to your pods. VPC CNI Custom Networking reduces the pod density of a node since one of the ENI attachments will be used for the node and cannot share the allocated IPs on the interface to pods. Karpenter supports VPC CNI Custom Networking and similar CNI setups where the primary node interface is separated from the pods interfaces through a global [setting](../reference/settings.md#configmap) within the karpenter-global-settings configmap: `aws.reservedENIs`. In the common case, `aws.reservedENIs` should be set to "1" if using Custom Networking.
-
-{{% alert title="Windows Support Notice" color="warning" %}}
-It's currently not possible to specify custom networking with Windows nodes.
-{{% /alert %}}
-
-## Limit Pod Density
-
-Generally, increasing pod density is more efficient. However, some use cases exist for limiting pod density.
-
-### Topology Spread
-
-You can use [topology spread]({{< relref "scheduling.md#topology-spread" >}}) features to reduce blast radius. For example, spreading workloads across EC2 Availability Zones.
-
-
-### Restrict Instance Types
-
-Exclude large instance sizes to reduce the blast radius of an EC2 instance failure.
-
-Consider setting up upper or lower boundaries on target instance sizes with the node.kubernetes.io/instance-type key.
-
-The following example shows how to avoid provisioning large Graviton instances in order to reduce the impact of individual instance failures:
-
-```
--key: node.kubernetes.io/instance-type
-  operator: NotIn
-  values:
-  'm6g.16xlarge'
-  'm6gd.16xlarge'
-  'r6g.16xlarge'
-  'r6gd.16xlarge'
-  'c6g.16xlarge'
-```
diff --git a/website/content/en/preview/troubleshooting.md b/website/content/en/preview/troubleshooting.md
index e16c2d97397a..430b2669d5f1 100644
--- a/website/content/en/preview/troubleshooting.md
+++ b/website/content/en/preview/troubleshooting.md
@@ -255,7 +255,7 @@ spec:
 
 When attempting to schedule a large number of pods with PersistentVolumes, it's possible that these pods will co-locate on the same node.
 Pods will report the following errors in their events using a `kubectl describe pod` call
 
-```console
+```bash
 Warning FailedAttachVolume pod/example-pod AttachVolume.Attach failed for volume "***" : rpc error: code = Internal desc = Could not attach volume "***" to node "***": attachment of disk "***" failed, expected device to be attached but was attaching
 Warning FailedMount pod/example-pod Unable to attach or mount volumes: unmounted volumes=[***], unattached volumes=[***]: timed out waiting for the condition
 ```
 
@@ -266,7 +266,7 @@ In this case, Karpenter may fail to scale-up your nodes due to these pods due to
 
 Karpenter does not support [in-tree storage plugins](https://kubernetes.io/blog/2021/12/10/storage-in-tree-to-csi-migration-status-update/) to provision PersistentVolumes, since nearly all of the in-tree plugins have been deprecated in upstream Kubernetes. This means that, if you are using a statically-provisioned PersistentVolume that references a volume source like `AWSElasticBlockStore` or a dynamically-provisioned PersistentVolume that references a StorageClass with an in-tree storage plugin provisioner like `kubernetes.io/aws-ebs`, Karpenter will fail to discover the maximum volume attachments for the node. Instead, Karpenter may think the node still has more schedulable space due to memory and cpu constraints when there is really no more schedulable space on the node due to volume limits. When Karpenter sees you are using an in-tree storage plugin on your pod volumes, it will print the following error message into the logs. If you see this message, upgrade your StorageClasses and statically-provisioned PersistentVolumes to use the latest CSI drivers for your cloud provider.
 
-```console
+```bash
 2023-04-05T23:56:53.363Z ERROR controller.node_state PersistentVolume source 'AWSElasticBlockStore' uses an in-tree storage plugin which is unsupported by Karpenter and is deprecated by Kubernetes. Scale-ups may fail because Karpenter will not discover driver limits. Use a PersistentVolume that references the 'CSI' volume source for Karpenter auto-scaling support. {"commit": "b2af562", "node": "ip-192-168-36-137.us-west-2.compute.internal", "pod": "inflate0-6c4bdb8b75-7qmfd", "volume": "mypd", "persistent-volume": "pvc-11db7489-3c6e-46f3-a958-91f9d5009d41"}
 2023-04-05T23:56:53.464Z ERROR controller.node_state StorageClass .spec.provisioner uses an in-tree storage plugin which is unsupported by Karpenter and is deprecated by Kubernetes. Scale-ups may fail because Karpenter will not discover driver limits. Create a new StorageClass with a .spec.provisioner referencing the CSI driver plugin name 'ebs.csi.aws.com'. {"commit": "b2af562", "node": "ip-192-168-36-137.us-west-2.compute.internal", "pod": "inflate0-6c4bdb8b75-7qmfd", "volume": "mypd", "storage-class": "gp2", "provisioner": "kubernetes.io/aws-ebs"}
 ```
 
@@ -299,7 +299,7 @@ To avoid this discrepancy between `maxPods` and the supported pod density of the
 2. Reduce your `maxPods` value to be under the maximum pod density for the instance types assigned to your Provisioner
 3. Remove the `maxPods` value from your [`kubeletConfiguration`]({{}}) if you no longer need it and instead rely on the defaulted values from Karpenter and EKS AMIs.
 
-For more information on pod density, view the [Pod Density Conceptual Documentation]({{}}).
+For more information on pod density, view the [Pod Density Section in the NodePools doc]({{}}).
 
 #### IP exhaustion in a subnet
diff --git a/website/content/en/preview/upgrade-guide.md b/website/content/en/preview/upgrade-guide.md
index 37e590b54031..4a4e3772432c 100644
--- a/website/content/en/preview/upgrade-guide.md
+++ b/website/content/en/preview/upgrade-guide.md
@@ -194,7 +194,7 @@ Karpenter marks CloudProvider capacity as "managed by" a Machine using the `karp
 
 ### Upgrading to v0.27.3+
 
 * The `defaulting.webhook.karpenter.sh` mutating webhook was removed in `v0.27.3`. If you are coming from an older version of Karpenter where this webhook existed and the webhook was not managed by Helm, you may need to delete the stale webhook.
 
-```console
+```bash
 kubectl delete mutatingwebhookconfigurations defaulting.webhook.karpenter.sh
 ```
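+
+To check whether the stale webhook is present before deleting it (or to confirm it is gone afterwards), you can list the cluster's mutating webhook configurations:
+
+```bash
+kubectl get mutatingwebhookconfigurations
+```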