docs: add docs with new taint (aws#4950)
njtran authored Oct 27, 2023
1 parent 84e94a3 commit 8f604a7
Showing 1 changed file with 33 additions and 27 deletions.
website/content/en/preview/concepts/disruption.md

## Control Flow

Karpenter sets a Kubernetes [finalizer](https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/) on each node and node claim it provisions.
The finalizer blocks deletion of the node object while the Termination Controller taints and drains the node, before removing the underlying NodeClaim. Disruption is triggered by the Disruption Controller, by the user through manual disruption, or through an external system that sends a delete request to the node object.
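
If you want to confirm the finalizer is in place on a given node and its NodeClaim, a quick check along these lines works (a sketch; `$NODE_NAME` and `$NODECLAIM_NAME` are placeholders, and the exact finalizer string may vary by Karpenter version):

```bash
# Print the finalizers Karpenter set on the node and its NodeClaim
kubectl get node "$NODE_NAME" -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl get nodeclaim "$NODECLAIM_NAME" -o jsonpath='{.metadata.finalizers}{"\n"}'
```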

### Disruption Controller

Karpenter automatically discovers disruptable nodes and spins up replacements when needed. For each automated disruption method, Karpenter will:
1. Identify a list of prioritized candidates for the disruption method.
   * If there are [pods that cannot be evicted](#pod-eviction) on the node, Karpenter will ignore the node and try disrupting it later.
   * If there are no disruptable nodes, continue to the next disruption method.
2. For each disruptable node, execute a scheduling simulation with the pods on the node to find if any replacement nodes are needed.
3. Add the `karpenter.sh/disruption:NoSchedule` taint to the node(s) to prevent pods from scheduling to it.
4. Pre-spin any replacement nodes needed as calculated in Step (2), and wait for them to become ready.
   * If a replacement node fails to initialize, un-taint the node(s), and restart from Step (1), starting at the first disruption method again.
5. Delete the node(s) and wait for the Termination Controller to gracefully shutdown the node(s).
6. Once the Termination Controller terminates the node, go back to Step (1), starting at the first disruption method again.
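
While a node is being disrupted, the `karpenter.sh/disruption` taint added in Step (3) is visible on the node object. A minimal way to spot tainted nodes (a sketch; the output columns are illustrative):

```bash
# List each node's taint keys and filter for the Karpenter disruption taint
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINT_KEYS:.spec.taints[*].key' \
  | grep karpenter.sh/disruption
```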

### Termination Controller

When a Karpenter node is deleted, the Karpenter finalizer will block deletion and the APIServer will set the `DeletionTimestamp` on the node, allowing Karpenter to gracefully shutdown the node, modeled after [Kubernetes Graceful Node Shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown). Karpenter's graceful shutdown process will:
1. Add the `karpenter.sh/disruption:NoSchedule` taint to the node to prevent pods from scheduling to it.
2. Begin evicting the pods on the node with the [Kubernetes Eviction API](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/) to respect PDBs, while ignoring all daemonset pods and [static pods](https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/). Wait for the node to be fully drained before proceeding to Step (3).
   * While waiting, if the underlying NodeClaim for the node no longer exists, remove the finalizer to allow the APIServer to delete the node, completing termination.
3. Terminate the NodeClaim in the Cloud Provider.
4. Remove the finalizer from the node to allow the APIServer to delete the node, completing termination.
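
Because pods are removed through the Eviction API, PodDisruptionBudgets can slow or temporarily block the drain in Step (2). A couple of hedged commands for observing this phase (`$NODE_NAME` is a placeholder):

```bash
# Review PodDisruptionBudgets that the Eviction API will respect during the drain
kubectl get pdb --all-namespaces

# Follow events recorded against the node as it drains and terminates
kubectl get events --field-selector involvedObject.name="$NODE_NAME" --watch
```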

## Manual Methods
* **Node Deletion**: You can use `kubectl` to manually remove a single Karpenter node or nodeclaim. Since each Karpenter node is owned by a NodeClaim, deleting either the node or the nodeclaim will cause cascade deletion of the other:

```bash
# Delete a specific nodeclaim
kubectl delete nodeclaim $NODECLAIM_NAME

# Delete a specific node
kubectl delete node $NODE_NAME

# Delete all nodeclaims
kubectl delete nodeclaims --all

# Delete all nodes owned by any nodepool
kubectl delete nodes -l karpenter.sh/nodepool

# Delete all nodes owned by a specific nodepool
kubectl delete nodes -l karpenter.sh/nodepool=$NODEPOOL_NAME

# Delete all nodeclaims owned by a specific nodepool
kubectl delete nodeclaims -l karpenter.sh/nodepool=$NODEPOOL_NAME
```
* **NodePool Deletion**: NodeClaims are owned by the NodePool through an [owner reference](https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications) that launched them. Karpenter will gracefully terminate nodes through cascading deletion when the owning NodePool is deleted.
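
For example, deleting a NodePool cascades down to everything it owns (a sketch; `$NODEPOOL_NAME` is a placeholder):

```bash
# Deleting the NodePool gracefully terminates the NodeClaims (and nodes) it owns
kubectl delete nodepool "$NODEPOOL_NAME"
```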

{{% alert title="Note" color="primary" %}}
By adding the finalizer, Karpenter improves the default Kubernetes process of node deletion.
When you run `kubectl delete node` on a node without a finalizer, the node is deleted without triggering the finalization logic; the underlying instance will continue running in EC2 even though the node object is gone.
{{% /alert %}}

* **Expiration**: Karpenter will mark nodes as expired and disrupt them after they have lived a set number of seconds, based on the NodePool's `spec.disruption.expireAfter` value. You can use node expiry to periodically recycle nodes due to security concerns.
* [**Consolidation**]({{<ref "#consolidation" >}}): Karpenter works to actively reduce cluster cost by identifying when:
  * Nodes can be removed because the node is empty.
  * Nodes can be removed as their workloads will run on other nodes in the cluster.
  * Nodes can be replaced with cheaper variants due to a change in the workloads.
* [**Drift**]({{<ref "#drift" >}}): Karpenter will mark nodes as drifted and disrupt nodes that have drifted from their desired specification. See [Drift]({{<ref "#drift" >}}) to see which fields are considered.
* [**Interruption**]({{<ref "#interruption" >}}): Karpenter will watch for upcoming interruption events that could affect your nodes (health events, spot interruption, etc.) and will taint, drain, and terminate the node(s) ahead of the event to reduce workload disruption.
{{% alert title="Defaults" color="secondary" %}}
Disruption is configured through the NodePool's disruption block by the `consolidationPolicy`, `expireAfter` and `consolidateAfter` fields. If these fields are not set, Karpenter will configure them with default values.
{{% /alert %}}
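
To see the disruption settings actually in effect for a NodePool, including any defaulted values, you can read its disruption block directly (a sketch; `$NODEPOOL_NAME` is a placeholder):

```bash
# Print the consolidationPolicy, consolidateAfter, and expireAfter for a NodePool
kubectl get nodepool "$NODEPOOL_NAME" -o jsonpath='{.spec.disruption}{"\n"}'
```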
For spot nodes, Karpenter only uses the deletion consolidation mechanism. It will not replace a spot node with a cheaper spot node.
### Drift
Drift on most fields is only triggered by changes to the owning CustomResource. Some special cases will be reconciled two-ways, triggered by NodeClaim/Node/Instance changes or NodePool/EC2NodeClass changes. For one-way reconciliation, values in the CustomResource are reflected in the NodeClaim in the same way that they're set. A NodeClaim will be detected as drifted if the values in the CRDs do not match the values in the NodeClaim. By default, fields use one-way reconciliation.
#### Two-way Reconciliation
Two-way reconciliation can correspond to multiple values and must be handled differently. Two-way reconciliation can create cases where drift occurs without changes to CRDs, or where CRD changes do not result in drift. For example, if a NodeClaim has `node.kubernetes.io/instance-type: m5.large`, and requirements change from `node.kubernetes.io/instance-type In [m5.large]` to `node.kubernetes.io/instance-type In [m5.large, m5.2xlarge]`, the NodeClaim will not be drifted because its value is still compatible with the new requirements. Conversely, for an AWS installation, if a NodeClaim is using an AMI `ami: ami-abc` but a new image is published, Karpenter's `AWSNodeTemplate.amiSelector` will discover that the new correct value is `ami: ami-xyz`, and detect the NodeClaim as drifted.
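
To check which NodeClaims Karpenter currently considers drifted, one hedged approach is to read the NodeClaim status conditions (this assumes a condition of type `Drifted`, which may be named differently in your version, and requires `jq`):

```bash
# List NodeClaims that report a Drifted status condition (condition name is an assumption)
kubectl get nodeclaims -o json \
  | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Drifted" and .status == "True")) | .metadata.name'
```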
#### Behavioral Fields
Behavioral Fields are treated as over-arching settings on the NodePool to dictate how Karpenter behaves. These fields don't correspond to a desired state on the NodeClaim or instance, so they are not considered for drift.
Read the [Drift Design](https://github.com/aws/karpenter-core/blob/main/designs/drift.md) for more.
##### NodePool
| Fields | One-way | Two-way |
|----------------------------| :---: | :---: |
| Startup Taints | x | |
| Taints | x | |
| Labels | x | |
| Annotations | x | |
| Node Requirements | | x |
| Kubelet Configuration | x | |
__Behavioral Fields__
- Weight
- Limits
- ConsolidationPolicy
- ConsolidateAfter
- ExpireAfter
---
##### EC2NodeClass
| Fields | One-way | Two-way |
|-------------------------------|:-------:|:-------:|
### Interruption
If interruption-handling is enabled, Karpenter will watch for upcoming involuntary interruption events that would cause disruption to your workloads. These interruption events include:
* Instance Terminating Events
* Instance Stopping Events
When Karpenter detects one of these events will occur to your nodes, it automatically taints, drains, and terminates the node(s) ahead of the interruption event to give the maximum amount of time for workload cleanup prior to compute disruption. This enables scenarios where the `terminationGracePeriod` for your workloads may be long or cleanup for your workloads is critical, and you want enough time to be able to gracefully clean-up your pods.
For Spot interruptions, the NodePool will start a new node as soon as it sees the Spot interruption warning. Spot interruptions have a __2 minute notice__ before Amazon EC2 reclaims the instance. Karpenter's average node startup time means that, generally, there is sufficient time for the new node to become ready and for the pods to be moved to the new node before the underlying instance is reclaimed.
{{% alert title="Note" color="primary" %}}
Karpenter publishes Kubernetes events to the node for all events listed above in addition to __Spot Rebalance Recommendations__. Karpenter does not currently support taint, drain, and terminate logic for Spot Rebalance Recommendations.
{{% /alert %}}
Karpenter enables this feature by watching an SQS queue which receives critical events from AWS services which may affect your nodes. Karpenter requires that an SQS queue be provisioned and EventBridge rules and targets be added that forward interruption events from AWS services to the SQS queue. Karpenter provides details for provisioning this infrastructure in the [CloudFormation template in the Getting Started Guide](../../getting-started/getting-started-with-karpenter/#create-the-karpenter-infrastructure-and-iam-roles).
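
As a quick sanity check that this infrastructure exists, you can look up the queue with the AWS CLI (this assumes the Getting Started convention of naming the queue after the cluster; your queue name may differ):

```bash
# Confirm the interruption queue is reachable (queue name is an assumption)
aws sqs get-queue-url --queue-name "${CLUSTER_NAME}"
```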
