Add runbook for KubeMemoryOvercommit #39

33 changes: 33 additions & 0 deletions content/runbooks/kubernetes/KubeMemoryOvercommit.md
# KubeMemoryOvercommit

## Meaning

This alert fires when the cluster does not have enough total memory to tolerate the failure of its largest node.

<details>
<summary>Full context</summary>

Each pod requests a certain amount of memory via the pod spec field `resources.requests.memory`. This value can also be
found via the metric `kube_pod_container_resource_requests{resource="memory"}`. If a node fails, some pods may not be
rescheduled due to a lack of resources. It is therefore recommended that the cluster have enough total resources to
tolerate the failure of its largest node, at least until that node is replaced.

This alert is calculated by comparing the total memory requested by the pods to the total memory available in the cluster minus the
amount of memory on the largest node.
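The comparison above can be sketched as a small worked example (illustrative Python only, not the actual alerting rule; the node sizes and request totals are made up):

```python
def memory_overcommitted(node_allocatable, total_requested):
    """Return True if pod memory requests exceed the cluster's capacity
    after losing the largest node.

    node_allocatable: per-node allocatable memory in bytes
    total_requested: sum of all pod memory requests in bytes
    """
    capacity_after_failure = sum(node_allocatable) - max(node_allocatable)
    return total_requested > capacity_after_failure


GiB = 1024 ** 3

# Three 8 GiB nodes with 18 GiB requested: losing one node leaves
# only 16 GiB, so the alert condition holds.
print(memory_overcommitted([8 * GiB] * 3, 18 * GiB))  # True
```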

</details>

## Impact

This alert has no immediate impact; however, if a node fails, cluster availability will likely be affected.

## Diagnosis

Check the number and types of nodes in the cluster to decide whether an additional node is needed. The alert can also be
caused by an imbalance between node groups. For example, if a single large node is running a pod with a large memory
request, that pod may not be schedulable anywhere else if the large node fails.
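The node-group imbalance above can be illustrated with a hypothetical check (illustrative Python; a pod fits after a failure only if some surviving node has enough free memory for it):

```python
def reschedulable(pod_request, surviving_nodes_free):
    """Return True if a single pod can fit on any surviving node.

    pod_request: the pod's memory request in bytes
    surviving_nodes_free: free allocatable memory on each remaining node
    """
    return any(free >= pod_request for free in surviving_nodes_free)


GiB = 1024 ** 3

# One 16 GiB node plus four 2 GiB nodes: if the large node fails,
# a pod requesting 6 GiB cannot fit on any remaining node, even though
# the four small nodes have 8 GiB free in total.
print(reschedulable(6 * GiB, [2 * GiB] * 4))  # False
```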

## Mitigation

Adding another node (of the largest type in the cluster) or reducing pod memory requests will normally resolve this issue.
Alternatively, if there are multiple node groups of different types, it may be possible to rebalance the cluster by adding a
large node and removing some small nodes.