Add runbooks for observability controller alerts #280 (Merged)

docs/runbooks/HAControlPlaneDown.md

# HAControlPlaneDown

## Meaning

A control plane node has been detected as not ready for more than 5 minutes.

## Impact

When a control plane node is down, it affects the high availability and
redundancy of the Kubernetes control plane. This can negatively impact:
- API server availability
- Controller manager operations
- Scheduler functionality
- etcd cluster health (if etcd is co-located)

## Diagnosis

1. Check the status of all control plane nodes:
```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=''
```

2. Get detailed information about the affected node:
```bash
kubectl describe node <node-name>
```

3. Review system logs on the affected node:
```bash
ssh <node-address>
```

```bash
journalctl -xeu kubelet
```

## Mitigation

1. Check node resources:
- Verify CPU, memory, and disk usage; sustained utilization near 100%, or node conditions such as `MemoryPressure` or `DiskPressure`, indicate a resource problem
```bash
# Check the node's CPU and memory resource usage
kubectl top node <node-name>
```

```bash
# Check the node's conditions for DiskPressure
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type == "DiskPressure")'
```
- Clear disk space if necessary (see the example below)
- Restart the kubelet if resource issues are resolved
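
A sketch of common ways to free disk space on a node, assuming a CRI runtime managed with `crictl` and systemd journald (adjust to your environment):

```bash
# Confirm which filesystem is full
df -h

# Remove container images that are not used by any running container
crictl rmi --prune

# Cap the journal size to reclaim space used by old logs
journalctl --vacuum-size=500M
```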

2. If the node is unreachable:
- Verify network connectivity
- Check physical/virtual machine status
- Ensure the node has power and is running
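
Basic reachability checks from another machine can help distinguish a network problem from a host problem (a sketch; `<node-address>` is a placeholder, and 10250 is the default kubelet port):

```bash
# Test basic IP reachability
ping -c 4 <node-address>

# Check whether the kubelet port is accepting connections
nc -zv <node-address> 10250
```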

3. If the kubelet is generating errors:
```bash
systemctl status kubelet
```

```bash
systemctl restart kubelet
```

4. If the node cannot be recovered:
- If possible, safely drain the node
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- Investigate hardware/infrastructure issues
- Consider replacing the node if necessary
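
If you decide to replace the node rather than repair it, remove it from the cluster once it has been drained (a sketch; the replacement procedure itself depends on your infrastructure):

```bash
# Remove the failed node object from the cluster after draining it
kubectl delete node <node-name>
```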

## Additional notes
- Maintain at least three control plane nodes for high availability
- Monitor etcd cluster health if the affected node runs etcd (see the example below)
- Record infrastructure-specific recovery steps in your team's operational documentation so they can be reused during future incidents
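
A minimal way to check etcd health, assuming a kubeadm-style cluster where etcd runs as static pods labeled `component=etcd` and uses the default certificate paths (adjust the namespace, labels, and paths for your distribution):

```bash
# List etcd pods and confirm they are Running
kubectl get pods -n kube-system -l component=etcd

# Query endpoint health from inside the etcd pod on the affected node
kubectl exec -n kube-system etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
```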

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

docs/runbooks/HighCPUWorkload.md

# HighCPUWorkload

## Meaning

This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.

## Impact

High CPU utilization can lead to:
- Degraded performance of applications running on the node
- Increased latency in request processing
- Potential service disruptions if CPU usage continues to climb

## Diagnosis

1. Identify the affected node:
```bash
kubectl get nodes
```

2. Check the node's resource usage:
```bash
kubectl describe node <node-name>
```
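
`kubectl describe node` reports resource requests and limits rather than live usage; if a metrics server is installed, `kubectl top` shows actual consumption:

```bash
# Show current CPU and memory usage for the node (requires metrics-server)
kubectl top node <node-name>
```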

3. List pods that consume high amounts of CPU:
```bash
kubectl top pods --all-namespaces --sort-by=cpu
```

4. Investigate specific pod details if needed:
```bash
kubectl describe pod <pod-name> -n <namespace>
```

## Mitigation

1. If the issue was caused by a malfunctioning pod:
- Consider restarting the pod
- Check pod logs for anomalies
- Review pod resource limits and requests
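
A sketch of those checks, assuming the pod is managed by a controller such as a Deployment so that deleting it triggers a clean restart (`<pod-name>` and `<namespace>` are placeholders):

```bash
# Inspect recent logs for crash loops or runaway work
kubectl logs <pod-name> -n <namespace> --tail=100

# Review the pod's CPU requests and limits
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[*].resources}'

# Delete the pod so its controller recreates it
kubectl delete pod <pod-name> -n <namespace>
```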

2. If the issue is system-wide:
- Check for system processes that consume high amounts of CPU
- Consider cordoning the node and migrating workloads
- Evaluate if node scaling is needed
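
For a node-wide investigation, the following sketch cordons the node and looks for the heaviest processes on the host (assumes SSH access to the node):

```bash
# Prevent new pods from being scheduled onto the node
kubectl cordon <node-name>

# On the node itself, list the processes using the most CPU
ssh <node-address>
ps aux --sort=-%cpu | head -n 15
```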

3. Long-term solutions to avoid the issue:
- Implement or adjust pod resource limits
- Consider horizontal pod autoscaling
- Evaluate cluster capacity and scaling needs
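
As one example of horizontal pod autoscaling, a Deployment can be scaled on CPU with `kubectl autoscale` (a hypothetical deployment name; requires a metrics server):

```bash
# Scale between 2 and 10 replicas, targeting 80% average CPU utilization
kubectl autoscale deployment <deployment-name> --cpu-percent=80 --min=2 --max=10
```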

## Additional notes
- Monitor the node after mitigation to ensure CPU usage returns to normal
- Review application logs for potential root causes
- Consider updating resource requests/limits if this is a recurring issue

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

docs/runbooks/NodeNetworkInterfaceDown.md

# NodeNetworkInterfaceDown

## Meaning

This alert fires when one or more network interfaces on a node have been down
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
bridge tunnels.

## Impact

Network interface failures can lead to:
- Reduced network connectivity for pods on the affected node
- Potential service disruptions if critical network paths are affected
- Degraded cluster communication if management interfaces are impacted

## Diagnosis

1. Identify the affected node and interfaces:
```bash
kubectl get nodes
```

```bash
ssh <node-address>
```

```bash
ip link show | grep -i down
```

2. Check network interface details:
```bash
ip addr show
```

```bash
ethtool <interface-name>
```

3. Review system logs for network-related issues:
```bash
journalctl -u NetworkManager
```

```bash
dmesg | grep -i eth
```

## Mitigation

1. For physical interface issues:
- Check physical cable connections
- Verify switch port configuration
- Test the interface with a different cable/port
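
From the node itself, the kernel's view of the link can confirm whether a cable or switch-port problem is likely (a sketch; `<interface-name>` is a placeholder):

```bash
# "down" here usually points to a physical-layer problem
cat /sys/class/net/<interface-name>/operstate

# ethtool also reports whether a link is detected
ethtool <interface-name> | grep -i "link detected"
```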

2. For software or configuration issues:
```bash
# Restart NetworkManager
systemctl restart NetworkManager
```

```bash
# Bring interface up manually
ip link set <interface-name> up
```

3. If the issue persists:
- Check network interface configuration files
- Verify driver compatibility
- If the failure is on a physical interface, consider hardware replacement
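
A sketch of checking the configuration and driver side, assuming the node uses NetworkManager:

```bash
# List connection profiles and the devices they are bound to
nmcli connection show

# Show per-device state as NetworkManager sees it
nmcli device status

# Report the driver and firmware version in use for the interface
ethtool -i <interface-name>
```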

## Additional notes
- Monitor interface status after mitigation
- Document any hardware replacements or configuration changes
- Consider implementing network redundancy for critical interfaces

<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->