Add runbooks for observability controller alerts
- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça <[email protected]>
1 parent 8c69bc4, commit b4fcc01
Showing 3 changed files with 207 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
# HAControlPlaneDown | ||
|
||
## Meaning | ||
|
||
This alert fires when a control plane node has been in the `NotReady` state for more than 5 minutes. | ||
|
||
## Impact | ||
|
||
When a control plane node is down, it affects the high availability and | ||
redundancy of the Kubernetes control plane. This can impact: | ||
- API server availability | ||
- Controller manager operations | ||
- Scheduler functionality | ||
- etcd cluster health (if etcd is co-located) | ||
|
||
## Diagnosis | ||
|
||
1. Check the status of all control plane nodes: | ||
```bash | ||
kubectl get nodes -l node-role.kubernetes.io/control-plane='' | ||
``` | ||
|
||
2. Get detailed information about the affected node: | ||
```bash | ||
kubectl describe node <node-name> | ||
``` | ||
|
||
3. Review system logs on the affected node: | ||
```bash | ||
ssh <node-address> | ||
journalctl -xeu kubelet | ||
``` | ||
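
When several control plane nodes are listed, it can help to filter the output of step 1 down to the nodes that are not ready. The following is a minimal sketch, not part of the original runbook: `not_ready_nodes` is a hypothetical helper name, and it assumes the default `kubectl get nodes` table layout with a header row and the status in the second column.

```bash
# Hypothetical helper (not part of kubectl): reads `kubectl get nodes` output
# on stdin and prints the names of nodes whose STATUS is not "Ready".
# "Ready,SchedulingDisabled" (a cordoned but healthy node) is treated as ready.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1 }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get nodes -l node-role.kubernetes.io/control-plane='' | not_ready_nodes
```

Each printed name can then be passed to `kubectl describe node <node-name>` as in step 2.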
|
||
## Mitigation | ||
|
||
1. Check node resources: | ||
- Verify CPU, memory, and disk usage | ||
- Clear disk space if necessary | ||
- Restart the kubelet once resource issues are resolved | ||
|
||
2. If the node is unreachable: | ||
- Verify network connectivity | ||
- Check physical/virtual machine status | ||
- Ensure node has power and is running | ||
|
||
3. For kubelet issues: | ||
```bash | ||
systemctl status kubelet | ||
systemctl restart kubelet | ||
``` | ||
|
||
4. If the node cannot be recovered: | ||
- If possible, safely drain the node | ||
- Investigate hardware/infrastructure issues | ||
- Consider replacing the node if necessary | ||
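
For the disk usage check in step 1, a small filter over `df -P` output can highlight nearly full filesystems on the node. This is a sketch under stated assumptions: `full_filesystems` is a hypothetical helper, the 85% default threshold is an arbitrary illustration rather than a value from the alert, and mount points containing spaces are not handled.

```bash
# Hypothetical helper: reads `df -P` output on stdin and prints the mount point
# and usage of every filesystem at or above a usage threshold (percent).
# The default of 85 is an assumed example value, not taken from the alert rule.
full_filesystems() {
  awk -v t="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)              # strip the trailing percent sign
    if (use + 0 >= t + 0) print $6, $5
  }'
}

# Usage on the affected node:
#   df -P | full_filesystems 90
```

The POSIX output format (`-P`) keeps each filesystem on a single line, which is what makes the column positions predictable here.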
|
||
## Additional Notes | ||
- Maintain at least three control plane nodes for high availability | ||
- Monitor etcd cluster health if the affected node runs etcd | ||
- Document any infrastructure-specific recovery procedures | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# HighCPUWorkload | ||
|
||
## Meaning | ||
|
||
This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes. | ||
|
||
## Impact | ||
|
||
High CPU utilization can lead to: | ||
- Degraded performance of applications running on the node | ||
- Increased latency in request processing | ||
- Potential service disruptions if CPU usage continues to climb | ||
|
||
## Diagnosis | ||
|
||
1. Identify the affected node: | ||
```bash | ||
kubectl get nodes | ||
``` | ||
|
||
2. Check node resource usage: | ||
```bash | ||
kubectl describe node <node-name> | ||
``` | ||
|
||
3. List pods consuming high CPU: | ||
```bash | ||
kubectl top pods --all-namespaces --sort-by=cpu | ||
``` | ||
|
||
4. Investigate specific pod details if needed: | ||
```bash | ||
kubectl describe pod <pod-name> -n <namespace> | ||
``` | ||
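
To narrow the listing from step 3 to the heaviest consumers, the output of `kubectl top pods` can be filtered by a CPU threshold. The sketch below is illustrative only: `pods_over_cpu` is a hypothetical helper, the 500m default is an assumed example value, and it expects the header row and column order (`NAMESPACE NAME CPU(cores) MEMORY(bytes)`) produced with `--all-namespaces`.

```bash
# Hypothetical helper (not part of kubectl): reads `kubectl top pods --all-namespaces`
# output on stdin and prints pods whose CPU usage exceeds a millicore threshold.
# The default threshold of 500 millicores is an assumed example value.
pods_over_cpu() {
  awk -v t="${1:-500}" '
    NR > 1 {
      cpu = $3
      if (cpu ~ /m$/) {
        sub(/m$/, "", cpu)
        m = cpu + 0          # value given in millicores, e.g. "250m"
      } else {
        m = cpu * 1000       # value given in whole cores, e.g. "2"
      }
      if (m > t + 0) print $1 "/" $2, $3
    }'
}

# Usage (requires the metrics API to be available in the cluster):
#   kubectl top pods --all-namespaces --sort-by=cpu | pods_over_cpu 1000
```

Each output line has the form `<namespace>/<pod-name> <cpu>`, which maps directly onto the arguments of the `kubectl describe pod` command in step 4.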
|
||
## Mitigation | ||
|
||
1. If caused by a misbehaving pod: | ||
- Consider restarting the pod | ||
- Check pod logs for anomalies | ||
- Review pod resource limits and requests | ||
|
||
2. If system-wide: | ||
- Check for system processes consuming high CPU | ||
- Consider cordoning the node and migrating workloads | ||
- Evaluate if node scaling is needed | ||
|
||
3. Long-term solutions: | ||
- Implement or adjust pod resource limits | ||
- Consider horizontal pod autoscaling | ||
- Evaluate cluster capacity and scaling needs | ||
|
||
## Additional Notes | ||
- Monitor the node after mitigation to ensure CPU usage returns to normal | ||
- Review application logs for potential root causes | ||
- Consider updating resource requests/limits if this is a recurring issue | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# NodeNetworkInterfaceDown | ||
|
||
## Meaning | ||
|
||
This alert fires when one or more network interfaces on a node have been down | ||
for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and | ||
bridge tunnels. | ||
|
||
## Impact | ||
|
||
Network interface failures can lead to: | ||
- Reduced network connectivity for pods on the affected node | ||
- Potential service disruptions if critical network paths are affected | ||
- Degraded cluster communication if management interfaces are impacted | ||
|
||
## Diagnosis | ||
|
||
1. Identify the affected node and interfaces: | ||
```bash | ||
kubectl get nodes | ||
ssh <node-address> | ||
ip link show | grep -i down | ||
``` | ||
|
||
2. Check network interface details: | ||
```bash | ||
ip addr show | ||
ethtool <interface-name> | ||
``` | ||
|
||
3. Review system logs for network-related issues: | ||
```bash | ||
journalctl -u NetworkManager | ||
dmesg | grep -i eth | ||
``` | ||
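
Because `grep -i down` in step 1 also matches veth devices, which the alert excludes, a stricter filter may be useful. The following is a sketch, assuming the one-line-per-interface format of `ip -o link show`; `down_interfaces` is a hypothetical helper, and the default exclusion pattern covers only `veth` devices. The device names the alert treats as bridge tunnels are not stated here, so pass your own pattern if you need to exclude them.

```bash
# Hypothetical helper: reads `ip -o link show` output on stdin and prints the
# names of interfaces reported in state DOWN, skipping names that match an
# exclusion regex (default: names starting with "veth").
down_interfaces() {
  awk -v ex="${1:-^veth}" '/ state DOWN / {
    name = $2
    sub(/:$/, "", name)     # drop the trailing colon after the name
    sub(/@.*/, "", name)    # drop the peer suffix, e.g. "@if3"
    if (name !~ ex) print name
  }'
}

# Usage on the affected node:
#   ip -o link show | down_interfaces
```

Each reported name can be passed to `ethtool <interface-name>` as in step 2.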
|
||
## Mitigation | ||
|
||
1. For physical interface issues: | ||
- Check physical cable connections | ||
- Verify switch port configuration | ||
- Test the interface with a different cable or port | ||
|
||
2. For software/configuration issues: | ||
```bash | ||
# Restart NetworkManager | ||
systemctl restart NetworkManager | ||
|
||
# Bring interface up manually | ||
ip link set <interface-name> up | ||
``` | ||
|
||
3. If persistent: | ||
- Check network interface configuration files | ||
- Verify driver compatibility | ||
- Consider replacing the hardware if there is a physical failure | ||
|
||
## Additional Notes | ||
- Monitor interface status after mitigation | ||
- Document any hardware replacements or configuration changes | ||
- Consider implementing network redundancy for critical interfaces | ||
|
||
<!--DS: If you cannot resolve the issue, log in to the | ||
link:https://access.redhat.com[Customer Portal] and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure.--> | ||
<!--USstart--> | ||
If you cannot resolve the issue, see the following resources: | ||
|
||
- [OKD Help](https://www.okd.io/help/) | ||
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
<!--USend--> |