Add runbooks for observability controller alerts

- HAControlPlaneDown
- NodeNetworkInterfaceDown
- HighCPUWorkload

Signed-off-by: João Vilaça <[email protected]>
kubevirt · Jan 21, 2025 · b4fcc01
1 parent 8c69bc4
commit b4fcc01

Showing 3 changed files with 207 additions and 0 deletions.
diff --git a/docs/runbooks/HAControlPlaneDown.md b/docs/runbooks/HAControlPlaneDown.md
@@ -0,0 +1,70 @@
+# HAControlPlaneDown
+
+## Meaning
+
+A control plane node has been in the `NotReady` state for more than 5 minutes.
+
+## Impact
+
+When a control plane node is down, it affects the high availability and
+redundancy of the Kubernetes control plane. This can impact:
+- API server availability
+- Controller manager operations
+- Scheduler functionality
+- etcd cluster health (if etcd is co-located)
+
+## Diagnosis
+
+1. Check the status of all control plane nodes:
+   ```bash
+   kubectl get nodes -l node-role.kubernetes.io/control-plane=''
+   ```
+
+2. Get detailed information about the affected node:
+   ```bash
+   kubectl describe node <node-name>
+   ```
+
+3. Review system logs on the affected node:
+   ```bash
+   ssh <node-address>
+   journalctl -xeu kubelet
+   ```
+
+## Mitigation
+
+1. Check node resources:
+   - Verify CPU, memory, and disk usage
+   - Clear disk space if necessary
+   - Restart the kubelet once resource issues are resolved
+
+2. If the node is unreachable:
+   - Verify network connectivity
+   - Check physical/virtual machine status
+   - Ensure the node has power and is running
+
+3. For kubelet issues:
+   ```bash
+   systemctl status kubelet
+   systemctl restart kubelet
+   ```
+
+4. If the node cannot be recovered:
+   - If possible, safely drain the node
+   - Investigate hardware/infrastructure issues
+   - Consider replacing the node if necessary
+
+## Additional Notes
+- Maintain at least three control plane nodes for high availability
+- Monitor etcd cluster health if the affected node runs etcd
+- Document any infrastructure-specific recovery procedures
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
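
The first diagnosis step in the HAControlPlaneDown runbook above lists every control plane node and leaves it to the reader to spot the unhealthy one. A minimal sketch of narrowing that list follows. It assumes the default `kubectl get nodes --no-headers` column layout (name first, status second); the helper name `not_ready_nodes` is hypothetical and not part of the runbook.

```bash
# Hypothetical helper: print the names of nodes whose STATUS column does not
# start with "Ready". Reads `kubectl get nodes --no-headers` output on stdin.
# "Ready,SchedulingDisabled" (a cordoned node) still counts as ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes -l node-role.kubernetes.io/control-plane='' --no-headers \
#     | not_ready_nodes
```

Keeping the filter separate from the `kubectl` call makes it easy to try on saved output before running it against a cluster.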
diff --git a/docs/runbooks/HighCPUWorkload.md b/docs/runbooks/HighCPUWorkload.md
@@ -0,0 +1,66 @@
+# HighCPUWorkload
+
+## Meaning
+
+This alert fires when a node's CPU utilization exceeds 90% for more than 5 minutes.
+
+## Impact
+
+High CPU utilization can lead to:
+- Degraded performance of applications running on the node
+- Increased latency in request processing
+- Potential service disruptions if CPU usage continues to climb
+
+## Diagnosis
+
+1. Identify the affected node:
+   ```bash
+   kubectl top nodes
+   ```
+
+2. Check node resource usage:
+   ```bash
+   kubectl describe node <node-name>
+   ```
+
+3. List pods consuming high CPU:
+   ```bash
+   kubectl top pods --all-namespaces --sort-by=cpu
+   ```
+
+4. Investigate specific pod details if needed:
+   ```bash
+   kubectl describe pod <pod-name> -n <namespace>
+   ```
+
+## Mitigation
+
+1. If caused by a misbehaving pod:
+   - Check pod logs for anomalies
+   - Review pod resource limits and requests
+   - Consider restarting the pod
+
+2. If system-wide:
+   - Check for system processes consuming high CPU
+   - Consider cordoning the node and migrating workloads
+   - Evaluate if node scaling is needed
+
+3. Long-term solutions:
+   - Implement or adjust pod resource limits
+   - Consider horizontal pod autoscaling
+   - Evaluate cluster capacity and scaling needs
+
+## Additional Notes
+- Monitor the node after mitigation to ensure CPU usage returns to normal
+- Review application logs for potential root causes
+- Consider updating resource requests/limits if this is a recurring issue
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
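
The HighCPUWorkload runbook above describes a 90% threshold but does not show how to pick out nodes near it. The sketch below is one possible way, assuming a metrics server is installed and that `kubectl top nodes --no-headers` prints the CPU percentage in the third column. The helper name `hot_nodes` is hypothetical, and the percentage reported by `kubectl top` may not match the alert's own expression exactly.

```bash
# Hypothetical helper: print nodes whose CPU% exceeds a threshold.
# Reads `kubectl top nodes --no-headers` output on stdin.
# Rows without a numeric CPU% (for example "<unknown>") are skipped.
hot_nodes() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '{
    pct = $3
    sub(/%$/, "", pct)
    if (pct ~ /^[0-9]+$/ && pct + 0 > t) print $1, $3
  }'
}

# Usage against a live cluster (requires kubectl and a metrics server):
#   kubectl top nodes --no-headers | hot_nodes 90
```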
diff --git a/docs/runbooks/NodeNetworkInterfaceDown.md b/docs/runbooks/NodeNetworkInterfaceDown.md
@@ -0,0 +1,71 @@
+# NodeNetworkInterfaceDown
+
+## Meaning
+
+This alert fires when one or more network interfaces on a node have been down
+for more than 5 minutes. The alert excludes virtual ethernet (veth) devices and
+bridge tunnels.
+
+## Impact
+
+Network interface failures can lead to:
+- Reduced network connectivity for pods on the affected node
+- Potential service disruptions if critical network paths are affected
+- Degraded cluster communication if management interfaces are impacted
+
+## Diagnosis
+
+1. Identify the affected node and interfaces:
+   ```bash
+   kubectl get nodes
+   ssh <node-address>
+   ip link show | grep -i down
+   ```
+
+2. Check network interface details:
+   ```bash
+   ip addr show
+   ethtool <interface-name>
+   ```
+
+3. Review system logs for network-related issues:
+   ```bash
+   journalctl -u NetworkManager
+   dmesg | grep -i eth
+   ```
+
+## Mitigation
+
+1. For physical interface issues:
+   - Check physical cable connections
+   - Verify switch port configuration
+   - Test the interface with a different cable or port
+
+2. For software/configuration issues:
+   ```bash
+   # Restart NetworkManager
+   systemctl restart NetworkManager
+
+   # Bring interface up manually
+   ip link set <interface-name> up
+   ```
+
+3. If persistent:
+   - Check network interface configuration files
+   - Verify driver compatibility
+   - Consider replacing the hardware if a physical failure is confirmed
+
+## Additional Notes
+- Monitor interface status after mitigation
+- Document any hardware replacements or configuration changes
+- Consider implementing network redundancy for critical interfaces
+
+<!--DS: If you cannot resolve the issue, log in to the
+link:https://access.redhat.com[Customer Portal] and open a support case,
+attaching the artifacts gathered during the diagnosis procedure.-->
+<!--USstart-->
+If you cannot resolve the issue, see the following resources:
+
+- [OKD Help](https://www.okd.io/help/)
+- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
+<!--USend-->
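
The NodeNetworkInterfaceDown runbook above notes that the alert excludes veth devices, while `ip link show | grep -i down` lists them too. A minimal sketch of a narrower filter follows, assuming the one-line format of `ip -o link show`. The helper name `down_interfaces` is hypothetical. Only the veth exclusion is reproduced, because the runbook does not give the exact pattern for bridge tunnels.

```bash
# Hypothetical helper: print interfaces reported as "state DOWN", skipping
# veth devices. Reads `ip -o link show` output on stdin.
down_interfaces() {
  awk '/ state DOWN / {
    name = $2
    sub(/:$/, "", name)   # drop the trailing colon
    sub(/@.*/, "", name)  # drop the "@ifN" peer suffix
    if (name !~ /^veth/) print name
  }'
}

# Usage on the affected node:
#   ip -o link show | down_interfaces
```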