From 65d1942f65c780e14c31e855b29bf1dd6db9ee55 Mon Sep 17 00:00:00 2001
From: Berkay Tekin Oz
Date: Mon, 13 Jan 2025 11:02:51 +0000
Subject: [PATCH] Add how-to troubleshoot for charm deployments

---
 docs/src/charm/howto/index.md           |   1 +
 docs/src/charm/howto/troubleshooting.md | 244 ++++++++++++++++++++++++
 docs/src/charm/reference/index.md       |   2 +
 3 files changed, 247 insertions(+)
 create mode 100644 docs/src/charm/howto/troubleshooting.md

diff --git a/docs/src/charm/howto/index.md b/docs/src/charm/howto/index.md
index 296380636..ab6a41d7e 100644
--- a/docs/src/charm/howto/index.md
+++ b/docs/src/charm/howto/index.md
@@ -26,6 +26,7 @@ custom-registry
 Upgrade patch version
 Upgrade minor version
 Validate the cluster
+troubleshooting
 ```
 
 ---
diff --git a/docs/src/charm/howto/troubleshooting.md b/docs/src/charm/howto/troubleshooting.md
new file mode 100644
index 000000000..509ab0cb6
--- /dev/null
+++ b/docs/src/charm/howto/troubleshooting.md
@@ -0,0 +1,244 @@
+# How to troubleshoot {{product}}
+
+Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide will walk you through the steps to troubleshoot your {{product}} cluster.
+
+## Common issues
+
+Maybe your issue has already been solved? Check out the [troubleshooting reference][charm-troubleshooting-reference] page to see a list of common issues and their solutions. Otherwise continue with this guide to help troubleshoot your {{product}} cluster.
+
+## Check the cluster status
+
+Verify that the cluster status is ready by running:
+
+```
+juju status
+```
+
+You should see a command output similar to the following:
+```
+Model        Controller           Cloud/Region         Version  SLA          Timestamp
+k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z
+
+App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
+k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
+k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready
+
+Unit           Workload  Agent  Machine  Public address  Ports     Message
+k8s-worker/0*  active    idle   1        10.94.106.154             Ready
+k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready
+
+Machine  State    Address        Inst id        Base          AZ  Message
+0        started  10.94.106.136  juju-380ff2-0  ubuntu@24.04      Running
+1        started  10.94.106.154  juju-380ff2-1  ubuntu@24.04      Running
+```
+Interpreting the output:
+- The `Workload` column shows the status of a given service.
+- The `Message` section details the health of a given service in the cluster.
+- The `Agent` column reflects any activity of the Juju agent.
+
+During deployment and maintenance the workload status will reflect the node's activity. For example, the workload may display `maintenance` along with the message details: `Ensuring snap installation`.
+
+During normal cluster operation the `Workload` column reads `active`, the `Agent` column shows `idle`, and the messages will either read `Ready` or another descriptive term.
+
+## Test the API server health
+
+Fetch the kubeconfig file for a control-plane node in the cluster by running:
+
+```
+juju run k8s/leader get-kubeconfig | yq .kubeconfig > cluster-kubeconfig.yaml
+```
+
+```{warning}
+When running `juju run k8s/leader get-kubeconfig` you retrieve the kubeconfig file that uses one of the unit's public IP addresses in the Kubernetes endpoint. This endpoint IP can be overridden by providing a `server` argument if the API is exposed through a load-balancer.
+```
+
+Verify that the API server is healthy and reachable by running:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get all
+```
+
+This command lists resources that exist under the default namespace. If the API server is healthy you should see a command output similar to the following:
+
+```
+NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
+service/kubernetes   ClusterIP   10.152.183.1                 443/TCP   29m
+```
+
+A typical error message may look like this if the API server cannot be reached:
+
+```
+The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
+```
+
+Check the status of the API server service:
+
+```
+juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver
+```
+
+Access the logs of the API server service by running:
+
+```
+juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver
+```
+
+A failure can mean that:
+* The API server is not reachable due to network issues or firewall limitations
+* The API server on the particular node is unhealthy
+* The control-plane node that's being reached is down
+
+Try reaching the API server on a different unit by retrieving the kubeconfig file with `juju run get-kubeconfig`. Please replace `#` with the desired unit's number.
+
+## Check the cluster nodes' health
+
+Confirm that the nodes in the cluster are healthy by looking for the `Ready` status:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get nodes
+```
+
+You should see a command output similar to the following:
+
+```
+NAME            STATUS   ROLES                  AGE     VERSION
+juju-380ff2-0   Ready    control-plane,worker   9m30s   v1.32.0
+juju-380ff2-1   Ready    worker                 77s     v1.32.0
+```
+
+
+## Troubleshooting an unhealthy node
+
+Every healthy {{product}} node has certain services up and running. The required services depend on the type of node.
+
+Services running on both the control-plane and worker nodes:
+* `k8sd`
+* `kubelet`
+* `containerd`
+* `kube-proxy`
+
+Services running only on the control-plane nodes:
+* `kube-apiserver`
+* `kube-controller-manager`
+* `kube-scheduler`
+* `k8s-dqlite`
+
+Services running only on the worker nodes:
+* `k8s-apiserver-proxy`
+
+SSH into the unhealthy node by running:
+
+```
+juju ssh
+```
+
+Check the status of the services on the failing node by running:
+
+```
+sudo systemctl status snap.k8s.
+```
+
+Check the logs of a failing service by executing:
+
+```
+sudo journalctl -xe -u snap.k8s.
+```
+
+If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services.
+
+The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/`.
+
+## Investigating system pods' health
+
+Check whether all of the cluster's pods are `Running` and `Ready`:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get pods -n kube-system
+```
+
+The pods in the `kube-system` namespace belong to {{product}} features such as `network`. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
+
+## Troubleshooting a failing pod
+
+Look at the events on a failing pod by running:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml describe pod -n
+```
+
+Check the logs on a failing pod by executing:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml logs -n
+```
+
+You can check out the upstream [debug pods documentation][] for more information.
+
+## Using the built-in inspection script
+
+{{product}} ships with a script to compile a complete report on {{product}} and its underlying system. This is an essential tool for bug reports and for investigating whether a system is (or isn’t) working.
+
+The inspection script can be executed on a specific unit by running the following commands:
+
+```
+juju exec --unit -- sudo /snap/k8s/current/k8s/scripts/inspect.sh /home/ubuntu/inspection-report.tar.gz
+juju scp :/home/ubuntu/inspection-report.tar.gz ./
+```
+
+The command output is similar to the following:
+```
+Collecting service information
+Running inspection on a control-plane node
+  INFO: Service k8s.containerd is running
+  INFO: Service k8s.kube-proxy is running
+  INFO: Service k8s.k8s-dqlite is running
+  INFO: Service k8s.k8sd is running
+  INFO: Service k8s.kube-apiserver is running
+  INFO: Service k8s.kube-controller-manager is running
+  INFO: Service k8s.kube-scheduler is running
+  INFO: Service k8s.kubelet is running
+Collecting registry mirror logs
+Collecting service arguments
+  INFO: Copy service args to the final report tarball
+Collecting k8s cluster-info
+  INFO: Copy k8s cluster-info dump to the final report tarball
+Collecting SBOM
+  INFO: Copy SBOM to the final report tarball
+Collecting system information
+  INFO: Copy uname to the final report tarball
+  INFO: Copy snap diagnostics to the final report tarball
+  INFO: Copy k8s diagnostics to the final report tarball
+Collecting networking information
+  INFO: Copy network diagnostics to the final report tarball
+Building the report tarball
+  SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
+```
+
+Use the report to ensure that all necessary services are running and dive into every aspect of the system.
+
+## Collecting debug information
+
+To collect comprehensive debug output from your {{product}} cluster, install and run [juju-crashdump][] on a computer that has the Juju client installed. Please ensure that the current controller and model are pointing at your {{product}} deployment.
+
+```
+sudo snap install juju-crashdump --classic --channel edge
+juju-crashdump -a debug-layer -a config
+```
+
+Running the `juju-crashdump` script will generate a tarball of debug information that includes [systemd][] unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. Please include the generated tarball when filing a bug.
+
+## Reporting a bug
+If you cannot solve your issue and believe that the fault may lie in {{product}}, please [file an issue on the project repository][].
+
+Help us deal effectively with issues by including the report obtained from the inspect script, the tarball obtained from `juju-crashdump`, as well as any additional logs, and a summary of the issue.
+
+You can check out the upstream [debug documentation][] for more details on troubleshooting a Kubernetes cluster.
+
+
+
+[file an issue on the project repository]: https://github.com/canonical/k8s-operator/issues/new/choose
+[charm-troubleshooting-reference]: ../reference/troubleshooting
+[juju-crashdump]: https://github.com/juju/juju-crashdump
+[systemd]: https://systemd.io
+[debug pods documentation]: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods
+[debug documentation]: https://kubernetes.io/docs/tasks/debug
diff --git a/docs/src/charm/reference/index.md b/docs/src/charm/reference/index.md
index 48e448868..79d92a383 100644
--- a/docs/src/charm/reference/index.md
+++ b/docs/src/charm/reference/index.md
@@ -20,6 +20,8 @@
 charm-configurations
 troubleshooting
 Community actions
+troubleshooting
+
 ```
 
 ## Other documentation types