From 65d1942f65c780e14c31e855b29bf1dd6db9ee55 Mon Sep 17 00:00:00 2001
From: Berkay Tekin Oz <berkay.tekinoz@canonical.com>
Date: Mon, 13 Jan 2025 11:02:51 +0000
Subject: [PATCH] Add how-to troubleshoot for charm deployments

---
 docs/src/charm/howto/index.md           |   1 +
 docs/src/charm/howto/troubleshooting.md | 244 ++++++++++++++++++++++++
 docs/src/charm/reference/index.md       |   2 +
 3 files changed, 247 insertions(+)
 create mode 100644 docs/src/charm/howto/troubleshooting.md
diff --git a/docs/src/charm/howto/index.md b/docs/src/charm/howto/index.md
index 296380636..ab6a41d7e 100644
--- a/docs/src/charm/howto/index.md
+++ b/docs/src/charm/howto/index.md
@@ -26,6 +26,7 @@ custom-registry
 Upgrade patch version <upgrade-patch>
 Upgrade minor version <upgrade-minor>
 Validate the cluster <validate>
+troubleshooting
 ```
 
 ---
diff --git a/docs/src/charm/howto/troubleshooting.md b/docs/src/charm/howto/troubleshooting.md
new file mode 100644
index 000000000..509ab0cb6
--- /dev/null
+++ b/docs/src/charm/howto/troubleshooting.md
@@ -0,0 +1,244 @@
+# How to troubleshoot {{product}}
+
+Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.
+
+## Common issues
+
+Your issue may already have a known solution. Check the [troubleshooting reference][charm-troubleshooting-reference] page for a list of common issues and their solutions. Otherwise, continue with this guide to troubleshoot your {{product}} cluster.
+
+## Check the cluster status
+
+Verify that the cluster status is ready by running:
+
+```
+juju status
+```
+
+You should see output similar to the following:
+```
+Model        Controller           Cloud/Region         Version  SLA          Timestamp
+k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z
+
+App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
+k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
+k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready
+
+Unit           Workload  Agent  Machine  Public address  Ports     Message
+k8s-worker/0*  active    idle   1        10.94.106.154             Ready
+k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready
+
+Machine  State    Address        Inst id        Base          AZ  Message
+0        started  10.94.106.136  juju-380ff2-0  ubuntu@24.04      Running
+1        started  10.94.106.154  juju-380ff2-1  ubuntu@24.04      Running
+```
+Interpreting the output:
+- The `Workload` column shows the status of a given service.
+- The `Message` column details the health of a given service in the cluster.
+- The `Agent` column reflects any activity of the Juju agent.
+
+During deployment and maintenance the workload status reflects the node's activity. For example, a workload may display `maintenance` along with the message `Ensuring snap installation`.
+
+During normal cluster operation the `Workload` column reads `active`, the `Agent` column shows `idle`, and the messages will either read `Ready` or another descriptive term.
+
+## Test the API server health
+
+Fetch the kubeconfig file for a control-plane node in the cluster by running:
+
+```
+juju run k8s/leader get-kubeconfig | yq .kubeconfig > cluster-kubeconfig.yaml
+```
+
+```{warning}
+When running `juju run k8s/leader get-kubeconfig` you retrieve the kubeconfig file that uses one of the unit's public IP addresses as the Kubernetes endpoint. This endpoint IP can be overridden by providing a `server` argument if the API is exposed through a load balancer.
+```
+
+Verify that the API server is healthy and reachable by running:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get all
+```
+
+This command lists the resources that exist in the `default` namespace. If the API server is healthy you should see output similar to the following:
+
+```
+NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
+service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
+```
+
+If the API server cannot be reached, a typical error message looks like this:
+
+```
+The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
+```
+
+Check the status of the API server service:
+
+```
+juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver
+```
+
+Access the logs of the API server service by running:
+
+```
+juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver
+```
+
+A failure can mean that:
+* The API server is not reachable due to network issues or firewall limitations
+* The API server on the particular node is unhealthy
+* The control-plane node that's being reached is down
+
+Try reaching the API server on a different unit by retrieving the kubeconfig file with `juju run <k8s/unit#> get-kubeconfig`, replacing `#` with the desired unit's number.
+
+## Check the cluster nodes' health
+
+Confirm that the nodes in the cluster are healthy by looking for the `Ready` status:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get nodes
+```
+
+You should see output similar to the following:
+
+```
+NAME            STATUS   ROLES                  AGE     VERSION
+juju-380ff2-0   Ready    control-plane,worker   9m30s   v1.32.0
+juju-380ff2-1   Ready    worker                 77s     v1.32.0
+```
+
+
+## Troubleshooting an unhealthy node
+
+Every healthy {{product}} node has certain services up and running. The required services depend on the type of node.
+
+Services running on both control-plane and worker nodes:
+* `k8sd`
+* `kubelet`
+* `containerd`
+* `kube-proxy`
+
+Services running only on the control-plane nodes:
+* `kube-apiserver`
+* `kube-controller-manager`
+* `kube-scheduler`
+* `k8s-dqlite`
+
+Services running only on the worker nodes:
+* `k8s-apiserver-proxy`
+
+SSH into the unhealthy node by running:
+
+```
+juju ssh <k8s/unit#>
+```
+
+Check the status of the services on the failing node by running:
+
+```
+sudo systemctl status snap.k8s.<service>
+```
+
+Check the logs of a failing service by executing:
+
+```
+sudo journalctl -xe -u snap.k8s.<service>
+```
+
+If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services.
+
+The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.
+
+## Investigating system pods' health
+
+Check whether all of the cluster's system pods are `Running` and `Ready`:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml get pods -n kube-system
+```
+
+The pods in the `kube-system` namespace belong to {{product}} features such as `network`. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
+
+## Troubleshooting a failing pod
+
+Look at the events on a failing pod by running:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml describe pod <pod-name> -n <namespace>
+```
+
+Check the logs on a failing pod by executing:
+
+```
+kubectl --kubeconfig cluster-kubeconfig.yaml logs <pod-name> -n <namespace>
+```
+
+See the upstream [debug pods documentation][] for more information.
+
+## Using the built-in inspection script
+
+{{product}} ships with a script to compile a complete report on {{product}} and its underlying system. This is an essential tool for bug reports and for investigating whether a system is (or isn’t) working.
+
+Run the inspection script on a specific unit and copy the resulting report to your machine with the following commands:
+
+```
+juju exec --unit <k8s/unit#> -- sudo /snap/k8s/current/k8s/scripts/inspect.sh /home/ubuntu/inspection-report.tar.gz
+juju scp <k8s/unit#>:/home/ubuntu/inspection-report.tar.gz ./
+```
+
+The command output is similar to the following:
+```
+Collecting service information
+Running inspection on a control-plane node
+ INFO:  Service k8s.containerd is running
+ INFO:  Service k8s.kube-proxy is running
+ INFO:  Service k8s.k8s-dqlite is running
+ INFO:  Service k8s.k8sd is running
+ INFO:  Service k8s.kube-apiserver is running
+ INFO:  Service k8s.kube-controller-manager is running
+ INFO:  Service k8s.kube-scheduler is running
+ INFO:  Service k8s.kubelet is running
+Collecting registry mirror logs
+Collecting service arguments
+ INFO:  Copy service args to the final report tarball
+Collecting k8s cluster-info
+ INFO:  Copy k8s cluster-info dump to the final report tarball
+Collecting SBOM
+ INFO:  Copy SBOM to the final report tarball
+Collecting system information
+ INFO:  Copy uname to the final report tarball
+ INFO:  Copy snap diagnostics to the final report tarball
+ INFO:  Copy k8s diagnostics to the final report tarball
+Collecting networking information
+ INFO:  Copy network diagnostics to the final report tarball
+Building the report tarball
+ SUCCESS:  Report tarball is at /home/ubuntu/inspection-report.tar.gz
+```
+
+Use the report to confirm that all necessary services are running and to examine each aspect of the system in detail.
+
+## Collecting debug information
+
+To collect comprehensive debug output from your {{product}} cluster, install and run [juju-crashdump][] on a computer that has the Juju client installed. Please ensure that the current controller and model are pointing at your {{product}} deployment.
+
+```
+sudo snap install juju-crashdump --classic --channel edge
+juju-crashdump -a debug-layer -a config
+```
+
+Running the `juju-crashdump` script will generate a tarball of debug information that includes [systemd][] unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. Please include the generated tarball when filing a bug.
+
+## Reporting a bug
+If you cannot solve your issue and believe that the fault may lie in {{product}}, please [file an issue on the project repository][].
+
+Help us deal effectively with issues by including the report obtained from the inspection script, the tarball obtained from `juju-crashdump`, any additional logs, and a summary of the issue.
+
+See the upstream [debug documentation][] for more details on troubleshooting a Kubernetes cluster.
+
+<!-- Links -->
+
+[file an issue on the project repository]: https://github.com/canonical/k8s-operator/issues/new/choose
+[charm-troubleshooting-reference]: ../reference/troubleshooting
+[juju-crashdump]: https://github.com/juju/juju-crashdump
+[systemd]: https://systemd.io
+[debug pods documentation]: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods
+[debug documentation]: https://kubernetes.io/docs/tasks/debug
diff --git a/docs/src/charm/reference/index.md b/docs/src/charm/reference/index.md
index 48e448868..79d92a383 100644
--- a/docs/src/charm/reference/index.md
+++ b/docs/src/charm/reference/index.md
@@ -20,6 +20,8 @@ charm-configurations
 troubleshooting
 Community <community>
 actions
+troubleshooting
+
 ```
 
 ## Other documentation types