linode · prajwalvathreya · Oct 9, 2024 · Sep 23, 2024 · Sep 23, 2024 · Sep 24, 2024
@@ -65,6 +65,10 @@ HELM_VERSION         ?= "v0.2.1"
 CAPL_VERSION         ?= "v0.6.4"
 CONTROLPLANE_NODES   ?= 1
 WORKER_NODES         ?= 1
+GRAFANA_PORT ?= 3000
+GRAFANA_USERNAME ?= admin
+GRAFANA_PASSWORD ?= admin
+DATA_RETENTION_PERIOD ?= 15d  # Prometheus data retention period
 
 .PHONY: build
 build:
@@ -185,3 +189,27 @@ release:
 	cp ./internal/driver/deploy/releases/linode-blockstorage-csi-driver-$(IMAGE_VERSION).yaml ./$(RELEASE_DIR)
 	sed -e 's/appVersion: "latest"/appVersion: "$(IMAGE_VERSION)"/g' ./helm-chart/csi-driver/Chart.yaml
 	tar -czvf ./$(RELEASE_DIR)/helm-chart-$(IMAGE_VERSION).tgz -C ./helm-chart/csi-driver .
+
+#####################################################################
+# Grafana Dashboard End-to-End Installation
+#####################################################################
+.PHONY: grafana-dashboard
+grafana-dashboard: install-prometheus install-grafana setup-dashboard
+
+#####################################################################
+# Monitoring Tools Installation
+#####################################################################
+.PHONY: install-prometheus
+install-prometheus:
+	KUBECONFIG=test-cluster-kubeconfig.yaml DATA_RETENTION_PERIOD=$(DATA_RETENTION_PERIOD) \
+		./hack/install-prometheus.sh --timeout=600s
+
+.PHONY: install-grafana
+install-grafana:
+	KUBECONFIG=test-cluster-kubeconfig.yaml GRAFANA_PORT=$(GRAFANA_PORT) \
+		GRAFANA_USERNAME=$(GRAFANA_USERNAME) GRAFANA_PASSWORD=$(GRAFANA_PASSWORD) \
+		./hack/install-grafana.sh --timeout=600s
+
+.PHONY: setup-dashboard
+setup-dashboard:
+	KUBECONFIG=test-cluster-kubeconfig.yaml ./hack/setup-dashboard.sh --namespace=monitoring --dashboard-file=observability/metrics/dashboard.json
@@ -26,6 +26,8 @@
   - [Creating a Development Cluster](docs/development-setup.md#️-creating-a-development-cluster)
   - [Running E2E Tests](docs/testing.md)
   - [Contributing](docs/contributing.md)
+- [Observability](docs/observability.md)
+  - [Metrics](docs/metrics-documentation.md)
 - [License](#license)
 - [Disclaimers](#-disclaimers)
 - [Community](#-join-us-on-slack)

@@ -0,0 +1,169 @@
+## Grafana Dashboard Documentation: **CSI Driver Metrics**
+
+### 1. **Introduction**
+This Grafana dashboard provides an in-depth view of the CSI Driver operations for Linode Block Storage, with real-time data on volume creation, deletion, publication, and expansion. It also tracks persistent volume claims and potential runtime errors. The data is sourced from Prometheus, making it ideal for monitoring and diagnosing issues with CSI Driver operations.
+
+### 2. **Dashboard Structure**
+The dashboard is divided into several panels. Each panel focuses on a different aspect of CSI Driver operations, including Create/Delete/Publish Volume requests, runtime operation errors, and Persistent Volume (PV) and Persistent Volume Claim (PVC) events.
+
+---
+
+### 3. **Key Metrics and Visualizations with Graphs**
+
+---
+
+##### **Key points for understanding the graphs**:
+
+- The y-axis is scaled by 1000. To get the actual value, multiply the displayed decimal by 1000.
+- Graphs that show total time taken report it in `seconds`.
+- The example graphs are plotted over a 48-hour period, so the x-axis shows both date and time.
+- The spikes visible in the graphs occurred during e2e test runs.
+
+---
+
+#### **Controller Create Volume**
+
+- **Create Volume Requests**  
+   - **Description**: Displays the total number of volume creation requests made to the CSI Driver.  
+   - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/CreateVolume"}`
+   - **Graph**:  
+   ![Create Volume Request](example-images/create-volume-request.jpg)
+   - **Explanation**: This graph shows the cumulative number of volume creation requests over time. Sharp increases indicate increased provisioning activity.
+
+- **Total Time Taken to Create Volume**  
+   - **Description**: Displays the cumulative time taken to create volumes.  
+   - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/CreateVolume"}`
+   - **Graph**:  
+   ![Total Time to Create Volume](example-images/tt-create-volume-request.jpg)
+   - **Explanation**: Tracks the total amount of time spent creating volumes, useful for identifying delays in provisioning.
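
Because `csi_sidecar_operations_seconds_count` and `csi_sidecar_operations_seconds_sum` are cumulative series, a per-second request rate and an average duration per request can be derived from them with PromQL's `rate()`. The sketch below only prints such expressions for a given method. The helper names and the 5-minute window are illustrative assumptions and are not part of the dashboard.

```shell
# Print PromQL expressions derived from the cumulative CSI sidecar series.
# Helper names and the 5m window are illustrative assumptions.

csi_request_rate() {
  # Requests per second for the given gRPC method name.
  printf 'rate(csi_sidecar_operations_seconds_count{method_name="%s"}[5m])\n' "$1"
}

csi_avg_duration() {
  # Average seconds per request: rate of _sum divided by rate of _count.
  printf 'rate(csi_sidecar_operations_seconds_sum{method_name="%s"}[5m]) / rate(csi_sidecar_operations_seconds_count{method_name="%s"}[5m])\n' "$1" "$1"
}

csi_avg_duration "/csi.v1.Controller/CreateVolume"
```

The printed expression can be pasted into a Grafana panel query or the Prometheus expression browser. The same pattern applies to the other method names listed in this document.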
+
+---
+
+#### **Controller Delete Volume**
+
+- **Delete Volume Requests**  
+   - **Description**: Shows the number of requests to delete volumes through the CSI Driver.  
+   - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/DeleteVolume"}`
+   - **Graph**:  
+   ![Delete Volume Request](example-images/delete-volume-request.jpg)
+   - **Explanation**: This graph tracks how often volumes are deleted. A consistent increase means regular cleanup of resources.
+
+- **Total Time Taken to Delete Volume**  
+   - **Description**: Tracks the time spent deleting volumes.  
+   - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/DeleteVolume"}`
+   - **Graph**:  
+   ![Total Time to Delete Volume](example-images/tt-delete-volume-request.jpg)
+   - **Explanation**: Shows the time taken to delete volumes, highlighting the efficiency of resource cleanup operations.
+
+---
+
+#### **Controller Expand Volume**
+
+- **Expand Volume Requests**  
+   - **Description**: Monitors requests to expand volumes.  
+   - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
+   - **Graph**:  
+   ![Expand Volume Request](example-images/expand-volume-request.jpg)
+   - **Explanation**: This graph tracks how frequently volume expansion operations occur.
+
+- **Total Time Taken to Expand Volume**  
+   - **Description**: Displays the cumulative time taken to expand volumes.  
+   - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
+   - **Graph**:  
+   ![Total Time to Expand Volume](example-images/tt-expand-volume-request.jpg)
+   - **Explanation**: Tracks the total time taken to expand volumes.
+
+---
+
+#### **Controller Publish Volume**
+
+- **Publish Volume Requests**  
+   - **Description**: The number of requests made to attach or publish volumes to nodes.  
+   - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
+   - **Graph**:  
+   ![Publish Volume Request](example-images/publish-volume-request.jpg)
+   - **Explanation**: This graph tracks how often volumes are published (attached) to nodes, indicating attach operations.
+
+- **Total Time Taken to Publish Volume**  
+   - **Description**: Displays the cumulative time taken to publish (attach) volumes to nodes.  
+   - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
+   - **Graph**:  
+   ![Total Time to Publish Volume](example-images/tt-publish-volume-request.jpg)
+   - **Explanation**: Tracks the total time spent publishing volumes to nodes.
+
+---
+
+#### **Controller Unpublish Volume**
+
+- **Unpublish Volume Requests**  
+    - **Description**: Tracks the number of requests to unpublish volumes.  
+    - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
+    - **Graph**:  
+    ![Unpublish Volume Requests](example-images/unpublish-volume-request.jpg)
+    - **Explanation**: This graph shows how frequently volumes are unpublished (detached) from nodes.
+
+- **Total Time Taken to Unpublish Volume**  
+    - **Description**: Displays the cumulative time taken to unpublish (detach) volumes from nodes.  
+    - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
+    - **Graph**:  
+    ![Total Time to Unpublish Volume](example-images/tt-unpublish-volume-request.jpg)
+    - **Explanation**: Tracks the total time spent unpublishing volumes from nodes.
+
+---
+
+### 4. **Additional Metrics**
+
+---
+
+#### **Persistent Volumes (PV)**
+
+- **Description**: Displays the total number of PV-related events that the CSI controller processed.  
+- **Query**: `workqueue_adds_total{name="volumes"}`
+- **Graph**:  
+![Persistent Volumes](example-images/pv.jpg)
+- **Explanation**: This graph shows how many PV requests were made, indicating the provisioning of new storage resources.
+
+---
+
+#### **Volume Claims (PVC)**
+
+- **Description**: Tracks the number of PVC-related events that the controller reconciles.  
+- **Query**: `workqueue_adds_total{name="claims"}`
+- **Graph**:  
+![PVC Events](example-images/pvc.jpg)
+- **Explanation**: This graph tracks PVC-related events, providing insights into the frequency of new claims or bindings.
+
+---
+
+#### **Runtime Operation Errors**
+
+- **Description**: Visualizes runtime operation errors reported by the kubelet, which can surface problems affecting CSI Driver operations.  
+- **Query**: `kubelet_runtime_operations_errors_total`
+- **Graph**:  
+![Runtime Operation Errors](example-images/runtime-error.jpg)
+- **Explanation**: A rise in runtime errors indicates potential issues within the Kubernetes nodes or the CSI components.
+
+---
+
+#### **CSI Sidecar Operations Seconds Sum**
+
+- **Description**: Shows the cumulative time taken for operations handled by CSI sidecars (attacher, provisioner, etc.).  
+- **Query**: `csi_sidecar_operations_seconds_sum`
+- **Graph**:  
+![Sidecar Operations Seconds Sum](example-images/sidecar-operations-time-sum.jpg)
+- **Explanation**: This graph tracks the total time consumed by all CSI operations, helping identify potential bottlenecks.
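
The queries in this document can also be run outside Grafana through the Prometheus HTTP API, which helps confirm that the series exist before debugging a panel. The sketch below assumes Prometheus has been made reachable at `http://localhost:9090` (for example through a port-forward). `PROM_URL` and `prom_query` are hypothetical names introduced for illustration.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption: Prometheus must be reachable at this address.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G sends the URL-encoded query as a GET parameter.
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example (requires a reachable Prometheus):
#   prom_query 'csi_sidecar_operations_seconds_sum'
```

An empty `result` array in the JSON response suggests the metric is not being scraped yet.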
+
+---
+
+### 5. **Missing Metrics / Future Scope**
+
+---
+
+#### **Volume Utilization Metrics**:
+- **Volume Size**: Track the size of volumes currently in use to better understand resource consumption.
+- **Potential Implementation**: Metrics could be added to track how much space is being utilized by each volume, ensuring optimal usage and highlighting volumes nearing full capacity.
+
+#### **Node Metrics**:
+- **Node Attachments**: Track the total number of volumes attached to each node.
+- **Node Publish/Unpublish**: Track how often volumes are published (mounted) and unpublished (unmounted) on nodes, giving better visibility into volume mounting and unmounting operations.
+- **Node Stage/Unstage**: Monitor staging and unstaging operations to identify any potential delays or issues when preparing a volume for use on a node.
@@ -0,0 +1,152 @@
+# Observability with Grafana Dashboard
+
+This document explains how to use the `grafana-dashboard` make target to install and configure observability tools, including Prometheus and Grafana, on your Kubernetes cluster. The setup installs Prometheus and Grafana from Helm charts, configures Prometheus as the Grafana data source, and applies a Grafana dashboard configuration.
+
+## Prerequisites
+
+Ensure the following are available:
+- **Kubernetes**: A running Kubernetes cluster.
+- **kubectl**: To manage the cluster.
+- **Helm**: To install and manage Helm charts for Prometheus and Grafana.
+
+You should also have access to the Kubernetes cluster's kubeconfig file (`test-cluster-kubeconfig.yaml`), which will be used for running the make target.
+
+---
+
+## Steps to Opt-In for the CSI Driver Metrics
+
+To enable the metrics collection for the Linode CSI driver, follow the steps below. These steps involve exporting a new Helm template with metrics enabled, deleting the current CSI driver release, and applying the newly generated configuration.
+
+### 1. Export the Helm Template for the CSI Driver with Metrics Enabled
+
+First, you need to generate a new Helm template for the Linode CSI driver with the `enable_metrics` flag set to `true`. This ensures that the CSI driver is configured to expose its metrics.
+
+```bash
+helm template linode-csi-driver \
+  --set apiToken="${LINODE_API_TOKEN}" \
+  --set region="${REGION}" \
+  --set enable_metrics=true \
+  helm-chart/csi-driver --namespace kube-system > csi.yaml
+```
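
The command above reads `LINODE_API_TOKEN` and `REGION` from the environment, and an unset variable would silently render an empty value into `csi.yaml`. A small pre-flight check can catch that first. `require_env` is a hypothetical helper, not something shipped with the chart.

```shell
# Return non-zero if any of the named variables is unset or empty.
# require_env is a hypothetical helper introduced for illustration.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example:
#   require_env LINODE_API_TOKEN REGION && helm template linode-csi-driver ...
```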
+
+### 2. Delete the Existing Release of the CSI Driver
+
+Before applying the new configuration, you need to delete the current release of the Linode CSI driver. This step is necessary because the default CSI driver installation does not have metrics enabled, and Helm doesn’t handle changes to some components gracefully without a clean reinstall.
+
+```bash
+kubectl delete -f csi.yaml --namespace kube-system
+```
+
+### 3. Apply the Newly Generated Template
+
+Once the old CSI driver installation is deleted, you can apply the newly generated template that includes the metrics configuration.
+
+```bash
+kubectl apply -f csi.yaml
+```
+
+## Steps to Install the Grafana Dashboard
+
+### 1. Build and Set Up the Cluster (Optional)
+If you haven’t already set up your Kubernetes cluster with the necessary CSI driver and Prometheus metrics services, you can do so by running the following command:
+```bash
+make mgmt-and-capl-cluster
+```
+This command creates a management cluster and a CAPL (Cluster API Provider Linode) cluster, installs the Linode CSI driver, and applies the necessary configurations to expose the CSI metrics.
+
+### 2. Run the Grafana Dashboard Setup
+The `grafana-dashboard` make target combines the installation of Prometheus, Grafana, and the dashboard configuration. It ensures that Prometheus is installed and connected to Grafana, and that a pre-configured dashboard is applied. To execute this setup, run:
+
+```bash
+make grafana-dashboard
+```
+
+#### What Happens During the Setup?
+
+This target combines three separate make targets:
+1. **`install-prometheus`**: Installs Prometheus using a Helm chart in the `monitoring` namespace. Prometheus is configured to scrape metrics from the CSI driver and other services.
+2. **`install-grafana`**: Installs Grafana using a Helm chart in the `monitoring` namespace, with Prometheus as its data source.
+3. **`setup-dashboard`**: Sets up a pre-configured Grafana dashboard by applying a ConfigMap containing the dashboard JSON (`observability/metrics/dashboard.json`).
+
+### 3. Accessing the Grafana Dashboard
+
+Once the setup is complete, you can access the Grafana dashboard through the configured LoadBalancer service. After the setup script runs, the external IP of the LoadBalancer is printed, and you can access Grafana by opening the following URL in your browser:
+
+```
+http://<LoadBalancer-EXTERNAL-IP>
+```
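
If the script output is no longer available, the external IP can be read from the Service status. The sketch below assumes `KUBECONFIG` already points at the cluster. The Service name depends on the Helm release and is not fixed by this guide, so it is passed in by the caller. `grafana_url` is a hypothetical helper.

```shell
# Print the Grafana URL for a LoadBalancer Service.
# Arguments: namespace, Service name (the Service name is supplied by the
# caller; list candidates with: kubectl get svc -n monitoring).
grafana_url() {
  ip="$(kubectl get svc "$2" -n "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  if [ -z "${ip}" ]; then
    echo "no external IP assigned yet for ${2} in ${1}" >&2
    return 1
  fi
  printf 'http://%s\n' "${ip}"
}

# Example:
#   grafana_url monitoring <grafana-service-name>
```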
+
+Log in using the following credentials:
+- Username: `admin`
+- Password: `admin`
+
+These credentials can be customized by overriding the `GRAFANA_USERNAME` and `GRAFANA_PASSWORD` make variables, for example `make grafana-dashboard GRAFANA_USERNAME=<user> GRAFANA_PASSWORD=<password>`.
+
+### 4. Stopping the Port Forwarding (if used)
+
+If you are using port forwarding instead of a LoadBalancer, and you wish to stop the forwarding, run:
+```bash
+kill <PID>
+```
+Replace `<PID>` with the process ID provided by the script during the setup.
+
+If you do not have access to the script output, run:
+```bash
+ps -ef | grep 'kubectl port-forward' | grep -v grep
+```
+This lists the matching processes along with their `PID`s.
+
+## Customizing the Setup
+
+- **Namespace**: The default namespace for the observability tools is `monitoring`. You can modify this by passing the `--namespace` flag or by changing the `NAMESPACE` variable in the relevant installation script under `hack/`.
+
+- **Grafana Dashboard Configuration**: The default dashboard configuration is stored in `observability/metrics/dashboard.json`. To apply a different dashboard, replace the contents of this file before running the `make grafana-dashboard` target.
+
+- **Prometheus Data Source**: The default data source is Prometheus, as defined in the Helm chart configuration. If you wish to use a different data source, modify the `helm upgrade` command in `hack/install-grafana.sh`.
+
+## Makefile Targets
+
+### `install-prometheus`
+Installs Prometheus in the `monitoring` namespace using a Helm chart. Prometheus scrapes metrics from the CSI driver and other services in the cluster.
+
+```bash
+make install-prometheus
+```
+
+### `install-grafana`
+Installs Grafana in the `monitoring` namespace using a Helm chart. Prometheus is set as the data source for Grafana.
+
+```bash
+make install-grafana
+```
+
+### `setup-dashboard`
+Sets up the pre-configured Grafana dashboard by applying a ConfigMap containing the dashboard JSON. This ConfigMap is created from the `observability/metrics/dashboard.json` file.
+
+```bash
+make setup-dashboard
+```
+
+### `grafana-dashboard`
+This is a combined target that installs Prometheus, Grafana, and configures the Grafana dashboard. It runs the `install-prometheus`, `install-grafana`, and `setup-dashboard` targets sequentially.
+
+```bash
+make grafana-dashboard
+```
+
+## Troubleshooting
+
+If you encounter issues during the installation process, check the logs and status of the Prometheus and Grafana pods:
+```bash
+kubectl get pods -n monitoring
+kubectl logs <prometheus-pod-name> -n monitoring
+kubectl logs <grafana-pod-name> -n monitoring
+```
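
When it is unclear which pod to inspect, listing pods that are not in the `Running` phase together with recent events can narrow things down. This is a minimal sketch. `monitoring_triage` is a hypothetical helper, and it defaults to the `monitoring` namespace used by the make targets.

```shell
# Show pods that are not Running and the most recent events in a namespace.
# Note: completed pods (phase Succeeded) are also listed by this selector.
monitoring_triage() {
  ns="${1:-monitoring}"
  kubectl get pods -n "${ns}" --field-selector='status.phase!=Running'
  kubectl get events -n "${ns}" --sort-by=.lastTimestamp | tail -n 20
}

# Example:
#   KUBECONFIG=test-cluster-kubeconfig.yaml monitoring_triage
```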
+
+This setup provides a quick and easy way to enable observability using Grafana dashboards, ensuring that you have visibility into your Kubernetes cluster and CSI driver operations.
+