Metrics investigation for CSI driver and dashboard creation on Grafana #274

Merged
merged 36 commits into from
Oct 9, 2024
Changes from 31 commits
8387728
Added ports for the sidecars to allow prometheus to scrape the metrics
prajwalvathreya Sep 23, 2024
0c39d75
Fixed error in linode-csi-plugin container due to incorrect metrics port
prajwalvathreya Sep 23, 2024
68b9b98
Added documentation and example graphs for metrics in the csi-driver.
prajwalvathreya Sep 24, 2024
361c077
updated graphs and documentation.
prajwalvathreya Sep 25, 2024
0a4609e
added line break after future scope to keep the doc consistent
prajwalvathreya Sep 25, 2024
4df77e3
added additional node metrics
prajwalvathreya Sep 25, 2024
06d04b3
added clarification on the unit of measurement of time
prajwalvathreya Sep 25, 2024
fcf133b
fixed typo
prajwalvathreya Sep 25, 2024
aedc721
Moved metrics-documentation.md and example-images folder to the docs …
prajwalvathreya Sep 25, 2024
1db1a39
Merge branch 'refs/heads/main' into metrics-endpoint
prajwalvathreya Sep 26, 2024
28691b8
- Created make target for creating a grafana-dashboard
prajwalvathreya Sep 30, 2024
7f133cc
- created services to expose metrics to prometheus
prajwalvathreya Sep 30, 2024
cb43eff
- updated install script to run process in the background
prajwalvathreya Oct 1, 2024
4b09fbf
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 1, 2024
c10ee59
- fixed conflict in Makefile
prajwalvathreya Oct 1, 2024
960c574
Update hack/install-monitoring-tools.sh
prajwalvathreya Oct 1, 2024
016edfa
Updated the syntax of passing the CLUSTER_NAME variable
prajwalvathreya Oct 1, 2024
efeea6e
Update hack/install-monitoring-tools.sh, namespace creation
prajwalvathreya Oct 1, 2024
9370d03
Update hack/install-monitoring-tools.sh Grafana helm chart update
prajwalvathreya Oct 1, 2024
5d92fbb
Update hack/install-monitoring-tools.sh Prometheus helm chart update
prajwalvathreya Oct 1, 2024
c0a1c56
- added environment variables for username, password, data retention …
prajwalvathreya Oct 1, 2024
069b55d
- removed echo used for debugging
prajwalvathreya Oct 1, 2024
802245f
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 2, 2024
9bc3beb
- updated the script to 3 make targets
prajwalvathreya Oct 3, 2024
c120fa7
- updated templates to opt in to install using helm
prajwalvathreya Oct 4, 2024
3276b46
- fixed container port mapping which was causing containers to crash …
prajwalvathreya Oct 4, 2024
9e18c05
- resolving Makefile conflict
prajwalvathreya Oct 7, 2024
ceac735
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 7, 2024
2977e89
- updated to helm chart to expose drivers based on passed flag `enabl…
prajwalvathreya Oct 7, 2024
9380f5d
- updated documentation to explain how to use the helm chart to enabl…
prajwalvathreya Oct 7, 2024
6fe3e49
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 7, 2024
9d97d6e
- made changes to install metrics services through helm chart
prajwalvathreya Oct 8, 2024
79750ca
- reverted csi-driver image to latest
prajwalvathreya Oct 8, 2024
07c63df
- updated documentation to explain modifications to make targets
prajwalvathreya Oct 8, 2024
6ac52d6
- updated comment to a more sensible one
prajwalvathreya Oct 8, 2024
62e806d
- updated documentation to be less verbose
prajwalvathreya Oct 8, 2024
28 changes: 28 additions & 0 deletions Makefile
@@ -65,6 +65,10 @@ HELM_VERSION ?= "v0.2.1"
CAPL_VERSION ?= "v0.6.4"
CONTROLPLANE_NODES ?= 1
WORKER_NODES ?= 1
GRAFANA_PORT ?= 3000
GRAFANA_USERNAME ?= admin
GRAFANA_PASSWORD ?= admin
# Prometheus data retention period (kept on its own line so the value has no trailing whitespace)
DATA_RETENTION_PERIOD ?= 15d

.PHONY: build
build:
@@ -185,3 +189,27 @@ release:
	cp ./internal/driver/deploy/releases/linode-blockstorage-csi-driver-$(IMAGE_VERSION).yaml ./$(RELEASE_DIR)
	sed -e 's/appVersion: "latest"/appVersion: "$(IMAGE_VERSION)"/g' ./helm-chart/csi-driver/Chart.yaml
	tar -czvf ./$(RELEASE_DIR)/helm-chart-$(IMAGE_VERSION).tgz -C ./helm-chart/csi-driver .

#####################################################################
# Grafana Dashboard End-to-End Installation
#####################################################################
.PHONY: grafana-dashboard
grafana-dashboard: install-prometheus install-grafana setup-dashboard

#####################################################################
# Monitoring Tools Installation
#####################################################################
.PHONY: install-prometheus
install-prometheus:
	KUBECONFIG=test-cluster-kubeconfig.yaml DATA_RETENTION_PERIOD=$(DATA_RETENTION_PERIOD) \
		./hack/install-prometheus.sh --timeout=600s

.PHONY: install-grafana
install-grafana:
	KUBECONFIG=test-cluster-kubeconfig.yaml GRAFANA_PORT=$(GRAFANA_PORT) \
		GRAFANA_USERNAME=$(GRAFANA_USERNAME) GRAFANA_PASSWORD=$(GRAFANA_PASSWORD) \
		./hack/install-grafana.sh --timeout=600s

.PHONY: setup-dashboard
setup-dashboard:
	KUBECONFIG=test-cluster-kubeconfig.yaml ./hack/setup-dashboard.sh --namespace=monitoring --dashboard-file=observability/metrics/dashboard.json
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,8 @@
- [Creating a Development Cluster](docs/development-setup.md#️-creating-a-development-cluster)
- [Running E2E Tests](docs/testing.md)
- [Contributing](docs/contributing.md)
- [Observability](docs/observability.md)
- [Metrics](docs/metrics-documentation.md)
- [License](#license)
- [Disclaimers](#-disclaimers)
- [Community](#-join-us-on-slack)
Binary file added docs/example-images/create-volume-request.jpg
Binary file added docs/example-images/delete-volume-request.jpg
Binary file added docs/example-images/expand-volume-request.jpg
Binary file added docs/example-images/publish-volume-request.jpg
Binary file added docs/example-images/pv.jpg
Binary file added docs/example-images/pvc.jpg
Binary file added docs/example-images/runtime-error.jpg
169 changes: 169 additions & 0 deletions docs/metrics-documentation.md
@@ -0,0 +1,169 @@
## Grafana Dashboard Documentation: **CSI Driver Metrics**

### 1. **Introduction**
This Grafana dashboard provides an in-depth view of the CSI Driver operations for Linode Block Storage, with real-time data on volume creation, deletion, publication, and expansion. It also tracks persistent volume claims and potential runtime errors. The data is sourced from Prometheus, making it ideal for monitoring and diagnosing issues with CSI Driver operations.

### 2. **Dashboard Structure**
The dashboard is divided into several panels. Each panel focuses on a different aspect of CSI Driver operations, including Create/Delete/Publish Volume requests, runtime operation errors, and Persistent Volume (PV) and Persistent Volume Claim (PVC) events.

---

### 3. **Key Metrics and Visualizations with Graphs**

---

##### **Key points for understanding the graphs**:

- The y-axis is scaled by 1000. To get the actual value, multiply the displayed decimal by 1000.
- Graphs that show total time taken report the time in `seconds`.
- The example graphs cover a period of 48 hours, so the x-axis shows both date and time.
- The spikes in the example graphs occurred during e2e test runs.

---

#### **Controller Create Volume**

- **Create Volume Requests**
  - **Description**: Displays the total number of volume creation requests made to the CSI Driver.
  - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/CreateVolume"}`
  - **Graph**:
    ![Create Volume Request](example-images/create-volume-request.jpg)
  - **Explanation**: This graph shows the count of volume creation requests over time (the query plots the raw counter, not a per-second rate). Spikes indicate increased provisioning activity.

- **Total Time Taken to Create Volume**
  - **Description**: Displays the cumulative time taken to create volumes.
  - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/CreateVolume"}`
  - **Graph**:
    ![Total Time to Create Volume](example-images/tt-create-volume-request.jpg)
  - **Explanation**: Tracks the total amount of time spent creating volumes, which is useful for identifying delays in provisioning.

---

#### **Controller Delete Volume**

- **Delete Volume Requests**
  - **Description**: Shows the number of requests to delete volumes through the CSI Driver.
  - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/DeleteVolume"}`
  - **Graph**:
    ![Delete Volume Request](example-images/delete-volume-request.jpg)
  - **Explanation**: This graph tracks how often volumes are deleted. A consistent increase means regular cleanup of resources.

- **Total Time Taken to Delete Volume**
  - **Description**: Tracks the time spent deleting volumes.
  - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/DeleteVolume"}`
  - **Graph**:
    ![Total Time to Delete Volume](example-images/tt-delete-volume-request.jpg)
  - **Explanation**: Shows the time taken to delete volumes, highlighting the efficiency of resource cleanup operations.

---

#### **Controller Expand Volume**

- **Expand Volume Requests**
  - **Description**: Monitors requests to expand volumes.
  - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
  - **Graph**:
    ![Expand Volume Request](example-images/expand-volume-request.jpg)
  - **Explanation**: This graph tracks how frequently volume expansion operations occur.

- **Total Time Taken to Expand Volume**
  - **Description**: Displays the cumulative time taken to expand volumes.
  - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
  - **Graph**:
    ![Total Time to Expand Volume](example-images/tt-expand-volume-request.jpg)
  - **Explanation**: Tracks the total time taken to expand volumes.

---

#### **Controller Publish Volume**

- **Publish Volume Requests**
  - **Description**: The number of requests made to publish (attach) volumes to nodes.
  - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
  - **Graph**:
    ![Publish Volume Request](example-images/publish-volume-request.jpg)
  - **Explanation**: This graph tracks how often volumes are published (attached) to nodes, indicating attach operations.

- **Total Time Taken to Publish Volume**
  - **Description**: Displays the cumulative time taken to publish (attach) volumes to nodes.
  - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
  - **Graph**:
    ![Total Time to Publish Volume](example-images/tt-publish-volume-request.jpg)
  - **Explanation**: Tracks the total time spent publishing volumes to nodes.

---

#### **Controller Unpublish Volume**

- **Unpublish Volume Requests**
  - **Description**: Tracks the number of requests to unpublish volumes.
  - **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
  - **Graph**:
    ![Unpublish Volume Requests](example-images/unpublish-volume-request.jpg)
  - **Explanation**: This graph shows how frequently volumes are unpublished (detached) from nodes.

- **Total Time Taken to Unpublish Volume**
  - **Description**: Displays the cumulative time taken to unpublish (detach) volumes from nodes.
  - **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
  - **Graph**:
    ![Total Time to Unpublish Volume](example-images/tt-unpublish-volume-request.jpg)
  - **Explanation**: Tracks the total time spent unpublishing volumes from nodes.

---

### 4. **Additional Metrics**

---

#### **Persistent Volumes (PV)**

- **Description**: Displays the total number of PV-related events that the CSI controller processed.
- **Query**: `workqueue_adds_total{name="volumes"}`
- **Graph**:
![Persistent Volumes](example-images/pv.jpg)
- **Explanation**: This graph shows how many PV requests were made, indicating the provisioning of new storage resources.

---

#### **Volume Claims (PVC)**

- **Description**: Tracks the number of PVC-related events that the controller reconciles.
- **Query**: `workqueue_adds_total{name="claims"}`
- **Graph**:
![PVC Events](example-images/pvc.jpg)
- **Explanation**: This graph tracks PVC-related events, providing insights into the frequency of new claims or bindings.

---

#### **Runtime Operation Errors**

- **Description**: Visualizes runtime operation errors reported by the kubelet on each node. This is a node-level metric rather than one emitted by the CSI Driver itself.
- **Query**: `kubelet_runtime_operations_errors_total`
- **Graph**:
![Runtime Operation Errors](example-images/runtime-error.jpg)
- **Explanation**: A rise in runtime errors indicates potential issues within the Kubernetes nodes or the CSI components.

---

#### **CSI Sidecar Operations Seconds Sum**

- **Description**: Shows the cumulative time taken for operations handled by CSI sidecars (attacher, provisioner, etc.).
- **Query**: `csi_sidecar_operations_seconds_sum`
- **Graph**:
![Sidecar Operations Seconds Sum](example-images/sidecar-operations-time-sum.jpg)
- **Explanation**: This graph tracks the total time consumed by all CSI operations, helping identify potential bottlenecks.
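
Because the `_sum` and `_count` series used throughout this document are cumulative counters, one common way to turn them into an average duration per call is to divide their rates. The following is a minimal sketch that builds such a PromQL expression for any CSI method. The helper name `avg_latency_query` and the `5m` default window are illustrative assumptions and are not part of the dashboard in this change.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a PromQL expression for the average duration
# (in seconds) of one CSI method over a rate window (default: 5m).
avg_latency_query() {
  local method="$1" window="${2:-5m}"
  printf 'rate(csi_sidecar_operations_seconds_sum{method_name="%s"}[%s]) / rate(csi_sidecar_operations_seconds_count{method_name="%s"}[%s])\n' \
    "$method" "$window" "$method" "$window"
}

# Print the expression for CreateVolume with the default window.
avg_latency_query "/csi.v1.Controller/CreateVolume"
```

The printed expression can be pasted into a Grafana panel or the Prometheus expression browser. A shorter window reacts faster to changes, while a longer one smooths out noise.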

---

### 5. **Missing Metrics / Future Scope**

---

#### **Volume Utilization Metrics**:
- **Volume Size**: Track the size of volumes currently in use to better understand resource consumption.
- **Potential Implementation**: Metrics could be added to track how much space is being utilized by each volume, ensuring optimal usage and highlighting volumes nearing full capacity.

#### **Node Metrics**:
- **Node Attachments**: Track the total number of volumes attached to each node.
- **Node Publish/Unpublish**: Track how often volumes are published (mounted) and unpublished (unmounted) on nodes, giving better visibility into volume mounting and unmounting operations.
- **Node Stage/Unstage**: Monitor staging and unstaging operations to identify any potential delays or issues when preparing a volume for use on a node.
152 changes: 152 additions & 0 deletions docs/observability.md
@@ -0,0 +1,152 @@
# Observability with Grafana Dashboard

This document explains how to use the `grafana-dashboard` make target to install and configure observability tools, including Prometheus and Grafana, on your Kubernetes cluster. The setup uses Helm charts to install Prometheus and Grafana, provides a Prometheus data source, and applies a Grafana dashboard configuration.

## Prerequisites

Ensure the following are available:
- **Kubernetes**: A running Kubernetes cluster.
- **kubectl**: To manage the cluster.
- **Helm**: To install and manage Helm charts for Prometheus and Grafana.

You should also have access to the Kubernetes cluster's kubeconfig file (`test-cluster-kubeconfig.yaml`), which will be used for running the make target.

The CSI driver does not expose metrics by default. The next section explains how to opt in by deleting the existing CSI driver installation and reinstalling it with metrics enabled.

---

## Steps to Opt-In for the CSI Driver Metrics

To enable the metrics collection for the Linode CSI driver, follow the steps below. These steps involve exporting a new Helm template with metrics enabled, deleting the current CSI driver release, and applying the newly generated configuration.

### 1. Export the Helm Template for the CSI Driver with Metrics Enabled

First, you need to generate a new Helm template for the Linode CSI driver with the `enable_metrics` flag set to `true`. This ensures that the CSI driver is configured to expose its metrics.

```bash
helm template linode-csi-driver \
--set apiToken="${LINODE_API_TOKEN}" \
--set region="${REGION}" \
--set enable_metrics=true \
helm-chart/csi-driver --namespace kube-system > csi.yaml
```
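
The command above reads `LINODE_API_TOKEN` and `REGION` from the shell environment. If either is unset, the shell passes an empty string to `--set`. A small guard such as the following can catch that before the template is rendered. The function name `require_env` is hypothetical, and this is only a sketch.

```bash
#!/usr/bin/env bash
# Hypothetical guard: fail early if any named shell variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here): require_env LINODE_API_TOKEN REGION && helm template ...
```

The function reports every missing variable rather than stopping at the first one, so a single run shows everything that still needs to be set.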

### 2. Delete the Existing Release of the CSI Driver

Before applying the new configuration, delete the current installation of the Linode CSI driver. The command below removes every resource defined in `csi.yaml` that already exists in the cluster. This step is necessary because the default CSI driver installation does not have metrics enabled, and changes to some components are not handled gracefully without a clean reinstall.

```bash
kubectl delete -f csi.yaml --namespace kube-system
```

### 3. Apply the Newly Generated Template

Once the old CSI driver installation is deleted, you can apply the newly generated template that includes the metrics configuration.

```bash
kubectl apply -f csi.yaml
```

## Steps to Install the Grafana Dashboard

### 1. Build and Set Up the Cluster (Optional)
If you haven’t already set up your Kubernetes cluster with the necessary CSI driver and Prometheus metrics services, you can do so by running the following command:
```bash
make mgmt-and-capl-cluster
```
This command creates a management cluster and a CAPL (Cluster API Provider Linode) cluster, installs the Linode CSI driver, and applies the necessary configurations to expose the CSI metrics.

### 2. Run the Grafana Dashboard Setup
The `grafana-dashboard` make target combines the installation of Prometheus, Grafana, and the dashboard configuration. It ensures that Prometheus is installed and connected to Grafana, and that a pre-configured dashboard is applied. To execute this setup, run:

```bash
make grafana-dashboard
```

#### What Happens During the Setup?

This target combines three separate make targets:
1. **`install-prometheus`**: Installs Prometheus using a Helm chart in the `monitoring` namespace. Prometheus is configured to scrape metrics from the CSI driver and other services.
2. **`install-grafana`**: Installs Grafana using a Helm chart in the `monitoring` namespace, with Prometheus as its data source.
3. **`setup-dashboard`**: Sets up a pre-configured Grafana dashboard by applying a ConfigMap containing the dashboard JSON (`observability/metrics/dashboard.json`).

### 3. Accessing the Grafana Dashboard

Once the setup is complete, you can access the Grafana dashboard through the configured LoadBalancer service. After the setup script runs, the external IP of the LoadBalancer is printed, and you can access Grafana by opening the following URL in your browser:

```
http://<LoadBalancer-EXTERNAL-IP>
```

Log in using the following credentials:
- Username: `admin`
- Password: `admin`

These credentials can be customized by overriding the `GRAFANA_USERNAME` and `GRAFANA_PASSWORD` variables that the `install-grafana` make target passes to `hack/install-grafana.sh`.
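
The Makefile in this change declares `GRAFANA_USERNAME`, `GRAFANA_PASSWORD`, `GRAFANA_PORT`, and `DATA_RETENTION_PERIOD` with `?=`, so a value given on the `make` command line or in the environment takes precedence over the default. The sketch below demonstrates that behaviour with a throwaway Makefile; the `show` target and the value `ops` are hypothetical stand-ins, and the real targets are not invoked.

```bash
#!/usr/bin/env bash
# Self-contained demo of `?=` precedence using a temporary Makefile.
# `show` is a hypothetical target that only echoes the variable.
demo_dir="$(mktemp -d)"
printf 'GRAFANA_USERNAME ?= admin\nshow:\n\t@echo $(GRAFANA_USERNAME)\n' > "${demo_dir}/Makefile"

# Without an override, the `?=` default is used.
default_user="$(make -s -C "${demo_dir}" show)"
# A command-line assignment replaces the default.
override_user="$(make -s -C "${demo_dir}" show GRAFANA_USERNAME=ops)"

echo "default=${default_user} override=${override_user}"
rm -rf "${demo_dir}"
```

Against the real Makefile, the equivalent invocation would look like `make grafana-dashboard GRAFANA_USERNAME=ops GRAFANA_PASSWORD='<password>' DATA_RETENTION_PERIOD=30d`, where the values shown are placeholders.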

### 4. Stopping the Port Forwarding (if used)

If you are using port forwarding instead of a LoadBalancer, and you wish to stop the forwarding, run:
```bash
kill <PID>
```
Replace `<PID>` with the process ID provided by the script during the setup.

If you do not have access to the script output, run:
```bash
ps -ef | grep 'kubectl port-forward' | grep -v grep
```
This will give you details about the process and also the `PID`.
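
To pull out only the process IDs from that listing, the second column of `ps -ef` output can be extracted with `awk`. This is a minimal sketch; the function name `port_forward_pids` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of every `kubectl port-forward` process. Lines that
# mention awk are skipped so the filter does not match itself.
port_forward_pids() {
  awk '/kubectl port-forward/ && !/awk/ { print $2 }'
}

# Usage (not run here): ps -ef | port_forward_pids
```

Each printed PID can then be passed to `kill` as described above.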

## Customizing the Setup

- **Namespace**: The default namespace for the observability tools is `monitoring`. You can modify this by passing the `--namespace` flag or editing the `install-monitoring-tools.sh` script and changing the `NAMESPACE` variable.

- **Grafana Dashboard Configuration**: The default dashboard configuration is stored in `observability/metrics/dashboard.json`. To apply a different dashboard, replace the contents of this file before running the `make grafana-dashboard` target.

- **Prometheus Data Source**: The default data source is Prometheus, as defined in the Helm chart configuration. If you wish to use a different data source, modify the `helm upgrade` command in `install-monitoring-tools.sh`.

## Makefile Targets

### `install-prometheus`
Installs Prometheus in the `monitoring` namespace using a Helm chart. Prometheus scrapes metrics from the CSI driver and other services in the cluster.

```bash
make install-prometheus
```

### `install-grafana`
Installs Grafana in the `monitoring` namespace using a Helm chart. Prometheus is set as the data source for Grafana.

```bash
make install-grafana
```

### `setup-dashboard`
Sets up the pre-configured Grafana dashboard by applying a ConfigMap containing the dashboard JSON. This ConfigMap is created from the `observability/metrics/dashboard.json` file.

```bash
make setup-dashboard
```

### `grafana-dashboard`
This is a combined target that installs Prometheus, Grafana, and configures the Grafana dashboard. It runs the `install-prometheus`, `install-grafana`, and `setup-dashboard` targets sequentially.

```bash
make grafana-dashboard
```

## Troubleshooting

If you encounter issues during the installation process, check the logs and status of the Prometheus and Grafana pods:
```bash
kubectl get pods -n monitoring
kubectl logs <prometheus-pod-name> -n monitoring
kubectl logs <grafana-pod-name> -n monitoring
```
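
When the namespace contains many pods, it can help to list only those that are not in the `Running` state. The following sketch filters the output of `kubectl get pods --no-headers`, whose third column is the pod status; the function name `pods_not_running` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name and status of every pod whose STATUS is not "Running".
pods_not_running() {
  awk 'NF >= 3 && $3 != "Running" { print $1, $3 }'
}

# Usage (not run here): kubectl get pods -n monitoring --no-headers | pods_not_running
```

The pod names it prints can be used in place of `<prometheus-pod-name>` or `<grafana-pod-name>` in the `kubectl logs` commands above.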

This setup provides a quick and easy way to enable observability using Grafana dashboards, ensuring that you have visibility into your Kubernetes cluster and CSI driver operations.
