ntuspeechlab/k8s-cloud-fyp

Setup

The following components were deployed within a Minikube VM with 4 CPUs and 8GB of RAM, in this order:

  1. Namespace
  2. Prometheus Operator
  3. MinIO
  4. Thanos
  5. Decoding SDK
  6. Log Parser
  7. Loki
  8. Grafana
  9. Alertmanager

Namespace

Create monitoring, minio and decoding-sdk namespaces:

kubectl apply -f namespace.yaml
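
*Note: The contents of namespace.yaml are not shown here; assuming it simply defines the three namespaces, it would look something like this:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: Namespace
metadata:
  name: minio
---
apiVersion: v1
kind: Namespace
metadata:
  name: decoding-sdk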

Prometheus Operator

Switch to monitoring namespace:

kubectl config set-context --current --namespace=monitoring

Create Prometheus Operator Custom Resource Definitions (CRDs):

kubectl create -f prometheus-operator-crds

Apply Prometheus Operator folder:

kubectl apply -R -f prometheus-operator

When Prometheus Operator is up, apply Prometheus folder:

kubectl apply -f prometheus

Checkpoint 1

Port forward Prometheus Operator service:

kubectl port-forward svc/prometheus-operated 9090

Visit localhost:9090 and navigate to Status > Targets

You should see 1 active Service Monitor target with 2 endpoints as shown below:

[screenshot: Prometheus targets page]

*Note: We see two endpoints here because the prometheus/prometheus.yaml file currently specifies 2 replicas
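
The replica count comes from the Prometheus custom resource in that file; the relevant part would look roughly like this (a sketch, not the full prometheus/prometheus.yaml):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus   # hypothetical name
spec:
  replicas: 2        # two Prometheus pods -> two endpoints on the Targets page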

If the targets cannot be seen, redo the Prometheus Operator steps and make sure that the K8s objects created by the prometheus-operator folder are up before applying the prometheus folder

MinIO

Switch to minio namespace:

kubectl config set-context --current --namespace=minio

Apply MinIO folder:

kubectl apply -f minio

Visit MinIO on cluster-ip:30001 and log in with the username minioadmin and password minioadmin

*Note: The username and password are configured in minio/secrets.yaml
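
Assuming minio/secrets.yaml sets the standard MINIO_ROOT_USER and MINIO_ROOT_PASSWORD variables, the relevant part would look roughly like this (the secret name here is an assumption, and the actual file may differ):

apiVersion: v1
kind: Secret
metadata:
  name: minio-secrets   # hypothetical name
type: Opaque
stringData:
  MINIO_ROOT_USER: minioadmin
  MINIO_ROOT_PASSWORD: minioadmin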

Go to Access Key > Create access key and click on Create

[screenshot: MinIO Create access key page]

A one-time popup showing the Access Key and Secret Key will appear. Please copy the keys somewhere for now

Go to Buckets, enter the Bucket Name as prometheus-metrics and click on Create Bucket:

[screenshot: MinIO Create Bucket page]

Update the placeholders in thanos/store-gateway/objectstore.yaml with your Access Key and Secret Key
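
The Thanos object-store configuration typically follows this S3 format; a sketch of the filled-in file (the endpoint shown is an assumption about the MinIO service name and port):

type: S3
config:
  bucket: prometheus-metrics
  endpoint: minio.minio.svc.cluster.local:9000   # assumed MinIO service address
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  insecure: true   # assuming MinIO is served over plain HTTP in-cluster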

Checkpoint 2

Visit Prometheus Operator on localhost:9090 and there should be a new MinIO Service Monitor target added as shown below:

[screenshot: MinIO Service Monitor target]

Thanos

Switch to monitoring namespace:

kubectl config set-context --current --namespace=monitoring

Apply Thanos folder:

kubectl apply -R -f thanos

Run kubectl get all and wait till all the objects are up. The monitoring namespace should look something like this:

[screenshot: monitoring namespace objects]

*Note: If the storegateway pod is failing to start, make sure the bucket name in thanos/store-gateway/objectstore.yaml matches the bucket name that was created in MinIO. Also make sure that the access and secret keys match the ones you created earlier. If you did not save the access and secret keys, just create a new pair and update thanos/store-gateway/objectstore.yaml accordingly

Checkpoint 3

Port forward Thanos Querier service:

kubectl port-forward svc/querier 9090
# OR
# If your Prometheus Operator is still running on 9090, port forward to a separate port instead
kubectl port-forward svc/querier 9091:9090

Visit localhost:9090 and navigate to Stores

You should see 1 Thanos Receiver and 1 Thanos Store Gateway that are both up:

[screenshot: Thanos Stores page]

Navigate to Graph and try to query for prometheus_http_requests_total metric:

[screenshot: prometheus_http_requests_total query]

If this works it means that the Thanos Querier is successfully retrieving metrics from the Thanos Receiver!

*Note: Thanos Receiver by default uploads to the MinIO bucket every 2 hours. To test whether the Thanos Querier is successfully retrieving metrics from the Thanos Store Gateway, just observe whether you are able to query for metrics older than 2 hours
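
For example, running the query below in the Querier should only return data once blocks older than that window are being served from the Store Gateway (any metric works, this one is just illustrative):

prometheus_http_requests_total offset 3h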

Decoding SDK

If you have already deployed the decoding sdk server and worker, you can skip this step

Else, you can deploy them by running:

kubectl apply -f decoding-sdk

The server and worker will be deployed in the decoding-sdk namespace

If you are on Minikube, you will need to mount the models folder locally by running:

# Replace your-models-folder-path accordingly
minikube mount your-models-folder-path:/opt/models

*Note: The decoding-sdk/pv.yaml file hostPath path value is set to /opt/models. If the models are in a separate directory, please change the path value accordingly.
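
For reference, the relevant part of decoding-sdk/pv.yaml would look roughly like this (the name, size and access mode here are assumptions):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv        # hypothetical name
spec:
  capacity:
    storage: 5Gi         # assumed size
  accessModes:
    - ReadOnlyMany       # assumed access mode
  hostPath:
    path: /opt/models    # must match the minikube mount target above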

Log Parser

The Log Parser scrapes logs from the decoding sdk server and worker pods. It implements custom logic to parse the logs and export custom Prometheus metrics

Since the log parser is deployed in the monitoring namespace and needs to scrape pod logs in the decoding-sdk namespace, it needs a service account with the necessary cluster role permissions
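
Those permissions boil down to listing pods and reading their logs across namespaces; a sketch of the kind of RBAC objects the log-parser folder would contain (all names here are assumptions):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: log-parser-role            # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: log-parser-binding         # hypothetical name
subjects:
  - kind: ServiceAccount
    name: log-parser               # hypothetical service account
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: log-parser-role
  apiGroup: rbac.authorization.k8s.io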

Apply the Log Parser folder:

kubectl apply -R -f log-parser

Checkpoint 4

To test if the Log Parser is exporting metrics successfully, we need to send some dummy requests to the decoding sdk server

Create and activate a Python virtual environment (optional but recommended):

python -m venv venv
source venv/bin/activate

Install the Python dependencies and send an audio file:

pip install -r requirements.txt
# Replace cluster-ip with your K8s cluster's ip address
python client_sdk_v2.py -u ws://cluster-ip:30080/abx/ws/speech -m Abax_English_ASR_0822 audio-files/countries.wav

If you see the text countries appended to the test.log file, it means that the request was successful

Port forward Log Parser service:

kubectl port-forward svc/log-parser-service 8080

Visit the Log Parser on localhost:8080, scroll down and you should see the metrics being populated with values:

[screenshot: Log Parser metrics]

Loki

To set up Loki, we will leverage the grafana/loki-stack Helm chart to simplify the deployment process. We will supply the chart with a custom values.yaml file that only enables Loki and Promtail

Add Grafana helm repo:

helm repo add grafana https://grafana.github.io/helm-charts

Install grafana/loki-stack helm chart with custom values:

helm install --namespace=monitoring --values loki/values.yaml loki grafana/loki-stack
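
Assuming loki/values.yaml just toggles the loki-stack components as described above, it would look something like this (the actual file may set more options):

loki:
  enabled: true
promtail:
  enabled: true
grafana:
  enabled: false
prometheus:
  enabled: false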

We can create custom labels from the logs that Promtail scrapes; in this case we want the status field from the response object as a custom label

To do that we need to delete the existing Promtail config secret and provide our custom Promtail config secret:

kubectl delete secrets loki-promtail
kubectl create secret generic loki-promtail --from-file=./loki/promtail.yaml
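
Inside loki/promtail.yaml, extracting status as a label is typically done with a pipeline stage along these lines (a sketch; it assumes the relevant log lines are JSON with a status field, and if the JSON is embedded in a longer line a regex stage would be needed first):

pipeline_stages:
  - json:
      expressions:
        status: status   # pull "status" out of the JSON log line
  - labels:
      status:            # promote it to a Loki label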

Reload Promtail by deleting the Promtail pod:

# Your pod name suffix would probably be different
kubectl delete pod/loki-promtail-brb97

Checkpoint 5

Ensure that both the Loki & Promtail pods are up and running:

[screenshot: Loki and Promtail pods]

Grafana

Create the local folder needed by the Grafana Persistent Volume:

# Cloud
mkdir /tmp/grafana-pv

# Minikube
# Replace your-grafana-folder-path accordingly
minikube mount your-grafana-folder-path:/tmp/grafana-pv

*Note: If you want to use a different path, update hostPath path value in grafana/pv.yaml

Apply Grafana folder:

kubectl apply -f grafana

Port forward Grafana service:

kubectl port-forward svc/grafana 3000

Visit Grafana on localhost:3000 and log in with the username admin and password admin

Adding Thanos Querier data source:

Screencast.from.2024-04-28.11-40-52.webm

Adding Loki data source:

Screencast.from.2024-04-28.19-52-10.webm

As seen in the video, an error appears after clicking the Save & Test button. This seems to be a bug and can be ignored. Just make sure that you can query the decoding sdk server and worker logs with Loki (shown in the second part of the video)
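
For reference, since Grafana runs in the same monitoring namespace, the data source URLs would typically be the in-cluster service addresses (the Loki port below is the chart default and an assumption here):

# Thanos Querier data source (added as a Prometheus-type data source)
http://querier:9090
# Loki data source
http://loki:3100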

Copy JSON from grafana/dashboards/decoding-sdk-dashboard.json and import Decoding SDK dashboard:

Screencast.from.2024-04-28.19-45-18.webm

Copy JSON from grafana/dashboards/minio-dashboard.json and import MinIO dashboard:

[Screencast from 2024-04-28 20-03-14.webm](https://github.com/Niflnir/k8s-cloud-fyp/assets/70419463/c7cfed07-4071-40cf-902c-075efe0b984e)

Alertmanager

For Alertmanager, the setup currently uses Gmail as the alert notifier. If you would like to use a different notification channel, you will have to configure it separately.

Before we apply the Alertmanager folder, we need to fill in the placeholders in alertmanager/alertmanagerconfig.yaml (a sketch of the relevant fields follows the list):

  1. Replace the <your-gmail> placeholder with your Gmail address
  2. Create a Google app password via https://myaccount.google.com/apppasswords and replace the <your-password> placeholder with the app password
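
The fields being filled in correspond to a Gmail email receiver roughly like this (sketched here in plain Alertmanager configuration syntax; the actual alertmanager/alertmanagerconfig.yaml may structure this differently, e.g. as an AlertmanagerConfig resource that references the password from a Secret):

receivers:
  - name: email
    email_configs:
      - to: <your-gmail>
        from: <your-gmail>
        smarthost: smtp.gmail.com:587
        auth_username: <your-gmail>
        auth_password: <your-password>   # the Google app password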

Apply Alertmanager folder:

kubectl apply -f alertmanager

There are currently 4 rules defined in prometheus/rules.yaml that are grouped by severity (one of them is sketched below the lists).

Severity=Critical group:

  • InstanceDown (Immediately fire if any instance is down)

Severity=Moderate group:

  • HighRequestFailure (More than 10 failed requests in the past 5 minutes)
  • HighRequestLatency (50th Percentile of latency is more than 1 second)
  • HighRealTimeFactor (50th Percentile of RTF is more than 1.5)
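
As an illustration, the InstanceDown rule would look something like this in PrometheusRule form (a sketch; the expressions and labels in the actual prometheus/rules.yaml may differ):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alert-rules                # hypothetical name
spec:
  groups:
    - name: critical
      rules:
        - alert: InstanceDown
          expr: up == 0
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} is down"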

These rules were applied earlier when applying the Prometheus folder. You can see these rules by visiting Prometheus Operator and navigating to Status > Rules:

[screenshot: Prometheus rules page]

When a rule's condition is met, it will fire an alert and you will receive an email listing the rules that were triggered:

[screenshot: alert email]

Thanks for reading!! ☺️ ❤️
