The following was deployed on a Minikube VM with 4 CPUs and 8 GB of RAM.
Create the `monitoring`, `minio` and `decoding-sdk` namespaces:
kubectl apply -f namespace.yaml
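The contents of namespace.yaml are not reproduced in this guide; a minimal sketch of what it presumably contains, assuming one Namespace object per namespace listed above:

```yaml
# Sketch of namespace.yaml (assumed contents): one Namespace object per namespace used below
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: Namespace
metadata:
  name: minio
---
apiVersion: v1
kind: Namespace
metadata:
  name: decoding-sdk
```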
Switch to the `monitoring` namespace:
kubectl config set-context --current --namespace=monitoring
Create Prometheus Operator Custom Resource Definitions (CRDs):
kubectl create -f prometheus-operator-crds
Apply Prometheus Operator folder:
kubectl apply -R -f prometheus-operator
When Prometheus Operator is up, apply Prometheus folder:
kubectl apply -f prometheus
Port forward Prometheus Operator service:
kubectl port-forward svc/prometheus-operated 9090
Visit `localhost:9090` and navigate to Status > Targets. You should see 1 active Service Monitor target with 2 endpoints as shown below:
*Note: We see two endpoints here because the `prometheus/prometheus.yaml` file currently specifies 2 replicas.
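For reference, the relevant part of `prometheus/prometheus.yaml` presumably looks something like the sketch below; the resource name and selector are assumptions, only `replicas: 2` is implied by the note above:

```yaml
# Sketch of the Prometheus custom resource in prometheus/prometheus.yaml (names are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}   # match all ServiceMonitors in this namespace
```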
If the targets cannot be seen, please redo the Prometheus Operator steps and make sure that the K8s objects created when applying the `prometheus-operator` folder are up before applying the `prometheus` folder.
Switch to the `minio` namespace:
kubectl config set-context --current --namespace=minio
Apply MinIO folder:
kubectl apply -f minio
Visit MinIO on `cluster-ip:30001` and log in with username `minioadmin` and password `minioadmin`.
*Note: The username and password are configured in `minio/secrets.yaml`.
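A sketch of what `minio/secrets.yaml` likely contains; the Secret name and key names are assumptions based on MinIO's standard `MINIO_ROOT_USER`/`MINIO_ROOT_PASSWORD` variables and may differ from the actual file:

```yaml
# Sketch of minio/secrets.yaml (assumed): the root credentials MinIO reads at startup
apiVersion: v1
kind: Secret
metadata:
  name: minio-secrets
  namespace: minio
type: Opaque
stringData:
  MINIO_ROOT_USER: minioadmin
  MINIO_ROOT_PASSWORD: minioadmin
```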
Go to Access Key > Create access key and click on Create. A one-time popup showing the Access Key and Secret Key will appear. Please copy the keys somewhere for now.
Go to Buckets, enter the Bucket Name as `prometheus-metrics` and click on Create Bucket:
Update the placeholders in `thanos/store-gateway/objectstore.yaml` with your Access Key and Secret Key.
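This file presumably follows Thanos' S3-compatible object storage configuration format; a sketch of what it might look like is shown below. The MinIO endpoint hostname is an assumption, so adjust it to your MinIO service name and namespace:

```yaml
# Sketch of a Thanos object store config pointing at the in-cluster MinIO service
type: S3
config:
  bucket: prometheus-metrics
  endpoint: minio.minio.svc.cluster.local:9000
  access_key: <your-access-key>
  secret_key: <your-secret-key>
  insecure: true   # MinIO is served over plain HTTP inside the cluster
```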
Visit the Prometheus Operator on `localhost:9090` and there should be a new MinIO Service Monitor target added as shown below:
Switch to the `monitoring` namespace:
kubectl config set-context --current --namespace=monitoring
Apply Thanos folder:
kubectl apply -R -f thanos
Run `kubectl get all` and wait until all the objects are up. The `monitoring` namespace should look something like this:
*Note: If the storegateway pod is failing to start, make sure the bucket name in `thanos/store-gateway/objectstore.yaml` matches the bucket name that was created in MinIO. Also make sure that the access and secret keys match the ones you created earlier. If you forgot to save the access and secret keys earlier, just create a new pair and update `thanos/store-gateway/objectstore.yaml` accordingly.
Port forward Thanos Querier service:
kubectl port-forward svc/querier 9090
OR
# If your Prometheus Operator is still running on 9090, port forward to a separate port instead
kubectl port-forward svc/querier 9091:9090
Visit `localhost:9090` (or `localhost:9091` if you used the alternate port) and navigate to Stores. You should see 1 Thanos Receiver and 1 Thanos Store Gateway that are both up:
Navigate to Graph and try to query for the `prometheus_http_requests_total` metric:
If this works it means that the Thanos Querier is successfully retrieving metrics from the Thanos Receiver!
*Note: The Thanos Receiver by default uploads to the MinIO bucket every 2 hours. To test whether the Thanos Querier is successfully retrieving metrics from the Thanos Store Gateway, just check whether you are able to query for metrics older than 2 hours.
If you have already deployed the decoding sdk server and worker, you can skip this step. Otherwise, you can deploy them by running:
kubectl apply -f decoding-sdk
The server and worker will be deployed in the `decoding-sdk` namespace.
If you are on Minikube, you will need to mount the models folder locally by running:
# Replace your-models-folder-path accordingly
minikube mount your-models-folder-path:/opt/models
*Note: The `decoding-sdk/pv.yaml` file's `hostPath` path value is set to `/opt/models`. If the models are in a separate directory, please change the path value accordingly.
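For reference, a minimal sketch of what `decoding-sdk/pv.yaml` presumably looks like; the volume name, capacity and access mode are assumptions, only the `hostPath` path is stated above:

```yaml
# Sketch of a hostPath PersistentVolume exposing the mounted models directory
apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadOnlyMany
  hostPath:
    path: /opt/models
```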
The Log Parser scrapes logs from the decoding sdk server and worker pods. It implements custom logic to parse the logs and export custom Prometheus metrics.
Since the Log Parser is deployed in the `monitoring` namespace and needs to scrape pod logs in the `decoding-sdk` namespace, it needs a service account with the necessary cluster role permissions.
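A sketch of the RBAC objects this implies; all names here are assumptions, and the actual manifests live in the `log-parser` folder:

```yaml
# Sketch: ClusterRole allowing pod log reads, bound to the log parser's ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-log-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: log-parser-pod-log-reader
subjects:
  - kind: ServiceAccount
    name: log-parser
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: pod-log-reader
  apiGroup: rbac.authorization.k8s.io
```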
Apply the Log Parser folder:
kubectl apply -R -f log-parser
To test if the Log Parser is exporting metrics successfully, we need to send some dummy requests to the decoding sdk server
Create and activate virtual Python env (optional but recommended):
python -m venv venv
source venv/bin/activate
Install the Python dependencies and send a test audio file:
pip install -r requirements.txt
# Replace cluster-ip with your K8s cluster's ip address
python client_sdk_v2.py -u ws://cluster-ip:30080/abx/ws/speech -m Abax_English_ASR_0822 audio-files/countries.wav
If you see the text `countries` appended to the `test.log` file, it means that the request was successful.
Port forward Log Parser service:
kubectl port-forward svc/log-parser-service 8080
Visit the Log Parser on `localhost:8080`, scroll down and you should see the metrics being populated with values:
To set up Loki, we will leverage the `grafana/loki-stack` Helm chart to simplify the deployment process. We will be supplying the Helm chart with a custom `values.yaml` file which only enables Loki and Promtail.
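A minimal sketch of what `loki/values.yaml` might contain, assuming only Loki and Promtail are enabled and the chart's optional components are switched off:

```yaml
# Sketch of loki/values.yaml (assumed): enable only Loki and Promtail
loki:
  enabled: true
promtail:
  enabled: true
grafana:
  enabled: false
prometheus:
  enabled: false
```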
Add Grafana helm repo:
helm repo add grafana https://grafana.github.io/helm-charts
Install grafana/loki-stack helm chart with custom values:
helm install --namespace=monitoring --values loki/values.yaml loki grafana/loki-stack
We can create custom labels from the logs that Promtail scrapes; in this case we want the `status` field from the response object as a custom label.
To do that we need to delete the existing Promtail config secret and provide our custom Promtail config secret:
kubectl delete secrets loki-promtail
kubectl create secret generic loki-promtail --from-file=./loki/promtail.yaml
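The actual `loki/promtail.yaml` is not reproduced here. As an illustration only, the pipeline stages that promote a `status` field to a Loki label might look like the snippet below, assuming the decoding sdk logs the response object as JSON with a top-level `status` field; the job name and discovery settings are placeholders, and a full Promtail config also needs its `server`, `clients` and `positions` sections:

```yaml
# Illustrative Promtail pipeline: parse the JSON log line and expose "status" as a label
scrape_configs:
  - job_name: decoding-sdk
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            status: status
      - labels:
          status:
```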
Reload Promtail by deleting the Promtail pod:
# Your pod name suffix would probably be different
kubectl delete pod/loki-promtail-brb97
Ensure that both the Loki & Promtail pods are up and running:
Create the Grafana folder needed by the Persistent Volume:
# Cloud
mkdir /tmp/grafana-pv
# Minikube
# Replace your-grafana-folder-path accordingly
minikube mount your-grafana-folder-path:/tmp/grafana-pv
*Note: If you want to use a different path, update the `hostPath` path value in `grafana/pv.yaml`.
Apply Grafana folder:
kubectl apply -f grafana
Port forward Grafana service:
kubectl port-forward svc/grafana 3000
Visit Grafana on `localhost:3000` and log in with username `admin` and password `admin`.
Adding Thanos Querier data source:
- Name: Thanos Querier
- Prometheus server URL: http://querier.monitoring.svc.cluster.local:9090
Screencast.from.2024-04-28.11-40-52.webm
Adding Loki data source:
- Name: Loki
- URL: http://loki:3100
Screencast.from.2024-04-28.19-52-10.webm
As seen in the video, an error appears after clicking the Save and Test button. This seems to be a bug and can be safely ignored. Just make sure that you can query the decoding sdk server and worker logs with Loki (shown in the second part of the video).
Copy the JSON from `grafana/dashboards/decoding-sdk-dashboard.json` and import the Decoding SDK dashboard:
Screencast.from.2024-04-28.19-45-18.webm
Copy the JSON from `grafana/dashboards/minio-dashboard.json` and import the MinIO dashboard:
[Screencast from 2024-04-28 20-03-14.webm](https://github.com/Niflnir/k8s-cloud-fyp/assets/70419463/c7cfed07-4071-40cf-902c-075efe0b984e)
For the Alertmanager, the setup currently uses Gmail as the alert notifier. If you would like to use different software to notify your alerts, you will have to configure it separately.
Before we apply the Alertmanager folder, we need to fill in the placeholders in `alertmanager/alertmanagerconfig.yaml` (an illustrative sketch follows this list):
- Replace the `<your-gmail>` placeholder with your Gmail address
- Create a Google app password via https://myaccount.google.com/apppasswords and replace the `<your-password>` placeholder with the app password
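As an illustration only, a Gmail receiver in standard Alertmanager configuration syntax typically looks like the sketch below; the actual `alertmanager/alertmanagerconfig.yaml` in this repo may wrap or structure it differently (for example as a Secret or an AlertmanagerConfig custom resource):

```yaml
# Illustrative Gmail receiver in Alertmanager configuration syntax (not the repo's actual file)
route:
  receiver: gmail-notifications
receivers:
  - name: gmail-notifications
    email_configs:
      - to: <your-gmail>
        from: <your-gmail>
        smarthost: smtp.gmail.com:587
        auth_username: <your-gmail>
        auth_identity: <your-gmail>
        auth_password: <your-password>
        send_resolved: true
```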
Apply Alertmanager folder:
kubectl apply -f alertmanager
There are currently 4 rules defined in `prometheus/rules.yaml` that are grouped by severity (see the sketch after the list below).
Severity=Critical group:
- InstanceDown (fires immediately if any instance is down)
Severity=Moderate group:
- HighRequestFailure (more than 10 failed requests in the past 5 minutes)
- HighRequestLatency (50th percentile of latency is more than 1 second)
- HighRealTimeFactor (50th percentile of RTF is more than 1.5)
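As an example of how one of these rules might be expressed, here is a sketch of a PrometheusRule carrying the InstanceDown alert; the resource name and annotation wording are assumptions, and only the alert name, severity grouping and intent come from the list above:

```yaml
# Sketch of a PrometheusRule for the InstanceDown alert in the critical severity group
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-rules
  namespace: monitoring
spec:
  groups:
    - name: critical
      rules:
        - alert: InstanceDown
          expr: up == 0
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} is down"
```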
These rules were applied earlier when applying the Prometheus folder. You can see these rules by visiting Prometheus Operator and navigating to Status > Rules:
When a rule is triggered, it will fire an alert and you will receive an email containing the rules that were triggered:
Thanks for reading!!