Commit
Document the new ci-monitoring stack (#34823)
* Document the new ci-monitoring stack

* Update clusters/app.ci/ci-grafana/README.md

Co-authored-by: Bruno Barcarol Guimarães <[email protected]>

* Update clusters/app.ci/openshift-user-workload-monitoring/README.md

Co-authored-by: Bruno Barcarol Guimarães <[email protected]>

Co-authored-by: Bruno Barcarol Guimarães <[email protected]>
hongkailiu and bbguimaraes authored Dec 15, 2022
1 parent eced59f commit b21c662
Showing 2 changed files with 41 additions and 0 deletions.
24 changes: 24 additions & 0 deletions clusters/app.ci/ci-grafana/README.md
# CI-Grafana

This folder contains the manifests for Grafana managed by [grafana-operator](https://github.com/grafana-operator/grafana-operator).
The grafana-operator is installed via [Operator Hub](https://console-openshift-console.apps.ci.l2s4.p1.openshiftapps.com/operatorhub) into `namespace/ci-grafana` and managed by operator-lifecycle-manager (OLM).

## Dashboards

The dashboards for Grafana are generated from [mixins](../openshift-user-workload-monitoring/mixins) with the command:

> make -C clusters/app.ci/openshift-user-workload-monitoring/mixins all

The generated dashboards are stored in [mixins/grafana_dashboards_out](../openshift-user-workload-monitoring/mixins/grafana_dashboards_out).
The `jsonnet` objects are kept there as well, since CI validation is simpler when all the mixin-generated manifests stay together.
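
For orientation, each generated dashboard is wrapped in a `GrafanaDashboard` custom resource that the operator reconciles into the Grafana instance. A minimal sketch (the name, labels, and dashboard body below are illustrative, and the API group shown is the one used by grafana-operator 4.x):

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-dashboard        # illustrative name
  namespace: ci-grafana
  labels:
    app: grafana                 # must match the Grafana CR's dashboardLabelSelector
spec:
  json: |
    {
      "title": "Example",
      "panels": []
    }
```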

## Staging

We no longer have a staging Grafana instance for developing dashboards.
With grafana-operator, we can apply a generated dashboard to the production instance to preview it.

The current grafana-operator (4.8.0 at the time of writing) manages only one Grafana instance.
To run a staging instance in another namespace, we would have to [deploy everything all over again](https://kubernetes.slack.com/archives/C019A1KTYKC/p1670534010925499).
This limitation is expected to be fixed in version 5+, at which point we can bring up a staging instance if needed.

17 changes: 17 additions & 0 deletions clusters/app.ci/openshift-user-workload-monitoring/README.md
# Metrics and Alerts

This folder contains the manifests for user-workload-monitoring (UWM) based on [monitoring user-defined project on OSD cluster](https://docs.openshift.com/dedicated/osd_cluster_admin/osd_monitoring/osd-understanding-the-monitoring-stack.html) and managed by cluster-monitoring-operator (CMO).

The ServiceMonitors and PodMonitors define the scraping targets for Prometheus. The cluster console provides a UI for running queries against the metrics from those targets.
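
As a sketch of the shape, a `ServiceMonitor` selects Services by label and tells Prometheus which port to scrape (all names below are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-component        # illustrative name
  namespace: ci
spec:
  selector:
    matchLabels:
      app: example-component     # must match the Service's labels
  endpoints:
  - port: metrics                # name of the Service port exposing /metrics
    interval: 30s
```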

The alerts are generated by [mixins](mixins/) with the following command:

> make -C clusters/app.ci/openshift-user-workload-monitoring/mixins all

The generated manifests are stored in [mixins/prometheus_out](mixins/prometheus_out).
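
The generated alerts take the form of `PrometheusRule` manifests. A minimal hand-written sketch of that shape (the rule name, expression, and labels are illustrative, not taken from the generated output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts           # illustrative name
  namespace: ci
spec:
  groups:
  - name: example
    rules:
    - alert: ExampleTargetDown
      expr: up{job="example-component"} == 0   # illustrative expression
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: example-component target is down
```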

The list of required tools is [here](../supplemental-ci-images/validation-images/dashboards/dashboards-validation.yaml).

## Add an alert on Prow jobs

The metrics and alerts defined here are for the TP team. CI users have [a more convenient way](https://docs.ci.openshift.org/docs/how-tos/notification/) to get Slack notifications, e.g., on Prow job failures.
