(fleet/prometheus rules) GitOps prom rules #346
base: master
Conversation
Not completely sure about that markdownlint error; running it locally it seems OK. @jhoblitt any ideas as to what triggers it there?
The "WIP" commit message shouldn't be merged to master.
Force-pushed from 145186f to 57a2840
Force-pushed from b28503a to d265dc6
LGTM.
Force-pushed from 62bb6da to 0c7253d
Force-pushed from dc85ec3 to b4eb883
Force-pushed from b4eb883 to 6724f03
I've rebased the current working branch and cleaned up some of the strange commits, @jhoblitt. This branch should be considered stable and final now. Any other changes should be merged from other branches once this one is in.
Force-pushed from 6724f03 to 54e167f
## Prometheus rule AURA standards

* `summary` annotation: The `summary` annotation is used to describe a
  group of incoming alerts. This annotation DOES NOT contain any templated
Why shouldn't templated values be in the summary?
I've written this quite generally, for all levels of Prometheus alerting experience. The tricky part is that you want the summary to be more or less the same for all the alerts in the same grouping. The risk of simply allowing templated variables in the summary is that each summary will be unique, making alert grouping less useful IMHO.
It's not really a technical limitation, more a matter of the "alert standard".
I am having trouble coming up with a scenario where the summary would need to be used for grouping instead of the alertname. I think that having interpolated values in the summary is useful information. Alertmanager's grouping is rather primitive, and in most cases we don't want any grouping unless there is a high alert volume, which Alertmanager isn't capable of doing. I suspect we will end up shipping everything to squadcast ungrouped anyways.
I'll adjust the config and standard to that use case then (no grouping unless specified). In that use case the summary really can be templated as wanted.
In what I described, the summary wouldn't really be used to group. Say you group on `alertname,cluster`. The summary could then be used to provide short bursts of information about the alert, without overloading the alert message with specific details if 25 pods are crashing in a single cluster. The only "useful" template value then used in the summary would be `cluster`, really, so it's sometimes easier to avoid them completely.
However, if grouping is the exception rather than the norm (as would be the case if alerts are sent to subsequent systems), the summary can be formatted however you like. Although in that case I'd argue that the difference between a `summary` and a `description` is negligible and you might as well just use the description.
@@ -16,3 +16,7 @@ spec:
      remoteRef:
        key: squadcast prometheus service
        property: credential
    - secretKey: kafka
      remoteRef:
        key: alertmanager-kafka-credentials
What uses this?
I still don't see any use of this credential in this PR. Is it intentionally part of this PR?
@@ -0,0 +1,169 @@
groups:
  - name: "ceph.rules"
At a glance, these rules are different from what is in the `prometheus-ceph-rules` PrometheusRules that's currently being installed by the `rook-ceph-cluster` chart. Is the intent to disable what's being provided by the chart? It looks like the chart rules cover more functionality. What is in these rules that wasn't covered by what comes in the chart?
See also: #346 (comment)
My general goal/intention is to have the charts still deploy the alerts to the cluster, so they become visible in the Prometheus rule interface (on http://prometheus/alerts) and in Alertmanager (on http://alertmanager), but not route the alerts. A lot of helm charts generally only allow for "all or nothing" label modifications for alert routing.
Having them present in the web interfaces allows us to check which alerts we care about and are actually doing stuff, after which we can duplicate them into this fleet bundle to deploy and modify them to match our own setup.
The modification is sometimes needed because the helm chart rules are unaware of the deployed architecture of Prometheus/Mimir. In our case, the `prom_cluster` external label, for example.
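A hedged sketch of the "deploy but don't route" idea described above (the label name and receiver names are hypothetical, not taken from this PR): chart-provided alerts stay visible in Prometheus and Alertmanager, but only alerts carrying an explicit opt-in label are forwarded.

```yaml
# Illustrative sketch only -- the "route" label and receivers are assumptions.
route:
  receiver: "null"            # default: alerts remain visible but are not delivered
  routes:
    - receiver: squadcast
      matchers:
        - route="squadcast"   # only opted-in alerts are forwarded

receivers:
  - name: "null"              # receiver with no integrations drops the notification
  - name: squadcast
```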
I'm not following w.r.t. the `prom_cluster` label. Under the current architecture of Prometheus evaluating rules on each k8s cluster, the `prom_cluster` label does not exist at the time of evaluation, as it's an external label applied as part of the remote write to Mimir. Is this in preparation to switch to using the Mimir ruler and evaluating rules centrally?
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#templating suggests that `$externalLabels`, separate from `$labels`, is accessible during rule evaluation. I've been unable to find an example of it being used.
Force-pushed from 6d65fc9 to a49f42d
@@ -121,6 +119,10 @@ coreDns:
  serviceMonitor:
    additionalLabels:
      lsst.io/monitor: "true"
kubeDns:
This was removed because we don't use kubeDns. #445
          Nodes {{ $labels.instance }} disk is currently almost full at {{ $value }}. It will fill up within 6 hours.
        expr: |
          (
            node_filesystem_avail_bytes{fstype!="",job="node-exporter"}
Matching on the job label means this rule won't apply to puppetdb sd nodes, which are around 2/3rds of our fleet. I think we don't care which job created the metric.
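A hedged sketch of what dropping the job matcher could look like. The threshold, prediction window, and rule name are illustrative assumptions, not copied from this PR; the point is only that the selector keeps `fstype!=""` and omits `job=`:

```yaml
# Illustrative rule; values are assumptions, only the selector change matters.
- alert: NodeFilesystemAlmostFull
  expr: |
    (
      node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} < 0.10
    and
      predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 6 * 3600) < 0
    )
  for: 30m
  annotations:
    summary: Filesystem is predicted to fill up within 6 hours.
```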
This allows for prometheus rules to be automagically applied through rancher fleet. Simply adding some rules to the fleet directory will deploy them into the cluster, ready for the prometheus operator to pick them up and deploy. Documentation to be found in the `docs/alerts` directory.

(fleet/lib/prometheusrules) add values file
(fleet/prometheus rules) depends on prometheus-crds
Force-pushed from c817a64 to 62d8d60
I have split the prometheus alerts chart out as #498
Use the correct key for the secret instead of the older one.
It seems that one of the ceph queries isn't working. I don't understand why `.*` isn't working as a wildcard match.
        expr: |
          (ceph_pool_stored/ceph_pool_quota_bytes > 0.9 and ceph_pool_quota_bytes != 0)*100
      - alert: CephTargetDown
        expr: up{job=".*ceph.*"} == 0
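An observation not stated in the thread: in PromQL, `=` is an exact string matcher, so `.*` is treated as a literal value; regex matching uses the `=~` operator. A hedged sketch of what the corrected selector would look like:

```yaml
# Sketch only: regex label matching needs =~ rather than =.
      - alert: CephTargetDown
        expr: up{job=~".*ceph.*"} == 0
```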
People can add prometheus rule files to the `/rules` directory and these will be deployed into the correct cluster/site.

DISCUSS: It could be that we want overrides per cluster to be able to tune and specify alerts for various parts.
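As a rough illustration of the workflow described above, a minimal sketch of a rule file that could be dropped into the rules directory. The file name, metadata, selector label, and rule contents are hypothetical assumptions, not taken from this PR:

```yaml
# Hypothetical example, e.g. rules/node.yaml -- names and the selector label
# are illustrative, not copied from this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-rules
  labels:
    role: alert-rules        # assumed to match the operator's ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeExporterDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          annotations:
            summary: A node exporter target is down.
            description: "{{ $labels.instance }} has been unreachable for 5 minutes."
```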