
(fleet/prometheus rules) GitOps prom rules #346

Open
wants to merge 16 commits into master from IT-5303-gitops-prom-rules
Conversation

@fbegyn (Contributor) commented May 3, 2024

People can add Prometheus rule files to the /rules directory and these will be deployed into the correct cluster/site.

DISCUSS:
It could be that we want per-cluster overrides to be able to tune and tailor alerts to the various parts.
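
For illustration, a minimal sketch of what a rule file dropped into the rules/ directory could look like, using the standard Prometheus rule-file layout; the file name, alert name, and thresholds below are invented and not part of this PR:

```yaml
# rules/example.yaml -- hypothetical example, not taken from this PR.
# Fleet picks up files in this directory and the prometheus operator deploys them.
groups:
  - name: "example.rules"
    rules:
      - alert: ExampleTargetDown
        expr: up{job="example"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: An example scrape target is down.
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
```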

@fbegyn fbegyn self-assigned this May 3, 2024
@fbegyn fbegyn changed the title GitOps prom rules (fleet/prometheus rules) GitOps prom rules May 3, 2024
@fbegyn (Contributor, Author) commented May 3, 2024

Not completely sure about that markdownlint error; running it locally, it seems OK. @jhoblitt any ideas as to what triggers it there?

@jhoblitt (Member) left a comment

The "WIP" commit message shouldn't be merged to master.

@fbegyn fbegyn requested a review from jhoblitt May 16, 2024 17:02
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from 145186f to 57a2840 Compare May 17, 2024 08:37
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from b28503a to d265dc6 Compare May 26, 2024 22:18
@jhoblitt (Member) left a comment

LGTM.

@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from 62bb6da to 0c7253d Compare May 28, 2024 16:01
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from dc85ec3 to b4eb883 Compare June 7, 2024 07:36
@fbegyn fbegyn requested a review from jhoblitt June 7, 2024 07:38
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from b4eb883 to 6724f03 Compare June 7, 2024 07:50
@fbegyn (Contributor, Author) commented Jun 7, 2024

I've rebased the current working branch and cleaned up some of the strange commits, @jhoblitt. This branch should be considered stable and final now. Any other changes should be merged from other branches once this one is in.

@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from 6724f03 to 54e167f Compare June 7, 2024 16:37
## Prometheus rule AURA standards

* `summary` annotation: The `summary` annotation is used to describe a
group of incoming alerts. This annotation DOES NOT contain any templated
Member:

Why shouldn't templated values be in the summary?

Contributor (Author):

I've written this quite broadly, for all levels of Prometheus alerting understanding. The tricky part is that you want the summary to be more or less the same for all the alerts in the same grouping. The risk of simply allowing templated variables in the summary is that each summary becomes unique, making alert grouping less useful IMHO.

Not really a technical limitation, more dependent on the "alert standard".
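
As an illustration of that standard, a sketch of a rule with a static summary and a templated description; the alert name, expression, and threshold are invented for the example:

```yaml
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  labels:
    severity: warning
  annotations:
    # summary stays identical for every firing instance, so grouped alerts share it
    summary: Pods are crash looping in the cluster.
    # per-instance detail goes into the templated description
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in the last 15m."
```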

Member:

I am having trouble coming up with a scenario where the summary would need to be used for grouping instead of the alertname. I think that having interpolated values in the summary is useful information. Alertmanager's grouping is rather primitive; in most cases we don't want any grouping unless there is a high alert volume, which is something Alertmanager can't do conditionally. I suspect we will end up shipping everything to Squadcast ungrouped anyways.

Contributor (Author):

I'll adjust the config and standard to that use case then (no grouping unless specified). In that use case the summary really can be templated as wanted.

In what I described, the summary wouldn't really be used to group. Say you group on alertname,cluster. The summary could then be used to provide short bursts of information about the alert, without overloading the alert message with specific details when 25 pods are crashing in a single cluster. The only "useful" template value to use in the summary would then really be cluster, so it's sometimes easier to avoid them completely.

However, if grouping is the exception rather than the norm (as would be the case if alerts are sent on to downstream systems), the summary can be formatted however you like. Although in that case I'd argue that the difference between a summary and a description is negligible and you might as well just use the description.
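
For reference, the two grouping stances from this thread could look roughly like this in an Alertmanager route; the receiver name is hypothetical, and the special '...' value (supported in recent Alertmanager versions) groups by all labels, which effectively disables grouping:

```yaml
route:
  receiver: squadcast          # hypothetical receiver name
  # grouping as originally described: one notification per alertname/cluster pair
  # group_by: ['alertname', 'cluster']
  # "no grouping unless specified": every distinct label set becomes its own group
  group_by: ['...']
```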

@@ -16,3 +16,7 @@ spec:
remoteRef:
key: squadcast prometheus service
property: credential
- secretKey: kafka
remoteRef:
key: alertmanager-kafka-credentials
Member:

What uses this?

@jhoblitt (Member) commented Jun 10, 2024

I still don't see any use of this credential in this PR. Is it intentionally part of this PR?

fleet/lib/kube-prometheus-stack/aggregator/values.yaml (outdated, resolved)
fleet/lib/kube-prometheus-stack/aggregator/values.yaml (outdated, resolved)
fleet/lib/kube-prometheus-stack/aggregator/values.yaml (outdated, resolved)
fleet/lib/prometheus-alertrules/README.md (outdated, resolved)
@@ -0,0 +1,169 @@
groups:
- name: "ceph.rules"
Member:

At a glance, these rules are different from what is in the prometheus-ceph-rules PrometheusRule that's currently being installed by the rook-ceph-cluster chart. Is the intent to disable what's being provided by the chart? It looks like the chart rules cover more functionality. What is in these rules that wasn't covered by what comes in the chart?

Contributor (Author):

See also: #346 (comment)

My general goal/intention is to have the charts still deploy the alerts to the cluster, so they become visible in the Prometheus rules interface (at http://prometheus/alerts) and in Alertmanager (at http://alertmanager), but not route the alerts. A lot of Helm charts only allow for "all or nothing" label modifications for alert routing.

Having them present in the web interfaces allows us to check which alerts we care about and which are actually useful, after which we can duplicate them into this fleet bundle and modify them to match our own setup.

The modification is sometimes needed because the Helm chart rules are unaware of the deployed architecture of Prometheus/Mimir, for example the prom_cluster external label in our case.
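
One way this "visible but not routed" split could be expressed on the Alertmanager side, assuming the fleet-managed rules carry a distinguishing label; the label name and receivers below are assumptions, not taken from this PR:

```yaml
route:
  receiver: "null"                              # chart-provided alerts are not routed anywhere
  routes:
    - matchers:
        - 'managed_by="fleet-alertrules"'       # hypothetical label added to the fleet-managed rules
      receiver: squadcast
receivers:
  - name: "null"                                # receiver with no integrations, i.e. drop
  - name: squadcast                             # webhook/integration config omitted in this sketch
```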

Member:

I'm not following w.r.t. the prom_cluster label. Under the current architecture of prometheus evaluating rules on each k8s cluster, the prom_cluster label does not exist at the time of evaluation, as it's an external label applied as part of the remote write to mimir. Is this in preparation for switching to the mimir ruler and evaluating rules centrally?

@jhoblitt (Member) commented Jun 11, 2024

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#templating suggests that $externalLabels, separate from $labels, is accessible during rule evaluation. I've been unable to find an example of it being used.
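
An (untested) sketch of how $externalLabels would be referenced from an alert annotation, assuming the external label is named prom_cluster as discussed above; the alert itself is hypothetical and only demonstrates the templating:

```yaml
- alert: ExampleExternalLabelAlert   # hypothetical alert, only to show the templating
  expr: vector(1)
  annotations:
    summary: Example of external label templating.
    description: >-
      Evaluated by the Prometheus whose external prom_cluster
      label is {{ $externalLabels.prom_cluster }}.
```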

fleet/lib/prometheus-alertrules/rules/ceph.yaml (outdated, resolved)
fleet/lib/prometheus-alertrules/fleet.yaml (outdated, resolved)
fleet/lib/prometheus-alertrules/rules/nodes.yaml (outdated, resolved)
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch 3 times, most recently from 6d65fc9 to a49f42d Compare June 10, 2024 09:34
@fbegyn fbegyn requested a review from jhoblitt June 10, 2024 13:14
@@ -121,6 +119,10 @@ coreDns:
serviceMonitor:
additionalLabels:
lsst.io/monitor: "true"
kubeDns:
Member:

This was removed because we don't use kubeDns. #445


Node {{ $labels.instance }} disk is currently almost full at {{ $value }}. It will fill up within 6 hours.
expr: |
(
node_filesystem_avail_bytes{fstype!="",job="node-exporter"}
Member:

Matching on the job label means this rule won't apply to puppetdb sd nodes, which is around two-thirds of our fleet. I think we don't care which job created the metric.
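
A sketch of that suggestion applied to this rule: drop the job matcher so the expression also covers node-exporter metrics discovered via puppetdb. The full expression in the PR is truncated above, so the body below is an illustrative reconstruction of a typical "disk will fill within 6 hours" predicate, not the exact diff:

```yaml
expr: |
  (
    node_filesystem_avail_bytes{fstype!=""}
      / node_filesystem_size_bytes{fstype!=""} < 0.10
  and
    predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 6 * 3600) < 0
  )
```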


This allows for prometheus rules to be automagically applied through rancher
fleet. Simply adding some rules to the fleet directory will deploy them into the
cluster, ready for the prometheus operator to pick them up and deploy.

Documentation to be found in the `docs/alerts` directory.

(fleet/lib/prometheusrules) add values file

(fleet/prometheus rules) depends on prometheus-crds
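
A hypothetical sketch of how that dependency could be expressed in the bundle's fleet.yaml; the namespace and bundle name are assumptions, not copied from this PR:

```yaml
defaultNamespace: monitoring        # assumed target namespace
dependsOn:
  - name: prometheus-crds           # wait for the CRDs bundle before applying the rules
```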
@fbegyn fbegyn force-pushed the IT-5303-gitops-prom-rules branch from c817a64 to 62d8d60 Compare June 13, 2024 15:29
@jhoblitt (Member) commented
I have split the prometheus alerts chart out as #498

@jhoblitt (Member) left a comment

It seems that one of the ceph queries isn't working. I don't understand why .* isn't working as a wildcard match.

expr: |
(ceph_pool_stored/ceph_pool_quota_bytes > 0.9 and ceph_pool_quota_bytes != 0)*100
- alert: CephTargetDown
expr: up{job=".*ceph.*"} == 0
Member:

The wildcard matching does not seem to be working with prometheus on ayekan.

Demonstration that there are up metrics with a label that includes ceph:
[screenshot]

However, the expr in this rule doesn't match anything:

[screenshot]
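
A likely cause, not confirmed in the thread: `=` is an exact-match operator in PromQL, so `job=".*ceph.*"` looks for a job literally named `.*ceph.*`; regex matching needs the `=~` operator. A sketch of the selector with a regex matcher:

```yaml
- alert: CephTargetDown
  expr: up{job=~".*ceph.*"} == 0   # =~ enables regex matching on the job label
```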

Labels: none yet · Projects: none yet · 3 participants