(fleet/prometheus rules) GitOps prom rules #346

Open · wants to merge 16 commits into base: master

30 changes: 30 additions & 0 deletions docs/alerts/README.md
@@ -0,0 +1,30 @@
# Prometheus rules GitOps

Any Prometheus rules file defined in the
[fleet/lib/prometheus-alertrules/rules](../../prometheus-alertrules/rules)
directory will be deployed to the cluster. It's possible to define a default
namespace in the `values.yaml` file with the `rules.namespace` key.
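
For reference, a minimal sketch of what that `values.yaml` entry could look
like (the namespace value here is illustrative, not the chart's actual
default):

```yaml
rules:
  # default namespace used for rules files that do not specify their own
  namespace: lsst-prometheus-alerts
```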

## Adding Prometheus rules

1. Write the Prometheus rules in a YAML file according to the [Prometheus
specification](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/),
as in the sketch below.
1. Add the YAML file to the `/rules` directory.
1. Commit.
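
As a sketch of the steps above, a hypothetical `rules/example.yaml` could look
like this (the alert name, expression, and threshold are placeholders, not
rules shipped by this chart):

```yaml
groups:
  - name: example.rules
    rules:
      - alert: NodeDiskFillingUp
        # fires when the filesystem is predicted to run out of space within 24h
        expr: predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          summary: Disk space full in 24h
          description: |
            Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is
            predicted to run out of space within 24 hours.
```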

## Prometheus rule AURA standards

* `summary` annotation: The `summary` annotation is used to describe a group
of incoming alerts. This annotation DOES NOT contain any templated
Member

Why shouldn't templated values be in the summary?

Contributor Author

I've written this quite broadly, for all levels of Prometheus alerting understanding. The tricky part is that you want the summary to be more or less the same for all the alerts in the same grouping. The risk of simply allowing templated variables in the summary is that each summary will be unique, making alert grouping less useful IMHO.

Not really a technical limitation, more dependent on the "alert standard".

Member

I am having trouble coming up with a scenario where the summary would need to be used for grouping instead of the alertname. I think that having interpolated values in the summary is useful information. Alertmanager's grouping is rather primitive, and in most cases we don't want any grouping unless there is a high alert volume, which is something alertmanager isn't capable of doing. I suspect we will end up shipping everything to squadcast ungrouped anyway.

Contributor Author

I'll adjust the config and standard to that use case then (no grouping unless specified). In that use case the summary can be templated as desired.

In what I described, the summary wouldn't really be used to group. Say you group on alertname,cluster. The summary could then be used to provide short bursts of information about the alert, without overloading the alert message with specific details if 25 pods are crashing in a single cluster. The only "useful" template value then used in the summary would really be cluster, so it's sometimes easier to avoid them completely.

However, if grouping is the exception rather than the norm (as would be the case if alerts are sent to downstream systems), the summary can be formatted however you like. Although in that case I'd argue that the difference between a summary and a description is negligible and you might as well just use the description.

variables and provides a simple, single-sentence summary of what the alert is
about, for example "Disk space full in 24h". When a cluster triggers several
alerts, it can be handy to group them into a single notification; this is
when the `summary` is used.
* `description` annotation: This provides a detailed overview of the alert,
specific to this instance of the alert. It MAY contain templated variables
to enrich the message.
* `receiver` label: The receiver label is used by alertmanager to decide on the
routing of the notification for the alert. It consists of a `,`-separated list
of receivers, prefixed and suffixed with `,` to make regex matching easier in
the alertmanager, for example: `,slack,squadcast,email,`. The receivers are
defined in the alertmanager configuration. A routing sketch follows below.
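
As a sketch of how the `receiver` label is consumed on the alertmanager side
(receiver names are illustrative; the actual routes live in the
kube-prometheus-stack values files):

```yaml
route:
  routes:
    - receiver: slack-test
      matchers:
        # the leading and trailing commas make exact-word matching easy:
        # ",slack,squadcast,email," matches, while a list containing only
        # ",slack-other," does not
        - receivers =~ ".*,slack,.*"
      continue: true
```
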
@@ -5,25 +5,25 @@ metadata:
data:
slack-generic-alert.tmpl: |
{{ define "slack.o11y.generic.text" }}
*Site:* {{ .CommonLabels.site }}
*Site:* {{ .CommonLabels.prom_site }}
*Alert:* {{ .GroupLabels.alertname }}
*Summary:* {{ .CommonAnnotations.summary }}
{{ template "__o11y_alert_list" . }}
{{ template "__o11y_alert_short_list" . }}
{{ end }}
{{ define "slack.o11y.generic.title"}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.prom }}/{{ .GroupLabels.alertname }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.prom_cluster }}/{{ .GroupLabels.alertname }}
{{ end }}
slack-kube-alert.tmpl: |
{{ define "slack.o11y.kube.text" }}
*Alert:* {{ .GroupLabels.alertname }}
*Site:* {{ .CommonLabels.site }}
*Kube cluster:* {{ .CommonLabels.prom }}
*Site:* {{ .CommonLabels.prom_site }}
*Kube cluster:* {{ .CommonLabels.prom_cluster }}
*Namespace:* {{ .GroupLabels.namespace }}
*Summary:* {{ .CommonAnnotations.summary }}
{{ template "__o11y_alert_list" . }}
{{ end }}
{{ define "slack.o11y.kube.title"}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.prom }}/{{ .GroupLabels.namespace }}/{{ .GroupLabels.alertname }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.prom_cluster }}/{{ .GroupLabels.namespace }}/{{ .GroupLabels.alertname }}
{{ end }}
slack-network-alert.tmpl: |
{{ define "slack.o11y.network.text" }}
@@ -36,12 +36,13 @@ data:
{{ template "__o11y_alert_list" . }}
{{ end }}
template-helpers.tmpl: |
{{ define "__o11y_alert_title" }}
{{ end }}
{{ define "__o11y_alert_list" }}
*Alerts:*
=========
{{ range .Alerts -}}
- *Alert:* {{ .Labels.alertname }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Time:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
@@ -51,3 +52,13 @@ data:
{{ end }}
{{ end }}
{{ end }}
{{ define "__o11y_alert_short_list" }}
*Alerts:*
=========
{{ range .Alerts -}}
- *Alert:* {{ .Labels.alertname }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Time:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
{{ end }}
{{ end }}
@@ -18,4 +18,4 @@ spec:
- secretKey: keycloak_url
remoteRef:
key: *item
property: hostname
property: url
21 changes: 21 additions & 0 deletions fleet/lib/kube-prometheus-stack/aggregator/values.yaml
@@ -46,6 +46,27 @@ alertmanager:
- secretName: tls-alertmanager-ingress
hosts:
- alertmanager.${ .ClusterName }.${ .ClusterLabels.site }.lsst.org
config:
global:
resolve_timeout: 5m
inhibit_rules:
- source_matchers:
- alertname = "InfoInhibitor"
target_matchers:
- severity = "info"
equal: [namespace]
- source_matchers:
- severity = "critical"
target_matchers:
- severity =~ "info|warning"
equal: [alertname]
- source_matchers:
- severity = "warning"
target_matchers:
- severity = "info"
equal: [alertname]
templates:
- /etc/alertmanager/configmaps/alertmanager-templates/*.tmpl

grafana:
enabled: true
39 changes: 29 additions & 10 deletions fleet/lib/kube-prometheus-stack/overlays/ayekan/values.yaml
@@ -1,5 +1,6 @@
---
prometheus:

prometheusSpec:
configMaps:
- sd-snmp-network
@@ -180,7 +181,6 @@ alertmanager:
- lsst-webhooks
config:
global:
resolve_timeout: 5m
slack_api_url_file: /etc/alertmanager/secrets/lsst-webhooks/slack-test
route:
group_by: [alertname, namespace, site]
@@ -201,15 +201,32 @@ alertmanager:
continue: true
- receiver: slack-kube-test
matchers:
- alertname =~ "Kube.*"
- receiver: slack-node-test
group_by: [instance]
matchers:
- alertname =~ "Node.*"
- receiver: slack-network-test
group_by: [instance]
matchers:
- alertname =~ "Network.*"
- receivers =~ ".*,slack,.*"
continue: true
routes:
- receiver: slack-kube-test
matchers:
- alertname =~ "Kube.*"
- receiver: slack-node-test
group_by: [instance]
matchers:
- alertname =~ "Node.*"
- receiver: slack-network-test
group_by: [instance]
matchers:
- alertname =~ "Network.*"
# Below is an example for the namespace based alert routing.
# This will send alerts from a namespace to the namespace specific team
# on slack
# - receiver: slack-rook-ceph-team
# matchers:
# - namespace = "rook-ceph"
# Below is an example for the group based alert routing.
# This will send alerts with a specific group in the receiver list to the
# alert channel.
# - receiver: email-group
# matchers:
# - receivers =~ ".*,group,.*"
receivers:
- name: "null"
- name: watchdog
@@ -260,3 +277,5 @@ alertmanager:
equal: [alertname]
templates:
- /etc/alertmanager/configmaps/alertmanager-templates/*.tmpl
grafana:
defaultDashboardsEnabled: false
5 changes: 5 additions & 0 deletions fleet/lib/prometheus-alertrules/Chart.yaml
@@ -0,0 +1,5 @@
apiVersion: v2
appVersion: 0.1.0
description: Prometheus LSST rules GitOps
name: lsst-prometheus-alerts
version: 0.1.1
14 changes: 14 additions & 0 deletions fleet/lib/prometheus-alertrules/fleet.yaml
@@ -0,0 +1,14 @@
---
defaultNamespace: &name lsst-prometheus-alerts
labels:
bundle: *name
namespaceLabels:
lsst.io/discover: "true"
helm:
releaseName: *name
takeOwnership: true
waitForJobs: false
dependsOn:
- selector:
matchLabels:
bundle: prometheus-operator-crds
164 changes: 164 additions & 0 deletions fleet/lib/prometheus-alertrules/rules/ceph.yaml
@@ -0,0 +1,164 @@
groups:
- name: ceph.rules
rules:
- alert: CephQuotaFillingUp
annotations:
summary: The Ceph pool quota in cluster {{ $labels.prom_cluster }} is almost full
description: |
Ceph pool id {{ $labels.pool_id }} on {{ $labels.prom_cluster }}/ {{
$labels.namespace }}/{{ $labels.pod }} is at {{ $value }}%. Please
keep in mind that ceph pools reaching 100% is dangerous.
labels:
severity: warning
receivers: ",slack,"
expr: |
(ceph_pool_stored/ceph_pool_quota_bytes > 0.75 and ceph_pool_quota_bytes != 0)*100
- alert: CephQuotaFillingUp
annotations:
summary: The Ceph pool quota is almost full
description: |
Ceph pool id {{ $labels.pool_id }} on {{ $labels.prom_cluster }}/ {{
$labels.namespace }}/{{ $labels.pod }} is at {{ $value }}%. Please
keep in mind that ceph pools reaching 100% is dangerous.
labels:
severity: critical
receivers: ",slack,"
expr: |
(ceph_pool_stored/ceph_pool_quota_bytes > 0.9 and ceph_pool_quota_bytes != 0)*100
- alert: CephTargetDown
expr: up{job=".*ceph.*"} == 0
Member

The wildcard matching does not seem to be working with prometheus on ayekan.

Demonstration that there are up metrics with a label that includes ceph:
[screenshot]

However, the expr in this rule doesn't match anything:
[screenshot]
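
For reference, plain PromQL label matchers are exact string matches; wildcard-style matching needs the regex operator, so the expression above presumably wants something like the following (a suggestion, not a change included in this diff):

```yaml
expr: up{job=~".*ceph.*"} == 0
```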

for: 10m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
CEPH target on {{ $labels.prom_cluster }} down for more than 10m,
please check - it could be either an exporter crash or a whole cluster
crash
summary: CEPH exporter down on {{ $labels.prom_cluster }}
- alert: CephErrorState
expr: ceph_health_status > 1
for: 5m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
Ceph is in Error state on {{ $labels.prom_cluster }} for longer than
5m, please check status of pools and OSDs
summary: CEPH in ERROR
- alert: CephWarnState
expr: ceph_health_status == 1
for: 30m
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
Ceph is in Warn state on {{ $labels.prom_cluster }} for longer than
30m, please check status of pools and OSDs
summary: CEPH in WARN
- alert: OsdDown
expr: ceph_osd_up == 0
for: 30m
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
OSD has been down for longer than 30 min on {{ $labels.prom_cluster }},
please check its status
summary: OSD down
- alert: OsdApplyLatencyTooHigh
expr: ceph_osd_apply_latency_ms > 5000
for: 90s
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
OSD latency for {{ $labels.osd }} is too high on {{
$labels.prom_cluster }}. Please check whether it is stuck in a weird
state
summary: OSD latency too high {{ $labels.osd }}
- alert: CephPgDown
expr: ceph_pg_down > 0
for: 3m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
Some groups are down (unavailable) for too long on {{
$labels.prom_cluster }}. Please ensure that all the data are
available
summary: PG DOWN [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephPgIncomplete
expr: ceph_pg_incomplete > 0
for: 2m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
Some groups are incomplete (unavailable) for too long on {{
$labels.prom_cluster }}. Please ensure that all the data are
available
summary: PG INCOMPLETE [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephPgInconsistent
expr: ceph_pg_inconsistent > 0
for: 1m
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
Some groups are inconsistent for too long on {{ $labels.prom_cluster
}}. Data is available but inconsistent across nodes
summary: PG INCONSISTENT [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephPgActivating
expr: ceph_pg_activating > 0
for: 5m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
Some groups are activating for too long on {{ $labels.prom_cluster
}}. Those PGs are unavailable for too long!
summary: PG ACTIVATING [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephPgBackfillTooFull
expr: ceph_pg_backfill_toofull > 0
for: 5m
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
Some groups are located on a full OSD on cluster {{
$labels.prom_cluster }}. Those PGs can be unavailable shortly. Please
check OSDs, change weight or reconfigure CRUSH rules.
summary: PG TOO FULL [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephPgUnavailable
expr: ceph_pg_total - ceph_pg_active > 0
for: 5m
labels:
severity: critical
receivers: ",slack,"
annotations:
description: |
Some groups are unavailable on {{ $labels.prom_cluster }}. Please
check their detailed status and current configuration.
summary: PG UNAVAILABLE [{{ $value }}] on {{ $labels.prom_cluster }}
- alert: CephOsdReweighted
expr: ceph_osd_weight < 1
for: 1h
labels:
severity: warning
receivers: ",slack,"
annotations:
description: |
OSD on cluster {{ $labels.prom_cluster }} has been reweighted for too long.
Please either create a silence or fix the issue
summary: OSD {{ $labels.ceph_daemon }} on {{ $labels.prom_cluster }} reweighted - {{ $value }}