(fleet/prometheus rules) GitOps prom rules #346
Open — fbegyn wants to merge 16 commits into master from IT-5303-gitops-prom-rules
Changes from all 16 commits:
7a01fe9 — (fleet/prometheus/alerts) set up gitops for alert deployment (fbegyn)
0d90c79 — (fleet/prometheus) modify alerting stack triggers (fbegyn)
0885a7c — (fleet/alertrules) remove default receivers and update docs (fbegyn)
882d4ad — (fleet/alerts) add ceph alerts (fbegyn)
75ded13 — (fleet/alerting) remove ceph override (fbegyn)
e8d6b86 — (fleet/alerts) include cluster in slack alerts (fbegyn)
b44d94c — (fleet/alerts) move slack credentials to cluster overlay (fbegyn)
78afed4 — (fleet/alerts) remove `atomic` default from fleet.yaml (fbegyn)
9354afa — (fleet/alerts) insert templating values for node disk alerts (fbegyn)
58cd0b1 — (fleet/alerts) make yamllint happy (fbegyn)
b7a82b4 — (fleet/prometheusrule) remove explicit timeout (fbegyn)
62d8d60 — (fleet/alerts) remove duplicate docs in favor of top level (fbegyn)
22f5863 — (fleet/prom-stack) make yamllint happy (fbegyn)
5735726 — Update externalsecret-grafana-keycloak-credentials.yaml (fbegyn)
a9e34dc — Disabling default deployment of K8s dashboards on grafana, in order t… (KrisBuytaert)
a024f6c — Fixing Yaml (KrisBuytaert)
@@ -0,0 +1,30 @@
# Prometheus rules GitOps

Any Prometheus rules file defined in the
[fleet/lib/prometheus-alertrules/rules](../../prometheus-alertrules/rules)
directory will be deployed to the cluster. It's possible to define a default
namespace in the `values.yaml` file with the `rules.namespace` key.
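A minimal `values.yaml` sketch of that key (the namespace value here is illustrative, not taken from the repository):

```yaml
rules:
  # Default namespace for deployed rules; "monitoring" is an assumed value
  namespace: monitoring
```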
## Adding Prometheus rules

1. Write the Prometheus rules in a YAML file according to the [Prometheus
   specification](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
1. Add the YAML file to the `/rules` directory.
1. Commit.
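A minimal rules file following these steps might look like the sketch below; the alert name, job label, and threshold are illustrative, not taken from this repository:

```yaml
# rules/node-exporter-down.yaml — hypothetical example
groups:
  - name: example.rules
    rules:
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          summary: A node exporter is down
          description: |
            node-exporter on {{ $labels.instance }} has been unreachable for
            more than 5 minutes.
```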
## Prometheus rule AURA standards

* `summary` annotation: used to describe a group of incoming alerts. This
  annotation DOES NOT contain any templated variables and provides a simple,
  single-sentence summary of what the alert is about, for example "Disk space
  full in 24h". When a cluster triggers several alerts, it can be handy to
  group these alerts into a single notification; this is where the `summary`
  is used.
* `description` annotation: provides a detailed overview of the alert,
  specific to this instance of the alert. It MAY contain templated variables
  to enrich the message.
* `receivers` label: used by Alertmanager to decide on the routing of the
  notification for the alert. It consists of a `,`-separated list of
  receivers, prefixed and suffixed with `,` to make regex matching easier in
  Alertmanager, for example: `,slack,squadcast,email,`. The receivers are
  defined in the Alertmanager configuration.
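The `,`-wrapped list makes each receiver matchable with a simple regex in the Alertmanager routing tree. A sketch of such routing, assuming receivers named `slack` and `squadcast` exist in the configuration (the route layout itself is illustrative):

```yaml
route:
  receiver: default
  routes:
    # Route to each notifier whose name appears in the comma-wrapped
    # `receivers` label; `continue: true` lets one alert match several routes.
    - receiver: slack
      match_re:
        receivers: .*,slack,.*
      continue: true
    - receiver: squadcast
      match_re:
        receivers: .*,squadcast,.*
```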
@@ -18,4 +18,4 @@ spec:
      - secretKey: keycloak_url
        remoteRef:
          key: *item
-         property: hostname
+         property: url
@@ -0,0 +1,5 @@
apiVersion: v2
appVersion: 0.1.0
description: Prometheus LSST rules GitOps
name: lsst-prometheus-alerts
version: 0.1.1
@@ -0,0 +1,14 @@
---
defaultNamespace: &name lsst-prometheus-alerts
labels:
  bundle: *name
namespaceLabels:
  lsst.io/discover: "true"
helm:
  releaseName: *name
  takeOwnership: true
  waitForJobs: false
dependsOn:
  - selector:
      matchLabels:
        bundle: prometheus-operator-crds
@@ -0,0 +1,164 @@
groups:
  - name: ceph.rules
    rules:
      - alert: CephQuotaFillingUp
        annotations:
          summary: The Ceph pool quota in cluster {{ $labels.prom_cluster }} is almost full
          description: |
            Ceph pool id {{ $labels.pool_id }} on {{ $labels.prom_cluster }}/{{
            $labels.namespace }}/{{ $labels.pod }} is at {{ $value }}%. Please
            keep in mind that Ceph pools reaching 100% is dangerous.
        labels:
          severity: warning
          receivers: ",slack,"
        expr: |
          (ceph_pool_stored/ceph_pool_quota_bytes > 0.75 and ceph_pool_quota_bytes != 0)*100
      - alert: CephQuotaFillingUp
        annotations:
          summary: The Ceph pool quota is almost full
          description: |
            Ceph pool id {{ $labels.pool_id }} on {{ $labels.prom_cluster }}/{{
            $labels.namespace }}/{{ $labels.pod }} is at {{ $value }}%. Please
            keep in mind that Ceph pools reaching 100% is dangerous.
        labels:
          severity: critical
          receivers: ",slack,"
        expr: |
          (ceph_pool_stored/ceph_pool_quota_bytes > 0.9 and ceph_pool_quota_bytes != 0)*100
      - alert: CephTargetDown
        expr: up{job=~".*ceph.*"} == 0
        for: 10m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Ceph target on {{ $labels.prom_cluster }} down for more than 10m,
            please check - it could be either an exporter crash or a whole
            cluster crash
          summary: Ceph exporter down on {{ $labels.prom_cluster }}
      - alert: CephErrorState
        expr: ceph_health_status > 1
        for: 5m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Ceph is in Error state on {{ $labels.prom_cluster }} for longer than
            5m, please check the status of pools and OSDs
          summary: CEPH in ERROR
      - alert: CephWarnState
        expr: ceph_health_status == 1
        for: 30m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            Ceph is in Warn state on {{ $labels.prom_cluster }} for longer than
            30m, please check the status of pools and OSDs
          summary: CEPH in WARN
      - alert: OsdDown
        expr: ceph_osd_up == 0
        for: 30m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            OSD has been down for longer than 30 min on {{
            $labels.prom_cluster }}, please check its status
          summary: OSD down
      - alert: OsdApplyLatencyTooHigh
        expr: ceph_osd_apply_latency_ms > 5000
        for: 90s
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            OSD latency for {{ $labels.osd }} is too high on {{
            $labels.prom_cluster }}. Please check whether it is stuck in a
            weird state
          summary: OSD latency too high {{ $labels.osd }}
      - alert: CephPgDown
        expr: ceph_pg_down > 0
        for: 3m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups are down (unavailable) for too long on {{
            $labels.prom_cluster }}. Please ensure that all the data is
            available
          summary: PG DOWN [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephPgIncomplete
        expr: ceph_pg_incomplete > 0
        for: 2m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups are incomplete (unavailable) for too long on
            {{ $labels.prom_cluster }}. Please ensure that all the data is
            available
          summary: PG INCOMPLETE [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephPgInconsistent
        expr: ceph_pg_inconsistent > 0
        for: 1m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups are inconsistent for too long on {{
            $labels.prom_cluster }}. Data is available but inconsistent across
            nodes
          summary: PG INCONSISTENT [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephPgActivating
        expr: ceph_pg_activating > 0
        for: 5m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups have been activating for too long on {{
            $labels.prom_cluster }}. Those PGs are unavailable for too long!
          summary: PG ACTIVATING [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephPgBackfillTooFull
        expr: ceph_pg_backfill_toofull > 0
        for: 5m
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups are located on a full OSD on cluster {{
            $labels.prom_cluster }}. Those PGs can become unavailable shortly.
            Please check OSDs, change the weight, or reconfigure CRUSH rules.
          summary: PG TOO FULL [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephPgUnavailable
        expr: ceph_pg_total - ceph_pg_active > 0
        for: 5m
        labels:
          severity: critical
          receivers: ",slack,"
        annotations:
          description: |
            Some placement groups are unavailable on {{ $labels.prom_cluster }}.
            Please check their detailed status and current configuration.
          summary: PG UNAVAILABLE [{{ $value }}] on {{ $labels.prom_cluster }}
      - alert: CephOsdReweighted
        expr: ceph_osd_weight < 1
        for: 1h
        labels:
          severity: warning
          receivers: ",slack,"
        annotations:
          description: |
            OSD on cluster {{ $labels.prom_cluster }} has been reweighted for
            too long. Please either create a silence or fix the issue
          summary: OSD {{ $labels.ceph_daemon }} on {{ $labels.prom_cluster }} reweighted - {{ $value }}
Why shouldn't templated values be in the summary?
I've written this quite globally for all kinds of levels of Prometheus alerting understanding. The tricky part is that you want the summary to be more or less the same for all the alerts in the same grouping. The risk of simply allowing templated variables in the summary is that each summary will be unique, making alert grouping less useful IMHO.
Not really a technical limitation, more dependent on the "alert standard".
I am having trouble coming up with a scenario where the summary would need to be used for grouping instead of the alertname. I think that having interpolated values in the summary is useful information. Alert manager's grouping is rather primitive and in most cases, we don't want any grouping unless there is a high alert volume, which alertmanager isn't capable of doing. I suspect we will end up shipping everything to squadcast ungrouped anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll adjust the config and standard to that use case then (no grouping unless specified). In that use case the summary can really be templated as wanted.
In what I described, the summary wouldn't really be used to group. Say you group on `alertname,cluster`. The summary could then be used to provide short bursts of information about the alert, without overloading the alert message with specific details if 25 pods are crashing in a single cluster. The only "useful" template value in the summary would then be `cluster`, so it's sometimes easier to avoid them completely.
However, if grouping is the exception rather than the norm (as would be the case if alerts are sent on to subsequent systems), the summary can be formatted however you like. Although in that case I'd argue that the difference between a `summary` and a `description` is negligible and you might as well just use the description.
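Grouping on `alertname,cluster` as discussed above corresponds to an Alertmanager route along these lines (a sketch; the receiver name and timings are illustrative):

```yaml
route:
  receiver: squadcast
  group_by: ["alertname", "cluster"]
  group_wait: 30s      # wait briefly to collect alerts that fire together
  group_interval: 5m   # minimum time between notifications for one group
```

For the no-grouping case, Alertmanager accepts the special value `group_by: ["..."]` (the literal `...`), which groups by all labels and so effectively disables grouping.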