Add docs about SLO burn rate alert details page (#3846) (#3861)
* Add docs about SLO burn rate alert details page

* Fix ID

* Apply suggestions from mdbirnstiehl

(cherry picked from commit 6231d4a)

Co-authored-by: DeDe Morton <[email protected]>
mergify[bot] and dedemorton authored May 6, 2024
1 parent d96b644 commit f5d5b76
Showing 11 changed files with 64 additions and 9 deletions.
Binary file removed docs/en/observability/images/action-dropdown.png
Binary file removed docs/en/observability/images/app-link-icon.png
1 change: 1 addition & 0 deletions docs/en/observability/images/icons/boxesHorizontal.svg
1 change: 1 addition & 0 deletions docs/en/observability/images/icons/boxesVertical.svg
1 change: 1 addition & 0 deletions docs/en/observability/images/icons/eye.svg
1 change: 1 addition & 0 deletions docs/en/observability/index.asciidoc
@@ -171,6 +171,7 @@ include::profiling-self-managed-troubleshooting.asciidoc[leveloffset=+3]
include::create-alerts.asciidoc[leveloffset=+1]
include::aggregation-options.asciidoc[leveloffset=+2]
include::view-observability-alerts.asciidoc[leveloffset=+2]
include::triage-slo-burn-rate-breaches.asciidoc[leveloffset=+3]

//SLOs
include::slo-overview.asciidoc[leveloffset=+1]
11 changes: 10 additions & 1 deletion docs/en/observability/slo-burn-rate-alert.asciidoc
@@ -79,4 +79,13 @@ You can also specify {kibana-ref}/rule-action-variables.html[variables common to all rules].
To receive a notification when the alert recovers, select *Run when Recovered*. Use the default notification message or customize it. You can add more context to the message by clicking the icon above the message text box and selecting from a list of available variables.

[role="screenshot"]
image::images/duration-anomaly-alert-recovery.png[Default recovery message for Uptime duration anomaly rules with open "Add variable" popup listing available action variables,width=600]
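
A minimal sketch of a customized recovery message, assuming it only uses {kibana-ref}/rule-action-variables.html[variables common to all rules] such as `{{rule.name}}`, `{{alert.id}}`, and `{{date}}` (confirm the exact variables in the *Add variable* list for your rule before saving):

[source,txt]
----
Resolved: the burn rate alert for rule "{{rule.name}}" (alert {{alert.id}}) recovered at {{date}}.
----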

[discrete]
[[slo-creation-next-steps]]
== Next steps

Learn how to view alerts and triage SLO burn rate breaches:

* <<view-observability-alerts, View alerts>>
* <<triage-slo-burn-rate-breaches, Triage SLO burn rate breaches>>
7 changes: 5 additions & 2 deletions docs/en/observability/slo-overview.asciidoc
@@ -117,7 +117,7 @@ Once an SLO is reset, it will start to regenerate SLIs and summary data.
[%collapsible]
.Remove legacy summary transforms
====
After migrating to 8.12 or later, you might have some legacy SLO summary transforms running.
You can safely delete the following legacy summary transforms:
[source,sh]
@@ -153,8 +153,11 @@ Do not delete any new summary transforms used by your migrated SLOs.
[discrete]
[[slo-overview-next-steps]]
== Next steps
To get started using SLOs to measure your service performance, see the following pages:

Get started using SLOs to measure your service performance:

* <<slo-privileges, Configure SLO access>>
* <<slo-create, Create an SLO>>
* <<slo-burn-rate-alert, Create an SLO burn rate alert rule>>
* <<view-observability-alerts, View alerts>>
* <<triage-slo-burn-rate-breaches, Triage SLO burn rate breaches>>
39 changes: 39 additions & 0 deletions docs/en/observability/triage-slo-burn-rate-breaches.asciidoc
@@ -0,0 +1,39 @@
[[triage-slo-burn-rate-breaches]]
= Triage SLO burn rate breaches
++++
<titleabbrev>SLO burn rate breaches</titleabbrev>
++++

SLO burn rate breaches occur when the percentage of bad events over a specified time period exceeds the threshold set in your <<slo-burn-rate-alert,SLO burn rate rule>>.
When this happens, you are at risk of exhausting your error budget and violating your SLO.
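
For example, assuming a 99% SLO target over a 30-day window, the error budget is 1% of events. If 5% of events are bad during the rule's lookback window, the burn rate is 5% / 1% = 5, and at that rate the entire error budget would be consumed in about one fifth of the 30-day window (roughly six days).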

To triage issues quickly, go to the alert details page:

. Go to **{observability}** -> **Alerts** (or open the SLO and click **Alerts**).
. From the Alerts table, click the image:images/icons/boxesHorizontal.svg[More actions] icon next to the alert and select **View alert details**.

The alert details page shows information about the alert, including when it was triggered,
its duration, the source SLO, and the rule that triggered it.
You can follow the links to navigate to the source SLO or rule definition.

Explore charts on the page to learn more about the SLO breach:

[role="screenshot"]
image::images/slo-burn-rate-breach.png[Alert details for SLO burn rate breach]

* The first chart shows the burn rate during the time range when the alert was active.
The line indicates how close the SLO came to breaching the threshold.
* The next chart shows the alert history over the last 30 days, including the number of alerts that were triggered and the average time it took to recover after a breach.
* Both timelines are annotated to show when the threshold was breached.
You can hover over an alert icon to see the timestamp of the alert.

The number, duration, and frequency of these breaches over time indicate how severely the service is degrading, so you can focus on the highest-severity issues first.

NOTE: The contents of the alert details page may vary depending on the type of SLI that's defined in the SLO.

After investigating the alert, you may want to:

* Click **Snooze the rule** to snooze notifications for a specific time period or indefinitely.
* Click the image:images/icons/boxesVertical.svg[Actions] icon and select **Add to case** to add the alert to a new or existing case. To learn more, refer to <<create-cases>>.
* Click the image:images/icons/boxesVertical.svg[Actions] icon and select **Mark as untracked**.
12 changes: 6 additions & 6 deletions docs/en/observability/view-observability-alerts.asciidoc
@@ -33,7 +33,7 @@ By default, this filter is set to *Show all* alerts, but you can filter to show
An alert is "Active" when the condition defined in the rule currently matches.
An alert has "Recovered" when that condition, which previously matched, is currently no longer matching.
An alert is "Untracked" when its corresponding rule is disabled or you mark the alert as untracked.
To mark the alert as untracked, go to the Alerts table, click image:images/action-dropdown.png[Three dots used to expand the "More actions" menu,height=22] to expand the "More actions" menu, and click *Mark as untracked*.
To mark the alert as untracked, go to the Alerts table, click the image:images/icons/boxesHorizontal.svg[More actions] icon to expand the "More actions" menu, and click *Mark as untracked*.

NOTE: There is also a "Flapping" status, which means the alert is switching repeatedly between active and recovered states.
This status is possible only if you have enabled alert flapping detection.
@@ -55,17 +55,17 @@ image::view-alert-details.png[View alert details flyout on the Alerts page]
To further inspect the alert:

* From the alert detail flyout, click *Alert details*.
* From the Alerts table, use the image:images/action-dropdown.png[Three dots used to expand the "More actions" menu,height=22] and click *View alert details*.
* From the Alerts table, click the image:images/icons/boxesHorizontal.svg[More actions] icon and select *View alert details*.

To further inspect the rule:

* From the alert detail flyout, click *View rule details*.
* From the Alerts table, use the image:images/action-dropdown.png[Three dots used to expand the "More actions" menu,height=22] and click *View rule details*.
* From the Alerts table, click the image:images/icons/boxesHorizontal.svg[More actions] icon and select *View rule details*.

To view the alert in the app that triggered it:

* From the alert detail flyout, click *View in app*.
* From the Alerts table, click the image:images/app-link-icon.png[Eye icon used to "View in app",height=22].
* From the Alerts table, click the image:images/icons/eye.svg[View in app] icon.

[discrete]
[[customize-observability-alerts-table]]
@@ -89,8 +89,8 @@ You can also use the toolbar buttons in the upper-right to customize the display
[[cases-observability-alerts]]
== Add alerts to cases

From the Alerts table, you can add one or more alerts to a case. Select image:images/action-dropdown.png[Three dots used to expand the "More actions" menu,height=22]
to add the alert to a new case or add it to an existing case. You can add an unlimited amount of alerts from any rule type.
From the Alerts table, you can add one or more alerts to a case. Click the image:images/icons/boxesHorizontal.svg[More actions] icon
to add the alert to a new or existing case.

NOTE: Each case can have a maximum of 1,000 alerts.

