From 987d1e8179f781cb68e1fc2baa1142dc3a2c11bc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edu=20Gonz=C3=A1lez=20de=20la=20Herr=C3=A1n?= <25320357+eedugon@users.noreply.github.com> Date: Tue, 22 Oct 2024 12:30:07 +0200 Subject: [PATCH] initial draft prepared --- docs/en/observability/slo-create.asciidoc | 6 + .../observability/slo-troubleshoot.asciidoc | 373 +++++++++++++++++- 2 files changed, 358 insertions(+), 21 deletions(-) diff --git a/docs/en/observability/slo-create.asciidoc b/docs/en/observability/slo-create.asciidoc index e1da49a152..c6b45cf70a 100644 --- a/docs/en/observability/slo-create.asciidoc +++ b/docs/en/observability/slo-create.asciidoc @@ -18,6 +18,12 @@ From here, complete the following steps: . <>. . <>. +[NOTE] +==== +For SLOs to function, the cluster must include one or more nodes with both `ingest` and `transform` {ref}/modules-node.html#node-roles[roles] (they can co-exist or be distributed across separate nodes). +On ESS deployments (Elastic Cloud), this is handled by the hot nodes, which serve as both `ingest` and `transform` nodes. +==== + [discrete] [[define-sli]] = Define your SLI diff --git a/docs/en/observability/slo-troubleshoot.asciidoc b/docs/en/observability/slo-troubleshoot.asciidoc index 9cc190613c..88045ca240 100644 --- a/docs/en/observability/slo-troubleshoot.asciidoc +++ b/docs/en/observability/slo-troubleshoot.asciidoc @@ -1,5 +1,5 @@ [[slo-troubleshoot-slos]] -= Troubleshoot SLOs += Troubleshoot service-level objectives (SLOs) ++++ Troubleshoot SLOs @@ -12,44 +12,375 @@ To create and manage SLOs, you need an {subscriptions}[appropriate license] and ==== // end::slo-license[] -This section provides solutions to common questions and problems, -and processing and performance guidance. +This document provides an overview of common issues encountered when working with service-level objectives (SLOs). 
It explores the relationships between SLOs and other core functionalities within the stack, such as {ref}/transforms.html[transforms] and {ref}/ingest.html[ingest pipelines], highlighting how these integrations can impact the functionality of SLOs. -TBD (table of contents with links APM style?) +* <> +* <> +* <> +* <> +** <> +** <> +** <> +** <> +** <> +* <> +** <> +** <> +** <> +** <> +** <> +** <> +** <> +* <> +** <> -screenshot available: https://github.com/elastic/kibana/pull/181351 +[discrete] +[[slo-resources-details]] +== SLO Overview -transforms troubleshooting doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-troubleshooting.html +An SLO is represented by several system resources: +* *Definition*: Stored as a Kibana Saved Object +* *Transforms*: For each SLO, {kib} creates two transforms: + * *Rolling-up transform*: rolls up the data into a smaller set of documents. + * *Summarising transform*: Updates the latest values, such as the observed SLI or remaining error budget, for efficient searching and filtering of SLOs. +* *Additional resources*: {kib} also installs and manages shared resources to the SLOs, including index templates, indices, and ingest pipelines, among others. -explain a SLO rely on Transforms (at least on stateful). +The rollup documents are stored in `.slo-observability.sli-v3` (index split per month through an ingest pipeline) while summary documents are stored in `.slo-observability.summary-v3`. -- Help users to to make use of the warning page introduced in 8.15 about unhealthy transforms, and point to the transform troubleshooting docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-troubleshooting.html). We shouldn't focus much on transforms troubleshooting as it's not our domain. +Each time an SLO is updated, a new transform is created using the latest definition. 
The transform ID is generated by combining the SLO id and the SLO revision, following the format: `slo-{slo.id}-{slo.revision}`. -- explain a SLO rely on Transforms and the cluster requires at least 1 node with the transform role +One of the common issues with SLOs arises when there are underlying problems in the cluster, such as unavailable shards or failed transforms. Since SLOs rely on transforms to aggregate and process data, any failure or misconfiguration in these components can lead to inaccurate or incomplete SLO calculations. Additionally, unavailable shards can affect the data retrieval process, further complicating the reliability of SLO metrics. -- explain a SLO rely on Ingest pipelines and the cluster requires at least 1 node with the ingest role +Ensuring that transforms are functioning correctly and that the cluster is healthy is crucial for maintaining accurate and reliable SLOs. -- explain SLO relies on some built-in transforms slo-summary* and those have not be deleted or stopped - and eventually document if they're auto-recreated (or how to do it) +[discrete] +[[slo-and-transforms]] +== SLOs and Transforms -associated issue: https://github.com/elastic/observability-docs/issues/4237 +(TBD: should we skip this section if we have already explained enough in the previous one?) +When working with Service Level Objectives (SLOs) in Elasticsearch, ensuring that the associated transforms function correctly is crucial. Transforms are responsible for generating the data needed for SLOs, and typically, two transforms are created for each SLO. If you notice that your SLOs are not displaying the expected data, it's time to check the health of these associated transforms. +[discrete] +[[slo-and-ingest]] +== SLOs and Ingest Pipelines + +(anything specific to add here more than the previous content?) +(should we add more details about index templates and indices being used?) 
[discrete] -[[slo-and-transforms]] -== SLO and Transforms relation +[[slo-common-problems]] +== Common Problems + +[discrete] +[[slo-no-transform-ingest-node]] +=== No transform or ingest nodes + +Since SLOs depend on both ingest pipelines and transforms to process the data, it's essential to ensure that the cluster has nodes with the appropriate {ref}/modules-node.html#node-roles[roles]. + +Ensure the cluster includes at least one node with the `ingest` role and at least one node with the `transform` role (the two {ref}/modules-node.html#node-roles[roles] can co-exist on the same node or be distributed across separate nodes), to support the data processing and transformations required for SLOs to function properly. + +[discrete] +[[slo-transform-unhealthy]] +=== Unhealthy transforms + +(TBD: pending introductory text) +UI message: "The following transform is in an unhealthy state" +(add screenshot of what the unhealthy transform report looks like, the warning introduced in 8.15) + +Possible reasons: +* SLO source data is malformed (for example, timestamps that cannot be parsed) + +Refer to the {ref}/transform-troubleshooting.html[troubleshooting transforms] documentation for detailed guidance on diagnosing and resolving transform-related issues. + +( +TBD: These 2 KBs also have very good data; should we cover them here or link to them? +https://support.elastic.co/knowledge/8669ddeb +https://support.elastic.co/knowledge/ad17899e (this is purely for transforms, probably irrelevant here) +) + +Transform checks: +* Ensure the transforms needed by the SLOs haven't been deleted or stopped. ++ +If a transform has been deleted, the easiest way to recreate it is to update the SLO, because every SLO update creates a new transform. +* Other checks? 
(TBD) + +Tips: + +* Look up the id and revision of an SLO in its saved object, then fetch the transform for that SLO: + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/saved_objects/_find?type=slo + +GET _transform/slo-{id}-{revision} +---------------------------------- + +* Fetch all transforms related to SLOs using: +[source,console] +---------------------------------- +GET _transform/slo-* +---------------------------------- + +* Fetch stats of a given transform: + +[source,console] +---------------------------------- +GET _transform/{transform_id}/_stats +---------------------------------- + +[discrete] +[[slo-missing-pipeline]] +=== Missing Ingest Pipelines -Explain the relation between SLOs and transforms +(decide what to do here) [discrete] -[[slo-and-pipelines]] -== SLO and Ingest Pipelines +[[slo-missing-template]] +=== Missing Templates -Explain the relation between SLOs and Ingest pipelines +(decide what to do here) [discrete] -[[transforms-troubleshoot]] -== Transforms troubleshooting +[[slo-missing-indices]] +=== Missing Indices or Shards + +(decide what to do here. I'm sharing error examples I have collected to see if it makes sense to offer some background and context for issues that are not really related to SLO logic but to other parts of the stack). + +Other examples: +> Failed to execute phase [can_match], start; org.elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.ds-metrics-apm.internal-default-2024.06.08-000030][1], [.ds-metrics-apm.service_transaction.1m-default-2024.06.07-000023][1], [.ds-metrics-apm.transaction.1m-default-2024.06.07-000024][1]]. Consider using `allow_partial_search_results` setting to bypass this error. 
+ +Another example (unavailable remote cluster, CCS): +> Validation Failed: 1: no such remote cluster: [metrics];2: no such remote cluster: [metrics]; + +> Some transform failures are unrelated to SLOs or Observability and originate in the platform instead (for example, circuit breaker exceptions caused by low memory on the Elasticsearch side). + +[source,bash] +---- + "reason": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]]. Other failures: +[IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""", + + "issue": "Transform task state is [failed]", + "details": """Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [unable to parse date [1702842480000]]. Other failures: +[IngestProcessorException] message [org.elasticsearch.ingest.IngestProcessorException: java.lang.IllegalArgumentException: unable to parse date [1702842480000]]; java.lang.IllegalArgumentException: unable to parse date [1702842480000]]""", + "count": 1 +---- + +[discrete] +[[slo-troubleshoot-beta]] +=== After upgrading from a Beta version, SLOs don't show up + +If you upgrade from a Beta version (<8.12) to 8.12+, some SLOs might not be recoverable. In that case, clean up any residual resources and start fresh. + +To completely remove an SLO and its resources: + +. Remove the rollup transform: `slo-{id}-{revision}`. + +. Remove the summary transform: `slo-summary-{id}-{revision}`. + +. 
Remove the summary ingest pipeline: `.slo-observability.summary.pipeline-{id}-{revision}`. + +. Remove the SLO saved object. + + +[discrete] +[[slo-api-calls]] +== Using API calls to retrieve SLO details + +The following {kib} API calls retrieve different levels of detail about SLOs and their surrounding components. + +[discrete] +[[slo-api-find]] +=== Find SLO definitions + +You can achieve this in multiple ways: + +* From Saved Objects + +The following query returns the stored SLO definitions. SLOs, and therefore this API, are space-aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/saved_objects/_find?type=slo +---------------------------------- + +* Through the `_definitions` API + +The following internal API returns the SLO definitions. It is space-aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/observability/slos/_definitions +---------------------------------- + +* Through the `slos` API + +The following public API returns the total number of SLOs, including group-by instances. It is space-aware. + +[source,console] +---------------------------------- +GET kbn:/s/{space}/api/observability/slos +---------------------------------- + +* Through the UI + +The SLO Overview page in the SLO UI also displays the total number of SLOs. 
+ +* Via Raw Kibana index + +[source,console] +---------------------------------- +GET .kibana*/_search +{ + "size": 10, # adjust this + "query": { + "term": { + "type": { + "value": "slo" + } + } + } +} +---------------------------------- + + +[discrete] +[[slo-api-find-specific]] +=== Find definition for a specific SLO + +The following internal API returns the SLO definition for a specific SLO, filtered by the name of the SLO: + +[source,console] +---------------------------------- +GET kbn:/api/observability/slos/_definitions?search=Some SLO +---------------------------------- + + + +[discrete] +[[slo-api-find-rollup]] +=== Find rollup SLO transforms + +Each SLO creates a rollup transform, and every time you update the SLO a new transform is created with the latest definition. + +The transform ID is built from the SLO id and the SLO revision as `slo-{slo.id}-{slo.revision}`. + +Fetch a specific transform for a given SLO using this call: + +[source,console] +---------------------------------- +GET _transform/slo-{id}-{revision} +---------------------------------- + +You can also fetch all SLO transforms using: + +GET _transform/slo-* + +[discrete] +[[slo-api-rollup-documents]] +=== Search the rollup documents for an SLO + +When investigating why an SLO has shown no data for too long, it can be useful to fetch the latest rollup document for a given SLO id and, optionally, an instance id. 
+ +[source,console] +---------------------------------- +POST .slo-observability.sli-v3*/_search +{ + "sort": [ + { + "event.ingested": { + "order": "desc" + } + } + ], + "query": { + "bool": { + "filter": [ + { + "term": { + "slo.id": "id" + } + }, + { + "term": { + "slo.instanceId": "instanceId" + } + } + ] + } + } +} +---------------------------------- + +[discrete] +[[slo-api-summary-documents]] +=== Search the summary documents for an SLO + +It can be useful to fetch the latest summary document for a given slo id and optionally an instance id: + +[source,console] +---------------------------------- +POST .slo-observability.summary-v3*/_search +{ + "query": { + "bool": { + "filter": [ + { + "term": { + "slo.id": "id" + } + }, + { + "term": { + "slo.instanceId": "instanceId" + } + } + ] + } + } +} +---------------------------------- + +[discrete] +[[slo-troubleshoot-inspect]] +=== Inspect SLO Assets + +In order to inspect any of the following: + +. SLO Configuration +. Rollup Transform Configuration +. Summary Transform Configuration +. SLO Ingest Pipeline +. Temporary Document + +Follow the steps: +. Open Kibana's *Stack Management* -> *Advanced Settings* +. Enable `observability:enableInspectEsQueries` +. Visit the SLO edit page and click on *SLO Inspect* + +[discrete] +[[slo-troubleshooting-actions]] +== Actions (TBD) + +intro text? + +[discrete] +[[slo-troubleshooting-reset]] +=== Reset transforms + +[NOTE] +==== +While resetting an SLO can help resolve certain issues, it may not always address the root cause of errors. Most errors related to transforms typically arise from improperly structured source data, such as unparseable timestamps, which prevent the transform from progressing. Additionally, misformatted SLO queries, and consequently transform queries, can also lead to failures. + +Therefore, before resetting the SLO, verify that the source data and queries are correctly formatted and validated. 
Resetting should only be used as a last resort when all other troubleshooting steps have been exhausted. +==== + +If you are on 8.12+, reset the SLO using the following Dev Console command: + +[source,console] +---- +POST kbn:/api/observability/slos/{sloId}/_reset +---- -Introduction and link to the relevant doc about transforms. +This action deletes all SLI data, summary data, and transforms, and then reinstalls the transforms and reprocesses the data. Essentially, it recreates the SLO as if it had been deleted and re-created by the user.
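When checking or cleaning up the resources of a single SLO by hand, it can help to derive their names from the SLO id and revision. The following is a minimal Python sketch, assuming the naming patterns quoted earlier in this document (`slo-{id}-{revision}`, `slo-summary-{id}-{revision}`, and `.slo-observability.summary.pipeline-{id}-{revision}`) apply to your version. The function name `slo_resource_names` and the sample id are hypothetical and only serve as illustration.

```python
def slo_resource_names(slo_id: str, revision: int) -> dict:
    """Build the per-SLO resource names from the SLO id and revision.

    Assumption: the naming patterns are the ones listed in this document.
    """
    suffix = f"{slo_id}-{revision}"
    return {
        # Rollup transform that aggregates the source data
        "rollup_transform": f"slo-{suffix}",
        # Summary transform that keeps the latest SLI and error budget values
        "summary_transform": f"slo-summary-{suffix}",
        # Ingest pipeline used by the summary transform
        "summary_pipeline": f".slo-observability.summary.pipeline-{suffix}",
    }


if __name__ == "__main__":
    # "my-slo-id" is a made-up id used only for this example
    for kind, name in slo_resource_names("my-slo-id", 3).items():
        print(f"{kind}: {name}")
```

The returned names can then be used in the `GET _transform/...` calls shown in this document, for example to confirm that the transforms exist again after a reset.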