Add doc for diagnosing backpressure from Elasticsearch (#4097)
* initial apm-es-backpressure doc draft

* address review comments

* fix internal doc references

* address review comments

* es backpressure troubleshoot doc fmt fix

* address comments

* fix doc typo

* add not recommended banner
1pkg authored Aug 6, 2024
1 parent 4471279 commit 406f676
Showing 2 changed files with 55 additions and 1 deletion.
51 changes: 51 additions & 0 deletions docs/en/observability/apm/apm-performance-diagnostic.asciidoc
@@ -0,0 +1,51 @@
[[apm-performance-diagnostic]]
=== APM Server performance diagnostic

[[apm-es-backpressure]]
[float]
==== Diagnosing backpressure from {es}

When {es} is under excessive load or indexing pressure, APM Server can experience backpressure when indexing new documents into {es}.
Most commonly, backpressure from {es} manifests as higher indexing latency and/or rejected requests, which in turn can lead APM Server to deny incoming requests.
As a result, APM agents connected to the affected APM Server will experience throttling and/or request timeouts when shipping APM events.

To quickly identify possible issues, look for error log lines similar to the following in the APM Server logs:

[source,json]
----
...
{"log.level":"error","@timestamp":"2024-07-27T23:46:28.529Z","log.origin":{"function":"github.com/elastic/go-docappender/v2.(*Appender).flush","file.name":"[email protected]/appender.go","file.line":370},"message":"bulk indexing request failed","service.name":"apm-server","error":{"message":"flush failed (429): [429 Too Many Requests]"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-27T23:55:38.612Z","log.origin":{"function":"github.com/elastic/go-docappender/v2.(*Appender).flush","file.name":"[email protected]/appender.go","file.line":370},"message":"bulk indexing request failed","service.name":"apm-server","error":{"message":"flush failed (503): [503 Service Unavailable]"},"ecs.version":"1.6.0"}
...
----
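
As a rough aid for the log check above, the following Python sketch counts `bulk indexing request failed` entries per HTTP status code. It assumes one JSON object per log line with the `log.level`, `message`, and `error.message` keys shown in the sample; the function name `count_bulk_failures` and the regular expression are illustrative assumptions and not part of APM Server.

```python
import json
import re
from collections import Counter

# Assumption: the status code appears as "flush failed (NNN)" in error.message,
# as in the sample log lines above.
STATUS_RE = re.compile(r"flush failed \((\d{3})\)")


def count_bulk_failures(lines):
    """Count 'bulk indexing request failed' error entries per HTTP status code."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("log.level") != "error":
            continue
        if entry.get("message") != "bulk indexing request failed":
            continue
        error = entry.get("error")
        message = error.get("message", "") if isinstance(error, dict) else ""
        match = STATUS_RE.search(message)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Shortened versions of the sample log lines, plus two lines that are ignored.
sample_lines = [
    "...",
    '{"log.level":"error","message":"bulk indexing request failed",'
    '"error":{"message":"flush failed (429): [429 Too Many Requests]"}}',
    '{"log.level":"error","message":"bulk indexing request failed",'
    '"error":{"message":"flush failed (503): [503 Service Unavailable]"}}',
    '{"log.level":"info","message":"unrelated entry"}',
]

failure_counts = count_bulk_failures(sample_lines)
for status, count in sorted(failure_counts.items()):
    print(f"HTTP {status}: {count} failed bulk request(s)")
```

A steadily growing count of 429 or 503 responses over a short period is the pattern the log lines above point to; isolated failures are less conclusive.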

To gain better insight into APM Server health and performance, consider enabling the monitoring feature by following the steps in <<apm-monitor-apm,Monitor APM Server>>.
When enabled, APM Server will additionally report a set of vital metrics to help you identify any performance degradation.

Pay careful attention to the following metric fields:

* `beats_stats.metrics.libbeat.output.events.active` that represents the number of buffered pending documents waiting to be ingested;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.acked` that represents the total number of documents that have been ingested successfully;
* `beats_stats.metrics.libbeat.output.events.failed` that represents the total number of documents that failed to ingest;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.toomany` that represents the number of documents that failed to ingest due to {es} responding with 429 Too Many Requests;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.available` that represents the number of bulk indexers available for making bulk index requests;
(_if this value is equal to 0 it may indicate {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.completed` that represents the number of already completed bulk requests;
* `beats_stats.metrics.output.elasticsearch.indexers.active` that represents the number of active bulk indexers that are concurrently processing batches;

See {metricbeat-ref}/exported-fields-beat.html[{metricbeat} documentation] for the full list of exported metric fields.
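
The checks described in the list above can be sketched in Python as a comparison of two metric snapshots taken some time apart. This is a minimal sketch under stated assumptions: each snapshot is a plain dictionary keyed by the field names listed above, and the function name `backpressure_signals` and the `min_increase` threshold (a stand-in for "increasing rapidly") are hypothetical. How you fetch the snapshots depends on your monitoring setup.

```python
# Fields from the list above whose rapid growth may indicate backpressure.
RISING_FIELDS = [
    "beats_stats.metrics.libbeat.output.events.active",
    "beats_stats.metrics.libbeat.output.events.failed",
    "beats_stats.metrics.libbeat.output.events.toomany",
]
# Field from the list above where a value of 0 may indicate backpressure.
AVAILABLE_FIELD = "beats_stats.output.elasticsearch.bulk_requests.available"


def backpressure_signals(previous, current, min_increase=1):
    """Return the metric fields that hint at backpressure between two snapshots.

    min_increase is a hypothetical threshold: a field is flagged only if it
    grew by at least this much between the two snapshots.
    """
    flagged = []
    for field in RISING_FIELDS:
        growth = current.get(field, 0) - previous.get(field, 0)
        if growth >= min_increase:
            flagged.append(field)
    # No bulk indexers available means every indexer is busy.
    if current.get(AVAILABLE_FIELD) == 0:
        flagged.append(AVAILABLE_FIELD)
    return flagged


# Made-up snapshot values, for illustration only.
earlier = {
    "beats_stats.metrics.libbeat.output.events.active": 100,
    "beats_stats.metrics.libbeat.output.events.failed": 0,
    "beats_stats.metrics.libbeat.output.events.toomany": 0,
    "beats_stats.output.elasticsearch.bulk_requests.available": 5,
}
later = {
    "beats_stats.metrics.libbeat.output.events.active": 5000,
    "beats_stats.metrics.libbeat.output.events.failed": 0,
    "beats_stats.metrics.libbeat.output.events.toomany": 120,
    "beats_stats.output.elasticsearch.bulk_requests.available": 0,
}

signals = backpressure_signals(earlier, later, min_increase=100)
for field in signals:
    print(f"possible backpressure signal: {field}")
```

A suitable `min_increase` depends on your ingest volume and the interval between snapshots, so treat any flagged field as a prompt to investigate rather than a diagnosis.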

One likely cause of excessive indexing pressure or rejected requests is an undersized {es} cluster. To mitigate this, follow the guidance in {ref}/rejected-requests.html[Rejected requests].

(Not recommended) If scaling up {es} resources is not an option, you can adjust the `flush_bytes`, `flush_interval`, `max_retries`, and `timeout` settings described in <<apm-elasticsearch-output,Configure the Elasticsearch output>> to reduce APM Server indexing pressure. However, increasing the number of buffered documents and/or reducing retries may lead to a higher rate of dropped APM events. The following example configuration roughly doubles the default number of buffered documents while decreasing {es} indexing retries. It is a generic example and might not be applicable to your situation; adjust the settings further to see what works for you.
[source,yaml]
----
output.elasticsearch:
  flush_bytes: "2MB" # double the default value
  flush_interval: "2s" # double the default value
  max_retries: 1 # decrease the default value
  timeout: 60 # decrease the default value
----
5 changes: 4 additions & 1 deletion docs/en/observability/apm/troubleshoot-apm.asciidoc
@@ -9,6 +9,7 @@ and processing and performance guidance.
* <<apm-common-response-codes>>
* <<apm-processing-and-performance>>
* <<apm-enable-apm-server-debugging>>
* <<apm-performance-diagnostic>>

For additional help with other APM components, see the links below.

@@ -54,4 +55,6 @@ include::apm-response-codes.asciidoc[]

include::processing-performance.asciidoc[]

include::{observability-docs-root}/docs/en/observability/apm/debugging.asciidoc[]
include::{observability-docs-root}/docs/en/observability/apm/debugging.asciidoc[]

include::apm-performance-diagnostic.asciidoc[]
