When {es} is under excessive load or indexing pressure, APM Server could experience downstream backpressure when indexing new documents into {es}.
Most commonly, backpressure from {es} will manifest itself in the form of higher indexing latency and/or rejected requests, which in turn could lead APM Server to deny incoming requests.
As a result, APM agents connected to the affected APM Server will suffer from throttling and/or request timeouts when shipping APM events.

To quickly identify possible issues, try looking for similar error log lines in the APM Server logs:

----
...
----

To gain better insight into APM Server health and performance, consider enabling the monitoring feature by following the steps in <<apm-monitor-apm,Monitor APM Server>>.
When enabled, APM Server will additionally report a set of vital metrics to help you identify any performance degradation.
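
For reference, a standalone (non-Fleet) APM Server can ship these metrics via legacy internal collection. The following is a minimal sketch only, assuming a dedicated monitoring cluster; the host and credentials are hypothetical, and Fleet-managed setups are instead configured through the steps linked above:

[source,yaml]
----
monitoring:
  enabled: true
  elasticsearch:
    # Hypothetical dedicated monitoring cluster. If this section is omitted,
    # monitoring data is typically sent to the regular output cluster instead.
    hosts: ["https://monitoring-cluster.example.com:9200"]
    username: "apm_monitoring_user"      # hypothetical user
    password: "${MONITORING_PASSWORD}"   # resolved from the keystore or environment
----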

Pay careful attention to the following metric fields (an illustrative snapshot follows the list):

* `beats_stats.metrics.libbeat.output.events.active` that represents the number of buffered pending documents waiting to be ingested;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.acked` that represents the total number of documents that have been ingested successfully;
* `beats_stats.metrics.libbeat.output.events.failed` that represents the total number of documents that failed to ingest;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.metrics.libbeat.output.events.toomany` that represents the number of documents that failed to ingest due to {es} responding with 429 Too Many Requests;
(_if this value is increasing rapidly it may indicate {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.available` that represents the number of bulk indexers available for making bulk index requests;
(_if this value is equal to 0 it may indicate {es} backpressure_)
* `beats_stats.output.elasticsearch.bulk_requests.completed` that represents the number of already completed bulk requests;
* `beats_stats.metrics.output.elasticsearch.indexers.active` that represents the number of active bulk indexers that are concurrently processing batches.
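
Taken together, these counters tell the story. Below is a purely illustrative snapshot (all values hypothetical, shown as flattened field names) of what a sustained backpressure pattern can look like:

[source,yaml]
----
# Hypothetical values for illustration only: buffered documents keep climbing,
# no bulk indexers are available, and 429-driven failures accumulate.
beats_stats.metrics.libbeat.output.events.active: 52000          # rising fast
beats_stats.metrics.libbeat.output.events.acked: 1200000
beats_stats.metrics.libbeat.output.events.failed: 8300           # rising fast
beats_stats.metrics.libbeat.output.events.toomany: 8100          # rising fast
beats_stats.output.elasticsearch.bulk_requests.available: 0      # exhausted
beats_stats.output.elasticsearch.bulk_requests.completed: 64000
beats_stats.metrics.output.elasticsearch.indexers.active: 25
----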

See {metricbeat-ref}/exported-fields-beat.html[{metricbeat} documentation] for the full list of exported metric fields.

One likely cause of excessive indexing pressure or rejected requests is an undersized {es} cluster. To mitigate this, follow the guidance in {ref}/rejected-requests.html[Rejected requests].

If scaling {es} resources up is not an option, you can adjust the `flush_bytes`, `flush_interval`, `max_retries`, and `timeout` settings described in <<apm-elasticsearch-output,Configure the Elasticsearch output>> to reduce APM Server indexing pressure. However, keep in mind that increasing the number of buffered documents and/or reducing the number of retries may lead to a higher rate of dropped APM events. The custom configuration example below roughly doubles the default number of buffered documents while simultaneously decreasing the number of {es} indexing retries. It is a generic starting point and might not be applicable to your situation; adjust the settings further to see what works for you.
[source,yaml]
----
output.elasticsearch:
  flush_bytes: "2MB"   # double the default value
  flush_interval: "2s" # double the default value
  max_retries: 1       # decrease the default value
  timeout: 60          # decrease the default value
----
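
After applying changes like these, re-check the metric fields listed above. If `events.failed` and `events.toomany` continue to grow rapidly, {es} is likely still under pressure, and the scaling guidance in {ref}/rejected-requests.html[Rejected requests] remains the more sustainable fix.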
