Describe the enhancement:

A section of the docs concerned with avoiding data duplication suggests enabling a custom `_id` when indexing Beats documents: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html
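For context, the setup that page recommends uses the `fingerprint` processor to derive a stable `_id` from event contents (a minimal sketch based on the linked docs; the choice of `message` as the source field is illustrative):

```yaml
# filebeat.yml -- derive a stable document _id from the event contents,
# so a re-sent copy of the same event overwrites rather than duplicates.
processors:
  - fingerprint:
      fields: ["message"]
      target_field: "@metadata._id"
```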
Warn about the potential performance impact of using a custom `_id` value, and explain how to identify observed performance problems that might be caused by the use of a custom `_id` field.

Providing a custom id when indexing documents can carry a sizeable indexing performance hit at scale: Elasticsearch needs to check whether the document `_id` already exists before indexing it, and this lookup occurs only on primary shards, which can become a hotspot depending on shard configuration and how shards are balanced. The cost also grows with the number of documents that need to be searched for the existing id.

We specifically call out the performance costs of custom `_id` usage in the "Tune for indexing speed" how-to guide in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_use_auto_generated_ids
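One way to check whether primaries are becoming a hotspot is to compare per-shard indexing time and then inspect busy nodes (a sketch using the `_cat/shards` and hot threads APIs; not an exhaustive diagnostic):

```
GET _cat/shards?v&h=index,shard,prirep,node,indexing.index_time,indexing.index_total&s=indexing.index_time:desc
GET _nodes/hot_threads
```

If a few primary shards dominate `indexing.index_time` relative to their replicas and peers while custom ids are in use, the pre-index existence check is a likely contributor.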
We should also warn that this method of deduplication is only effective if the index being checked for duplicates doesn't roll over after the first copy of a document is successfully indexed but before the duplicate lookup occurs; in that case the duplicate will not be found in the current write index even though the original still exists elsewhere in the cluster. This might be particularly important when reprocessing a backlog that overlaps data already processed into an index that has since rolled over.
Lastly, while we're fixing this page, the statement "But if Filebeat shuts down during processing, or the connection is lost before events are acknowledged, you can end up with duplicate data." should be clarified: Filebeat does have a graceful shutdown process that generally avoids this, and it is non-graceful shutdowns, for whatever reason, that are the more likely cause of duplicates.
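For completeness, the graceful path can be tuned with the `filebeat.shutdown_timeout` setting (the 5s value below is illustrative), which tells Filebeat to wait for the publisher to finish delivering in-flight events before exiting:

```yaml
# filebeat.yml -- on shutdown, wait up to 5s for in-flight events
# to be acknowledged before exiting (this option is disabled by default)
filebeat.shutdown_timeout: 5s
```

Non-graceful terminations (kill -9, OOM kill, host crash) bypass this entirely, which is the scenario where duplicates are most likely to appear.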
Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!