Describe the enhancement:

A section of the docs concerned with avoiding data duplication suggests enabling a custom `_id` when indexing Beats documents: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html
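For context, the setup that page recommends uses the `fingerprint` processor to derive a stable `_id` from event contents (a minimal sketch based on the linked docs; the choice of `message` as the source field is illustrative):

```yaml
# filebeat.yml -- derive a stable document _id from the event contents,
# so a re-sent copy of the same event overwrites rather than duplicates.
processors:
  - fingerprint:
      fields: ["message"]
      target_field: "@metadata._id"
```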
Warn about the potential performance impact of using a custom `_id` value, and explain how to identify observed performance problems that might be caused by the use of a custom `_id` field.

Providing a custom id when indexing documents can carry a sizeable indexing performance hit at scale: Elasticsearch needs to check whether the document `_id` already exists before indexing it, and this lookup occurs only on primary shards, which can become a hotspot depending on shard configuration and how shards are balanced. The cost also grows with the number of documents that need to be searched for the existing id.

We specifically call out the performance costs of custom `_id` usage in the "Tune for indexing speed" how-to guide in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_use_auto_generated_ids
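One way to check whether primaries are becoming a hotspot is to compare per-shard indexing time and then inspect busy nodes (a sketch using the `_cat/shards` and hot threads APIs; not an exhaustive diagnostic):

```
GET _cat/shards?v&h=index,shard,prirep,node,indexing.index_time,indexing.index_total&s=indexing.index_time:desc
GET _nodes/hot_threads
```

If a few primary shards dominate `indexing.index_time` relative to their replicas and peers while custom ids are in use, the pre-index existence check is a likely contributor.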
We should also warn that this method of deduplication is only effective if the index being checked for duplicates doesn't roll over after the first copy of a document is successfully indexed but before the duplicate lookup occurs; in that case the duplicate will not be found in the current write index even though the original still exists elsewhere in the cluster. This might be particularly important when reprocessing a backlog that overlaps data already processed into an index that has since rolled over.
Lastly, while we're fixing this page, the statement "But if Filebeat shuts down during processing, or the connection is lost before events are acknowledged, you can end up with duplicate data." should be clarified: Filebeat does have a graceful shutdown process that generally avoids this, and it is non-graceful shutdowns, for whatever reason, that are the more likely cause of duplicates.
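For completeness, the graceful path can be tuned with the `filebeat.shutdown_timeout` setting (the 5s value below is illustrative), which tells Filebeat to wait for the publisher to finish delivering in-flight events before exiting:

```yaml
# filebeat.yml -- on shutdown, wait up to 5s for in-flight events
# to be acknowledged before exiting (this option is disabled by default)
filebeat.shutdown_timeout: 5s
```

Non-graceful terminations (kill -9, OOM kill, host crash) bypass this entirely, which is the scenario where duplicates are most likely to appear.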
Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!