[DOCS] Provide a performance warning on use of custom doc ids in deduplication docs section #33494

Open
geekpete opened this issue Oct 31, 2022 · 4 comments
Labels
Stalled Team:Docs Label for the Observability docs team

Comments

@geekpete
Member

Describe the enhancement:

A section of the docs on avoiding data duplication suggests enabling a custom _id when indexing Beats documents:
https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html

Warn about the potential performance impact of using a custom _id value, and explain how to identify an observed performance impact that might be caused by custom _id usage.

Providing a custom ID when indexing documents can carry a sizeable indexing performance hit at scale, because Elasticsearch needs to check whether the document _id already exists before indexing it. This lookup only occurs on primary shards, so it can look like a hotspot depending on shard configuration and how shards are balanced.
The cost also grows with the number of documents that need to be searched for the existing ID.
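
To illustrate the point, here is a minimal toy model in Python. It is not how Elasticsearch is implemented: `ToyShard`, its segment layout, and the `probes` counter are all hypothetical names invented for this sketch. It only shows why a write with a caller-supplied ID has to pay for an existence check that an auto-generated ID can skip, and why that check gets more expensive as the shard accumulates data.

```python
import uuid


class ToyShard:
    """Toy stand-in for a primary shard. NOT Elasticsearch's real implementation."""

    def __init__(self, segment_size=3):
        self.segment_size = segment_size
        self.segments = [{}]  # each segment maps _id -> document
        self.probes = 0       # per-segment existence checks performed so far

    def _write(self, doc_id, doc):
        # Start a new segment once the newest one is full.
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append({})
        self.segments[-1][doc_id] = doc

    def index_auto_id(self, doc):
        # A freshly generated ID is assumed unique, so no lookup is needed.
        doc_id = uuid.uuid4().hex
        self._write(doc_id, doc)
        return doc_id

    def index_custom_id(self, doc_id, doc):
        # A caller-supplied ID may already exist, so every segment is checked first.
        for segment in self.segments:
            self.probes += 1
            if doc_id in segment:
                segment[doc_id] = doc  # overwrite instead of duplicating
                return "updated"
        self._write(doc_id, doc)
        return "created"


auto, custom = ToyShard(), ToyShard()
for i in range(9):
    auto.index_auto_id({"n": i})
    custom.index_custom_id(f"event-{i}", {"n": i})

probes_after_load = custom.probes
print("auto-ID probes:", auto.probes)
print("custom-ID probes:", probes_after_load)

# Re-sending an event with the same ID is detected and overwritten.
resend_result = custom.index_custom_id("event-0", {"n": 0})
print("resend:", resend_result)
```

In this toy, the number of probes per write rises with the number of segments, which mirrors the "cost grows with the number of documents" observation only in spirit; real lookup costs depend on shard and segment details that this sketch does not model.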

We specifically call out the performance costs of custom _id usage in our Tuning for Indexing Speed Howto guide in the Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_use_auto_generated_ids

We should also warn that this deduplication method is only effective if the target index doesn't roll over between the first copy of a document being successfully indexed and a duplicate arriving. After a rollover, the duplicate lookup runs against the current index, where the original document will not be found, even though it still exists within the cluster. This might be particularly important when reprocessing a backlog that overlaps with already processed data sitting in an index that has already rolled over.
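
The rollover caveat can be sketched with another small Python toy. `ToyAlias` and its methods are hypothetical names for illustration only, and the model assumes, as described above, that the ID check only sees the index currently being written to.

```python
class ToyAlias:
    """Toy write alias over a series of rolled-over indices (illustrative only)."""

    def __init__(self):
        self.indices = [{}]  # the last entry is the current write index

    def rollover(self):
        # New writes now go to a fresh, empty index.
        self.indices.append({})

    def index(self, doc_id, doc):
        # The existence check only sees the current write index.
        write_index = self.indices[-1]
        result = "updated" if doc_id in write_index else "created"
        write_index[doc_id] = doc
        return result

    def copies(self, doc_id):
        # How many indices across the whole "cluster" hold this ID.
        return sum(doc_id in index for index in self.indices)


alias = ToyAlias()
first = alias.index("event-1", {"msg": "hello"})
retry = alias.index("event-1", {"msg": "hello"})        # same index: deduplicated
alias.rollover()
after_rollover = alias.index("event-1", {"msg": "hello"})  # new index: duplicate

print(first, retry, after_rollover)
print("copies in cluster:", alias.copies("event-1"))
```

The retry before the rollover is absorbed as an update, while the same event replayed after the rollover is created again, leaving two copies across the indices.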

Lastly, while we're fixing this page, the sentence "But if Filebeat shuts down during processing, or the connection is lost before events are acknowledged, you can end up with duplicate data." should be clarified. It should make clear that Filebeat has a graceful shutdown process that generally avoids this, and that non-graceful shutdowns, for whatever reason, are the more likely scenarios to cause duplicates.

@geekpete geekpete added the Team:Docs Label for the Observability docs team label Oct 31, 2022
@elasticmachine
Collaborator

Pinging @elastic/obs-docs (Team:Docs)

@botelastic

botelastic bot commented Oct 31, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 31, 2023
@geekpete
Member Author

Related elastic/elasticsearch#93455

@botelastic botelastic bot removed the Stalled label Oct 31, 2023
@botelastic

botelastic bot commented Oct 30, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 30, 2024