Enrich traces and request logs with Elasticsearch error details #3098
Comments
@joshdover I'm wondering where the I found the fleet-server logs here: https://overview.elastic-cloud.com/app/r/s/S2FUG For those ES connection errors I'm seeing this in |
We wrap the fleet-server/internal/pkg/api/router.go Lines 23 to 26 in a465d09
I believe from somewhere in fleet-server/internal/pkg/api/error.go Line 519 in a465d09
apm.TransactionFromContext(r.Context()).Context.SetLabel("elasticsearch_error_type", "foo")
These TCP/connection-level errors aren't so obvious to classify. The cause could be a temporary network issue, an ES or proxy pod that got killed or shut down, or a timeout in ES or the proxy (though I would expect the proxy to return a 504 if ES times out). I think we can classify this type of error as a |
We should also enrich with the actual status code; right now there is only |
I believe this already exists on some |
We want to avoid being paged for Elasticsearch problems that the Fleet team cannot take any action on. In order to filter these alerts from our SLOs and/or route these pages to the Elasticsearch team, we need to enrich our observability data with more details about the Elasticsearch error that caused a request to fail. This will allow us to distinguish real Fleet-related problems that we can solve (e.g., a misconfigured ES URL or service token) from Elasticsearch or other infrastructure problems that we cannot take action on.
We should add a field that helps us do this to all traces and request logs.
TODO: fill in list of ES errors that we can safely ignore