Enrich traces and request logs with Elasticsearch error details #3098

Closed
joshdover opened this issue Nov 13, 2023 · 4 comments · Fixed by #3124
Labels
Project:Serverless Team:Fleet Label for the Fleet team

Comments

@joshdover
Contributor

We want to avoid being paged for Elasticsearch problems that the Fleet team cannot take any action on. In order to filter these alerts from our SLOs and/or route these pages to the Elasticsearch team, we need to enrich our observability data with more details about the Elasticsearch error that caused a request to fail. This will allow us to distinguish real Fleet-related problems that we can solve (e.g. a misconfigured ES URL or service token) from Elasticsearch or other infrastructure problems that we cannot take action on.

We should add a field that helps us do this to all traces and request logs.

TODO: fill in list of ES errors that we can safely ignore

@joshdover joshdover added Team:Fleet Label for the Fleet team Project:Serverless labels Nov 13, 2023
@juliaElastic
Contributor

juliaElastic commented Nov 14, 2023

@joshdover I'm wondering where the request-type APM events are coming from. Are they sent from fleet-server or the proxy?
I'm asking because I haven't found any of this logging in fleet-server; I only see StartTransaction calls with bulker.

I found the fleet-server logs here: https://overview.elastic-cloud.com/app/r/s/S2FUG

For those ES connection errors I'm seeing this in error.message: read tcp 10.2.4.8:49270->10.253.48.33:443: read: connection reset by peer

@joshdover
Contributor Author

We wrap the chi router here with middleware provided by the APM go agent which automatically provides the default request transactions:

r := chi.NewRouter()
if tracer != nil {
	r.Use(apmchiv5.Middleware(apmchiv5.WithTracer(tracer)))
}

I believe from somewhere in

zlog := hlog.FromRequest(r)
we could do something like apm.TransactionFromContext(r.Context()).Context.SetLabel("elasticsearch_error_type", "foo")

> For those ES connection errors I'm seeing this in error.message: read tcp 10.2.4.8:49270->10.253.48.33:443: read: connection reset by peer

These tcp/connection level errors aren't so obvious to classify. It could be a temporary network issue, an ES or proxy pod got killed or shutdown, or ES or the proxy timed out (though I would expect the proxy to return a 504 if ES times out).

I think we can classify this type of error as a connection_reset and then decide at the SLO/alerting level what we will include or exclude as something we can take action on. That way we can make adjustments without having to deploy code changes.

@juliaElastic
Contributor

We should also enrich the actual status code, now there is only transaction.result: HTTP 5xx, we might want to differentiate errors based on the status code.

@joshdover
Contributor Author

> We should also enrich the actual status code, now there is only transaction.result: HTTP 5xx, we might want to differentiate errors based on the status code.

I believe this already exists on some http.* field.


3 participants