Enrich traces and request logs with Elasticsearch error details #3098

joshdover · 2023-11-13T16:03:11Z

We want to avoid being paged for Elasticsearch problems that the Fleet team cannot take any action on. To filter these alerts from our SLOs and/or route these pages to the Elasticsearch team, we need to enrich our observability data with more details about the Elasticsearch error that caused a request to fail. This will allow us to distinguish real Fleet-related problems that we can solve (e.g. a misconfigured ES URL or service token) from Elasticsearch or other infrastructure problems that we cannot take action on.

We should add a field that helps us do this to all traces and request logs.

TODO: fill in list of ES errors that we can safely ignore

juliaElastic · 2023-11-14T08:46:14Z

@joshdover I'm wondering where the request-type APM events are coming from: are they sent from fleet-server or the proxy?
I'm asking because I haven't found any of this logging in fleet-server; I'm only seeing StartTransaction calls with bulker.

I found the fleet-server logs here: https://overview.elastic-cloud.com/app/r/s/S2FUG

For those ES connection errors I'm seeing this in error.message: read tcp 10.2.4.8:49270->10.253.48.33:443: read: connection reset by peer

joshdover · 2023-11-14T09:06:22Z

We wrap the chi router here with middleware provided by the APM go agent which automatically provides the default request transactions:

fleet-server/internal/pkg/api/router.go

Lines 23 to 26 in a465d09

    r := chi.NewRouter()
    if tracer != nil {
    	r.Use(apmchiv5.Middleware(apmchiv5.WithTracer(tracer)))
    }

I believe from somewhere in

fleet-server/internal/pkg/api/error.go

Line 519 in a465d09

zlog := hlog.FromRequest(r)

we could do something like apm.TransactionFromContext(r.Context()).Context.SetLabel("elasticsearch_error_type", "foo") (after checking that the returned transaction is not nil).

> For those ES connection errors I'm seeing this in error.message: read tcp 10.2.4.8:49270->10.253.48.33:443: read: connection reset by peer

These TCP/connection-level errors aren't so obvious to classify. It could be a temporary network issue, an ES or proxy pod that got killed or shut down, or a timeout in ES or the proxy (though I would expect the proxy to return a 504 if ES times out).

I think we can classify this type of error as a connection_reset and then decide at the SLO/alerting level what we will include or exclude as something we can take action on. That way we can make adjustments without having to deploy code changes.

juliaElastic · 2023-11-14T13:39:50Z

We should also enrich the actual status code, now there is only transaction.result: HTTP 5xx, we might want to differentiate errors based on the status code.

joshdover · 2023-11-14T15:48:29Z

> We should also enrich the actual status code, now there is only transaction.result: HTTP 5xx, we might want to differentiate errors based on the status code.

I believe this already exists on some http.* field.

joshdover added Team:Fleet and Project:Serverless labels Nov 13, 2023

kpollich assigned michel-laterman Nov 21, 2023

michel-laterman mentioned this issue Nov 27, 2023

Add additional transaction labels with error details to requests. #3124

Merged

3 tasks

michel-laterman closed this as completed in #3124 Nov 28, 2023
