
use poll timeout in es ctx #3986

Draft: wants to merge 4 commits into main

Conversation

juliaElastic (Contributor)

What is the problem this PR solves?

// Please do not just reference an issue. Explain here WHAT problem this PR solves.

How does this PR solve the problem?

// Explain HOW you solved the problem in your code. It is possible that this changes during PR review, in which case this section should be updated.

How to test this PR locally

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues


mergify bot commented Oct 8, 2024

This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch; \d is the digit


mergify bot commented Oct 8, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x label (Automated backport to the 8.x branch with mergify) on Oct 8, 2024
@@ -337,6 +337,9 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
actions, ackToken = convertActions(zlog, agent.Id, pendingActions)

span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
ctx, cancel := context.WithTimeout(ctx, pollDuration)
Member commented:

Interesting. Evaluating what the timeouts and lifetimes of all the requests in here actually are is challenging, and I'm not sure they are right.

This context is not obviously tied to the actual network requests. Also, at this point we are past auth, which should probably also respect a timeout, and it's not clear that it is tied to the poll duration either.

What this change does is cause us to hit the ctx.Done() block below, which triggers the ct.writeResponse call. That call is a network operation that should also have a timeout, but it can't be tied to this context because it has expired, so we'd need a different one.
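One way to bound that final write without relying on the expired poll context would be net/http's ResponseController (Go 1.20+). This is only an illustrative sketch of the idea, not something this PR does, and the 10-second value is arbitrary:

// the poll context has already expired, so bound the response write directly
rc := http.NewResponseController(w)
if err := rc.SetWriteDeadline(time.Now().Add(10 * time.Second)); err != nil {
    zlog.Warn().Err(err).Msg("could not set write deadline")
}
return ct.writeResponse(zlog, w, r, agent, resp)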

Member commented:

Looking closer, this context deadline actually changes the checkin logic without setting a deadline on any of the underlying network operations or interactions with ES.

There is already a ticker for the long poll duration:

// Chill out for a bit. Long poll.
longPoll := time.NewTicker(pollDuration)
defer longPoll.Stop()

It causes us to hit the CheckIn method here:

case <-tick.C:
    err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)
    if err != nil {
        zlog.Error().Err(err).Str(logger.AgentID, agent.Id).Msg("checkin failed")
    }
}

So all setting this context deadline does is get us to this block, but only if we aren't already in the CheckIn method:

case <-ctx.Done():
    defer span.End()
    // If the request context is canceled, the API server is shutting down.
    // We want to immediately stop the long-poll and return a 200 with the ackToken and no actions.
    if errors.Is(ctx.Err(), context.Canceled) {
        resp := CheckinResponse{
            AckToken: &ackToken,
            Action:   "checkin",
        }
        return ct.writeResponse(zlog, w, r, agent, resp)
    }
    return ctx.Err()

If both <-ctx.Done() and <-tick.C are pending at the same time the Go runtime will randomly choose which case is taken, so the behavior here isn't even deterministic.
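As a small self-contained illustration of that non-determinism (a standalone example, not fleet-server code), when several select cases are ready at the same time the Go runtime picks one pseudo-randomly:

package main

import "fmt"

func main() {
    done := make(chan struct{})
    tick := make(chan struct{})
    close(done) // both channels are immediately ready
    close(tick)

    counts := map[string]int{}
    for i := 0; i < 1000; i++ {
        select {
        case <-done:
            counts["done"]++
        case <-tick:
            counts["tick"]++
        }
    }
    fmt.Println(counts) // roughly a 50/50 split rather than a fixed preference
}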

juliaElastic (author) commented:

Thanks for looking into it. Do you think there is a way to enforce the pollDuration on the underlying network requests? Maybe someone from the Control Plane team can spend some time on this; it seems to be the reason some drones are stuck in a failed checkin state.

Member commented:

Looking closer, if we are in this loop, the only place where we interact directly with ES is in processPolicy, and putting the deadline on the context solves that case:

actionResp, err := processPolicy(ctx, zlog, ct.bulker, agent.Id, policy)

If I follow this all the way down the call stack to where the actual search call happens, the context should prevent us from waiting forever for a response:

func (b *Bulker) dispatch(ctx context.Context, blk *bulkT) respT {
    start := time.Now()
    // Dispatch to bulk Run loop
    select {
    case b.ch <- blk:
    case <-ctx.Done():

The context that is actually on the underlying ES network request is the one in the bulker Run function:

if err := b.flushQueue(ctx, w, *q); err != nil {

The context for this appears to just be tied to context.Background in multiple places:

// Bulker is started in its own context and managed in the scope of this function. This is done so
// when the `ctx` is cancelled, the bulker will remain executing until this function exits.
// This allows the child subsystems to continue to write to the data store while tearing down.
bulkCtx, bulkCancel := context.WithCancel(context.Background())
defer bulkCancel()

bulkCtx, bulkCancel := context.WithCancel(context.Background())
es, err := b.createRemoteEsClient(bulkCtx, outputName, outputMap)
if err != nil {
    defer bulkCancel()
    return nil, hasConfigChanged, err
}
// starting a new bulker to create/update API keys for remote ES output
newBulker := NewBulker(es, b.tracer)
newBulker.cancelFn = bulkCancel
b.updateBulkerMap(outputName, newBulker)
errCh := make(chan error)
go func() {
    runFunc := func() (err error) {
        zlog.Debug().Str(logger.PolicyOutputName, outputName).Msg("Bulker started")
        return newBulker.Run(bulkCtx)

It does look like there may be a default 90s timeout on the underlying ES client, but I don't see this actually being called anywhere (possible I missed it):

func (c *Elasticsearch) InitDefaults() {
    c.Protocol = schemeHTTP
    c.Hosts = []string{"localhost:9200"}
    c.Timeout = 90 * time.Second

There are lots of places that could be the problem. TBH, I'd just add more logging or spans until we know definitively where we get stuck when these 28+ minute checkins happen.
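A minimal sketch of the kind of instrumentation that could be added, wrapping one suspected call (processPolicy here, but the same shape applies to dispatch or flushQueue) in its own span plus a duration log; the span name and the 30-second threshold are illustrative choices, not existing code:

span, spanCtx := apm.StartSpan(ctx, "processPolicy", "process")
start := time.Now()
actionResp, err := processPolicy(spanCtx, zlog, ct.bulker, agent.Id, policy)
span.End()
if d := time.Since(start); d > 30*time.Second {
    zlog.Warn().Dur("elapsed", d).Str(logger.AgentID, agent.Id).Msg("processPolicy took unusually long")
}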

Member commented:

> Maybe someone from the Control Plane team can spend some time on this; it seems to be the reason some drones are stuck in a failed checkin state.

For now that person is me; let's avoid context-switching someone else in while we narrow down what is actually wrong and can evaluate the effort to fix it.

juliaElastic (author) commented Oct 9, 2024:

We could log the actual timeout used, to see whether the default is applied. It could be logged here:

Int("cluster.maxConnsPersHost", mcph).

Dur("cluster.timeout", cfg.Output.Elasticsearch.Timeout).

I logged it out and it seems to be 90s as defined.

@@ -337,15 +337,20 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
actions, ackToken = convertActions(zlog, agent.Id, pendingActions)

span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
// ctx, cancel := context.WithTimeout(ctx, pollDuration)
// defer cancel()

if len(actions) == 0 {
LOOP:
for {
select {
case <-ctx.Done():
Contributor commented:

Looking at this currently, the only time this case would be hit is if the client closes their connection to Fleet Server, since span, ctx := apm.StartSpan(r.Context(), ...) is what is being used as the context here; this section then writes the response. Looking at this code, it shouldn't even write the response: if the context is cancelled, that means the client is no longer connected.
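If a deadline were put on this context, the two termination causes could be told apart; a minimal sketch of that distinction (illustrative only, not code in this PR):

case <-ctx.Done():
    switch {
    case errors.Is(ctx.Err(), context.DeadlineExceeded):
        // the pollDuration deadline fired; the client is still connected,
        // so writing the ackToken response is reasonable
    case errors.Is(ctx.Err(), context.Canceled):
        // r.Context() was cancelled: the client disconnected or the server
        // is shutting down, so writing a response is likely pointless
    }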

juliaElastic (author) commented Oct 25, 2024:

The logic of returning a CheckinResponse here was added in this PR: https://github.com/elastic/fleet-server/pull/3165/files#diff-e0c02bac8d151e9941eedd5ef643441665ee3d2f78baf42c121edd45dee08ded

Can the context be cancelled only by the client here? I'm wondering, if the writeResponse is successful, is it correct to return the AckToken without actions? I'm trying to find where the AckToken is persisted back to ES.
I think it's persisted in the action_seq_no field on a successful checkin here:

fields[dl.FieldActionSeqNo] = pendingData.extra.seqNo

I'm wondering if there is any retry if the agent's /acks request fails or fleet-server fails to persist the action result. I've seen some stuck upgrades where the ack failed and was never retried.
Though when testing this locally with a simulated error instead of writing the action result, I do see retries happening.

@@ -360,6 +365,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
actions = append(actions, acs...)
break LOOP
case policy := <-sub.Output():
zlog.Debug().Str(logger.AgentID, agent.Id).Msg("SCALEDEBUG new policy")
actionResp, err := processPolicy(ctx, zlog, ct.bulker, agent.Id, policy)
Contributor commented:

It's possible that it gets stuck here processing the policy, as it doesn't create its own context that times out after a period of time. It is still using the same context as the request connection.
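A minimal sketch of giving processPolicy its own bounded context so a slow ES interaction cannot hang this case indefinitely (one possible shape of a fix, with pollDuration as an illustrative bound; not what the PR currently does):

case policy := <-sub.Output():
    // derive a child context so a slow ES interaction cannot block this case forever
    policyCtx, policyCancel := context.WithTimeout(ctx, pollDuration)
    actionResp, err := processPolicy(policyCtx, zlog, ct.bulker, agent.Id, policy)
    policyCancel()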

@@ -368,7 +374,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
actions = append(actions, *actionResp)
break LOOP
case <-longPoll.C:
zlog.Trace().Msg("fire long poll")
zlog.Debug().Str(logger.AgentID, agent.Id).Msg("fire long poll")
break LOOP
case <-tick.C:
err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)
Contributor commented:

It doesn't seem likely that it gets stuck here, as ct.bc.CheckIn just grabs a lock and adds to a map. But it is possible that there is a deadlock here and that lock is held and never freed; that shouldn't be ruled out.
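A rough sketch of the kind of operation ct.bc.CheckIn is described as doing (the type and field names below are assumptions for illustration, not the real fleet-server types), showing why a lock that is held and never freed would block every later checkin:

type bulkCheckin struct {
    mu      sync.Mutex
    pending map[string]checkinEntry // keyed by agent ID
}

type checkinEntry struct {
    status, message string
    ts              time.Time
}

func (b *bulkCheckin) CheckIn(agentID, status, message string) error {
    b.mu.Lock()         // if any other path holds mu and never releases it,
    defer b.mu.Unlock() // every subsequent caller blocks here indefinitely
    b.pending[agentID] = checkinEntry{status: status, message: message, ts: time.Now()}
    return nil
}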

juliaElastic (author) commented:

The long-poll issue is reproduced again here: https://github.com/elastic/ingest-dev/issues/3783#issuecomment-2429301669
I'm not seeing any of these added logs showing up, and it still seems to be the issue with the long-running ES request.

Labels: backport-8.x (Automated backport to the 8.x branch with mergify)
4 participants