use poll timeout in es ctx #3986
Draft: juliaElastic wants to merge 4 commits into elastic:main from juliaElastic:es-timeout-long-poll.
+29 −1
Changes from 1 commit (of the 4 in this PR).
internal/pkg/api/handleCheckin.go:

```diff
@@ -337,6 +337,9 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
 	actions, ackToken = convertActions(zlog, agent.Id, pendingActions)
 
+	span, ctx := apm.StartSpan(r.Context(), "longPoll", "process")
+	ctx, cancel := context.WithTimeout(ctx, pollDuration)
+	defer cancel()
 
 	if len(actions) == 0 {
 	LOOP:
 		for {
@@ -368,7 +371,7 @@ func (ct *CheckinT) ProcessRequest(zlog zerolog.Logger, w http.ResponseWriter, r
 				actions = append(actions, *actionResp)
 				break LOOP
 			case <-longPoll.C:
-				zlog.Trace().Msg("fire long poll")
+				zlog.Debug().Str(logger.AgentID, agent.Id).Msg("fire long poll")
 				break LOOP
 			case <-tick.C:
 				err := ct.bc.CheckIn(agent.Id, string(req.Status), req.Message, nil, rawComponents, nil, ver, unhealthyReason, false)
```

Review comment on the `case <-tick.C:` branch: Doesn't seem likely that it gets stuck here as …
Interesting. Evaluating the timeouts and lifetimes of all the requests in here is actually challenging, and I'm not sure they are right.

This context is not obviously tied to the actual network requests. Also, at this point we are past auth, which should probably also respect a timeout, and it's not clear that it is tied to the poll duration either.

What this does is cause us to hit the `ctx.Done()` block below, which triggers the `ct.writeResponse` call. That call is a network operation that should also have a timeout, but it can't be tied to this context because it is expired, so we'd need a different one.
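A minimal sketch of that last point (the names `finishCheckin`, `write`, and the 30-second budget are illustrative, not fleet-server's actual API): once the long-poll context is done, the response write has to run under its own fresh deadline.

```go
package checkin

import (
	"context"
	"time"
)

// finishCheckin waits for the long-poll context to end and then performs the
// response write under a separate, fresh deadline. Deriving the write context
// from longPollCtx would make the write fail immediately with
// context.DeadlineExceeded, because longPollCtx is already done.
func finishCheckin(longPollCtx context.Context, write func(context.Context) error) error {
	<-longPollCtx.Done() // long poll timed out or the request was cancelled

	writeCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return write(writeCtx)
}
```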
Looking closer, this context deadline actually changes the checkin logic without setting a deadline on any of the underlying network operations or interactions with ES.

There is already a ticker for the long poll duration: fleet-server/internal/pkg/api/handleCheckin.go, lines 316 to 319 in 7ecbda1. It causes us to hit the `CheckIn` method here: fleet-server/internal/pkg/api/handleCheckin.go, lines 374 to 379 in 7ecbda1.

So all setting this context deadline does is get us to this block, and only if we aren't already in the `CheckIn` method: fleet-server/internal/pkg/api/handleCheckin.go, lines 345 to 356 in 7ecbda1.

If both `<-ctx.Done()` and `<-tick.C` are pending at the same time, the Go runtime will randomly choose which case is taken, so the behavior here isn't even deterministic.
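A small runnable demo of that nondeterminism, independent of fleet-server: when both cases of a `select` are ready, the runtime picks one pseudo-randomly, so an expired context and a fired ticker race each other.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	for i := 0; i < 5; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
		tick := time.NewTicker(10 * time.Millisecond)

		time.Sleep(20 * time.Millisecond) // let both cases become ready

		// With both channels ready, the winner varies from run to run.
		select {
		case <-ctx.Done():
			fmt.Println("ctx.Done() won")
		case <-tick.C:
			fmt.Println("tick.C won")
		}

		tick.Stop()
		cancel()
	}
}
```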
Thanks for looking into it. Do you think there is a way to enforce the `pollDuration` on the underlying network requests? Maybe someone from the Control Plane team can spend some time on this; it seems to be the reason some drones are stuck in a failed checkin state.
Looking closer, if we are in this loop, the only place where we interact directly with ES is in `processPolicy`, and putting the deadline on the context solves that case: fleet-server/internal/pkg/api/handleCheckin.go, line 364 in 7ecbda1.

If I follow this all the way down the call stack to where the actual search call happens, the context should prevent us from waiting forever for a response: fleet-server/internal/pkg/bulk/engine.go, lines 568 to 574 in 7ecbda1.

The context that is actually attached to the underlying ES network request is the one in the bulker run function: fleet-server/internal/pkg/bulk/engine.go, line 356 in 7ecbda1.
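For reference, a standalone sketch of the general mechanism, using the official go-elasticsearch client directly rather than fleet-server's bulker (the index name and durations are assumptions): a deadline on the context bounds the underlying search call instead of letting it wait forever.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient() // ELASTICSEARCH_URL or http://localhost:9200
	if err != nil {
		log.Fatal(err)
	}

	pollDuration := 5 * time.Minute
	ctx, cancel := context.WithTimeout(context.Background(), pollDuration)
	defer cancel()

	// If ES never answers, this returns once ctx expires instead of hanging.
	res, err := es.Search(
		es.Search.WithContext(ctx),
		es.Search.WithIndex(".fleet-policies"),
	)
	if err != nil {
		log.Printf("search failed (possibly context deadline exceeded): %v", err)
		return
	}
	defer res.Body.Close()
	log.Printf("search status: %s", res.Status())
}
```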
The context for this appears to just be tied to `context.Background` in multiple places: fleet-server/internal/pkg/server/fleet.go, lines 374 to 379 in 7ecbda1, and fleet-server/internal/pkg/bulk/engine.go, lines 166 to 182 in 7ecbda1.
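A simplified illustration of why that matters (this is not the actual bulker implementation): work queued to a background engine runs under the engine's context, so a deadline on the caller's request context never reaches the ES round trip.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type job struct {
	do func(ctx context.Context)
}

// runEngine drains the queue using its own context; any deadline the caller
// had on its own context is invisible to the queued work.
func runEngine(engineCtx context.Context, queue <-chan job) {
	for {
		select {
		case <-engineCtx.Done():
			return
		case j := <-queue:
			j.do(engineCtx)
		}
	}
}

func main() {
	queue := make(chan job, 1)
	go runEngine(context.Background(), queue) // engine lives on context.Background

	callerCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	queue <- job{do: func(ctx context.Context) {
		time.Sleep(200 * time.Millisecond) // pretend this is a slow ES round trip
		fmt.Println("engine finished the work; engine ctx err:", ctx.Err())
	}}

	<-callerCtx.Done()
	fmt.Println("caller gave up:", callerCtx.Err())
	time.Sleep(300 * time.Millisecond) // let the engine print its line
}
```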
It does look like there may be a default 90s timeout on the underlying ES client, but I don't see this actually being called anywhere (possible I missed it): fleet-server/internal/pkg/config/output.go, lines 59 to 62 in 7ecbda1.
There are lots of places that could be the problem; TBH, I'd just add more logging or spans until we definitively know exactly where we get stuck when these 28+ minute checkins happen.
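One way to do that, sketched below with illustrative names (only `apm.StartSpan` and the zerolog calls are real APIs; the wrapper itself is hypothetical): wrap each ES interaction in a span plus a duration log so a 28+ minute stall surfaces with a concrete name and elapsed time.

```go
package esdebug

import (
	"context"
	"time"

	"github.com/rs/zerolog"
	apm "go.elastic.co/apm/v2" // or go.elastic.co/apm, depending on the module version in use
)

// timedES wraps a single ES interaction in an APM span and emits a duration
// log, so an unexpectedly long call is attributable to a specific operation.
func timedES(ctx context.Context, zlog zerolog.Logger, name string, op func(context.Context) error) error {
	span, ctx := apm.StartSpan(ctx, name, "es")
	defer span.End()

	start := time.Now()
	err := op(ctx)
	zlog.Debug().
		Str("op", name).
		Dur("took", time.Since(start)).
		Err(err).
		Msg("es operation finished")
	return err
}
```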
For now that person is me; let's avoid context-switching someone else in while we narrow down what is actually wrong and can evaluate the effort to fix it.
We could log the actual timeout used, to see whether the default is applied. It could be logged here: fleet-server/internal/pkg/es/client.go, line 39 in 7ecbda1.

Logged it out, and it seems to be 90s as defined.
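A sketch of that one-off log line; the timeout value passed in is a placeholder for whatever the ES client config exposes, not fleet-server's actual configuration structure, and only the zerolog calls are real APIs.

```go
package esdebug

import (
	"time"

	"github.com/rs/zerolog"
)

// logESTimeout records the timeout the ES client was built with, so the log
// confirms whether the 90s default (or something else) is actually in effect.
func logESTimeout(zlog zerolog.Logger, timeout time.Duration) {
	zlog.Info().
		Dur("es_client_timeout", timeout). // expected to show 1m30s if the default applies
		Msg("elasticsearch client timeout in use")
}
```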