-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(activator): Don't cancel all probes on one probe failure #14303
Conversation
By using errgroup.WithContext, all probes are cancelled on the first error returned. This changes to use an errgroup without a context/cancellation. So all probes are allowed to run to completion and one failed probe does not cause all probes to exit. Ref knative#14200
Welcome @arsenetar! It looks like this is your first PR to knative/serving 🎉 |
Hi @arsenetar. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #14303 +/- ##
==========================================
- Coverage 86.23% 86.08% -0.15%
==========================================
Files 195 196 +1
Lines 14702 14783 +81
==========================================
+ Hits 12678 12726 +48
- Misses 1723 1749 +26
- Partials 301 308 +7
☔ View full report in Codecov by Sentry. |
LGTM
Yes, it would be great if we could have the tests. (Not necessary to change in this PR but #9540 added the errgroup. It seems that we should not use the errgroup for some other places as well such as |
@nak3 I noticed the same pattern in those other locations as you mentioned after changing this one, I think they should likely be changed to be the same as here now. I can add them to this PR if you would like. |
Let's do this in a separate PR
💯 |
- Add pod IP probe tests to directly test the pod IP probing behavior - Add test for no-probe optimization when all healthy - Add test for a pod returning an error to verify it does not block / cancel other probes (confirmed test with prior code which fails) - Update the fake roundtripper to support the new pod IP probe tests by introducing the Delay field to be used to delay the response. Default handling skips the delay if not set. - Add comment to errgroup change since this had been correct before and was incorrectly changed. (Additionally change to use original form.)
Tests have been added to cover this change, additionally added a comment in the code for awareness since this had been previously correct and was changed to incorrect behavior. |
/lgtm Thank you so much! |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test looks good - just some minor stuff
@@ -137,6 +153,17 @@ func (rt *FakeRoundTripper) RT(req *http.Request) (*http.Response, error) { | |||
resp = defaultRequestResponse() | |||
} | |||
|
|||
// Delay if set before sending response | |||
if resp.Delay.Seconds() != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating this here should we just dedupe this code and put it at the beginning of the RT
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The could could be defined as a function ahead of the call, but it needs to be called after the resp
value is fetched which happens in two different code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be defined as a function
let's do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is updated now.
Update to use a function for the delay to reduce duplicate code.
Cluster failed to start
/retest |
/lgtm thanks @arsenetar |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: arsenetar, dprotaso The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
- Backport activator fixes from knative#14303 and knative#14347 from 1.12 - Add custom patches for logs and probe durations - Update to go 1.20 - Add patch from knative#14022 - Add custom CI workflows
Proposed Changes
Currently
errgroup.WithContext()
is used to initialize the probeGroup, which causes all probes to be cancelled through the context on first first error returned. This can cause one bad pod to cause a fast failure essentially blocking all other pods from becoming healthy as the probes are canceled before they can return and update. There is no reason why a probe to one pod should cancel probes to other pods. This changes the errGroup creation to not have a cancelation function.Ref #14200
Release Note
NOTE: I can look at adding a test around this, but will have to spend some time looking at how the tests are setup.