fix(activator): Don't cancel all probes on one probe failure #14303

Merged
3 commits merged on Sep 12, 2023

arsenetar
Contributor

@arsenetar arsenetar commented Aug 28, 2023

Proposed Changes

Currently errgroup.WithContext() is used to initialize the probeGroup, which causes all probes to be cancelled through the context on the first error returned. As a result, one bad pod can trigger a fast failure that essentially blocks all other pods from becoming healthy, because their probes are cancelled before they can return and update. There is no reason why a probe to one pod should cancel probes to other pods. This change creates the errgroup without a cancellation function.
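To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the activator code: it uses only the standard library (context and sync) to mimic the cancel-on-first-error behaviour of errgroup.WithContext, and the names probe, runProbes, and the pod names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// probe is a hypothetical stand-in for a pod probe. The "bad" pod fails
// immediately; every other pod succeeds after a short delay unless the
// context is cancelled first.
func probe(ctx context.Context, pod string) error {
	if pod == "bad" {
		return errors.New("probe failed: " + pod)
	}
	select {
	case <-time.After(200 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runProbes probes all pods concurrently and returns the set of pods that
// became healthy. With cancelOnError set, the first failure cancels the
// shared context, which is the behaviour errgroup.WithContext provides.
func runProbes(pods []string, cancelOnError bool) map[string]bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		healthy = make(map[string]bool)
	)
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			if err := probe(ctx, pod); err != nil {
				if cancelOnError {
					cancel() // one failure aborts every other in-flight probe
				}
				return
			}
			mu.Lock()
			healthy[pod] = true
			mu.Unlock()
		}(pod)
	}
	wg.Wait()
	return healthy
}

func main() {
	pods := []string{"bad", "good-1", "good-2"}
	fmt.Println("cancel on first error:", len(runProbes(pods, true)), "healthy")
	fmt.Println("independent probes:", len(runProbes(pods, false)), "healthy")
}
```

In this sketch the shared cancellation means the two good pods never get marked healthy, while the independent variant lets them finish regardless of the bad pod.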

Ref #14200

Release Note

Activator no longer cancels all probes when one fails

NOTE: I can look at adding a test around this, but will have to spend some time looking at how the tests are set up.

By using errgroup.WithContext, all probes are cancelled on the first
error returned.  This changes to use an errgroup without a
context/cancellation.  So all probes are allowed to run to completion
and one failed probe does not cause all probes to exit.

Ref knative#14200
@knative-prow

knative-prow bot commented Aug 28, 2023

Welcome @arsenetar! It looks like this is your first PR to knative/serving 🎉

@knative-prow knative-prow bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2023
@knative-prow

knative-prow bot commented Aug 28, 2023

Hi @arsenetar. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nak3 nak3 added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 29, 2023
@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.15% ⚠️

Comparison is base (43f7526) 86.23% compared to head (ba249c9) 86.08%.
Report is 43 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14303      +/-   ##
==========================================
- Coverage   86.23%   86.08%   -0.15%     
==========================================
  Files         195      196       +1     
  Lines       14702    14783      +81     
==========================================
+ Hits        12678    12726      +48     
- Misses       1723     1749      +26     
- Partials      301      308       +7     
Files Changed Coverage Δ
pkg/activator/net/revision_backends.go 92.81% <100.00%> (+0.04%) ⬆️

... and 4 files with indirect coverage changes


@nak3
Contributor

nak3 commented Aug 29, 2023

LGTM

NOTE: I can look at adding a test around this, but will have to spend some time looking at how the tests are set up.

Yes, it would be great if we could have the tests.

(Not necessary to change in this PR but #9540 added the errgroup. It seems that we should not use the errgroup for some other places as well such as pkg/autoscaler/metrics/stats_scraper.go.)

@arsenetar
Contributor Author

@nak3 After changing this one, I noticed the same pattern in the other locations you mentioned. I think they should likely be changed to match this one. I can add them to this PR if you would like.

@dprotaso
Member

dprotaso commented Aug 29, 2023

(Not necessary to change in this PR but #9540 added the errgroup. It seems that we should not use the errgroup for some other places as well such as pkg/autoscaler/metrics/stats_scraper.go.)

Let's do this in a separate PR

Yes, it would be great if we could have the tests.

💯

- Add pod IP probe tests to directly test the pod IP probing behavior
  - Add test for no-probe optimization when all healthy
  - Add test for a pod returning an error to verify it does not block /
    cancel other probes (confirmed test with prior code which fails)
- Update the fake roundtripper to support the new pod IP probe tests by
  introducing the Delay field to be used to delay the response.  Default
  handling skips the delay if not set.
- Add comment to errgroup change since this had been correct before and
  was incorrectly changed. (Additionally change to use original form.)
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 5, 2023
@arsenetar
Contributor Author

Tests have been added to cover this change. I also added a comment in the code for awareness, since this had previously been correct and was changed to incorrect behavior.

@nak3
Contributor

nak3 commented Sep 6, 2023

/lgtm

Thank you so much!

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Sep 6, 2023
@ReToCode
Member

ReToCode commented Sep 6, 2023

/retest

Member

@dprotaso dprotaso left a comment

Test looks good - just some minor stuff

@@ -137,6 +153,17 @@ func (rt *FakeRoundTripper) RT(req *http.Request) (*http.Response, error) {
		resp = defaultRequestResponse()
	}

	// Delay if set before sending response
	if resp.Delay.Seconds() != 0 {
Member

Instead of repeating this here should we just dedupe this code and put it at the beginning of the RT method?

Contributor Author

The code could be defined as a function ahead of the call, but it needs to be called after the resp value is fetched, which happens in two different code paths.

Member

@dprotaso dprotaso Sep 8, 2023

could be defined as a function

let's do that

Contributor Author

This is updated now.

pkg/activator/testing/roundtripper.go (outdated; resolved)
Update to use a function for the delay to reduce duplicate code.
@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 11, 2023
@dprotaso
Member

Cluster failed to start

finished with error: All cluster resources were brought up, but: only 11 nodes out of 12 have registered; cluster may be unhealthy.\n"

/retest

@dprotaso
Member

/lgtm
/approve

thanks @arsenetar

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Sep 11, 2023
@knative-prow

knative-prow bot commented Sep 11, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arsenetar, dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2023
@knative-prow knative-prow bot merged commit 997d8ef into knative:main Sep 12, 2023
63 checks passed
arsenetar added a commit to coreweave/serving that referenced this pull request Dec 19, 2023
- Backport activator fixes from
  knative#14303 and
  knative#14347 from 1.12
- Add custom patches for logs and probe durations
- Update to go 1.20
- Add patch from knative#14022
- Add custom CI workflows
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/autoscale area/networking lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.