
Make maximum delay of prober in its backoff configurable #1001

Open · SaschaSchwarze0 wants to merge 1 commit into base: main

Conversation


@SaschaSchwarze0 SaschaSchwarze0 commented Aug 20, 2024

Changes

We are running a larger installation with Knative Serving + Istio + NetIstio. What we observe is that the more services there are, the longer it takes to provision a new KService. This is fine, but at some point the times increase a lot and contain jumps. For example, we measure many values around 21 seconds and around 35 seconds, but rarely in between. The image is always available, so the time it takes to get the Configuration ready is quite stable; the differences show up in the Route.

We think this has to do with the prober, which uses an exponential back-off with a maximum delay of 30 seconds.

The following table shows how long the prober takes depending on how many tries it needs, ignoring the time it takes to actually perform the probe itself, i.e. just summing up the delays:

| Try | Back-off (s) | Cumulative delay (s) |
|-----|--------------|----------------------|
| 0   | 0            | 0                    |
| 1   | 0.05         | 0.05                 |
| 2   | 0.1          | 0.15                 |
| 3   | 0.2          | 0.35                 |
| 4   | 0.4          | 0.75                 |
| 5   | 0.8          | 1.55                 |
| 6   | 1.6          | 3.15                 |
| 7   | 3.2          | 6.35                 |
| 8   | 6.4          | 12.75                |
| 9   | 12.8         | 25.55                |
| 10  | 25.6         | 51.15                |
| 11  | 30           | 81.15                |
| 12  | 30           | 111.15               |

Those numbers match what we observe very well. For example, provisions that took roughly 21 seconds needed eight probes (12.75 s of delay), and when nine probes were needed (25.55 s) the overall provisioning took around 35 seconds. And yes, we also see provisions that take a little over a minute, where it probably needed ten tries to probe successfully.
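
For reference, the sums in the table can be reproduced with a small standalone sketch. This is not the actual prober code; it only assumes what the table implies: the first probe fires immediately, retries start at 50 ms, double each time, and are capped at a maximum delay (30 s today).

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed parameters derived from the table above: 50 ms initial retry
	// delay, doubling factor, 30 s cap (the current default maximum).
	initial := 50 * time.Millisecond
	maxDelay := 30 * time.Second

	delay := initial
	var sum time.Duration
	fmt.Println("try  0: back-off  0.00s, cumulative   0.00s") // first probe is immediate
	for try := 1; try <= 12; try++ {
		sum += delay
		fmt.Printf("try %2d: back-off %5.2fs, cumulative %6.2fs\n",
			try, delay.Seconds(), sum.Seconds())
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

Lowering the cap only changes the rows where the back-off would otherwise exceed it; the early retries stay the same.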

--

While we obviously spend time improving this with Istio tuning, we would also like to experiment with lower values for the maximum delay. We think the higher delays (5 seconds and more) are reached too early, namely already after 6.35 seconds, and compared to the overall load in our system the number of probes is so small that more failed probes would not cause any harm.

This PR therefore introduces a PROBE_MAX_RETRY_DELAY_SECONDS environment variable that, if set, is used as the maximum delay.
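
Judging from the diff excerpt quoted further down, the change reads the variable during package initialization with os.LookupEnv. The following is only a rough sketch of that pattern; the variable name maxRetryDelay, the integer parsing, and the validation are my assumptions, not necessarily what the PR does.

```go
import (
	"os"
	"strconv"
	"time"
)

// maxRetryDelay is illustrative; 30 seconds is the current default maximum
// back-off delay of the prober.
var maxRetryDelay = 30 * time.Second

func init() {
	// If PROBE_MAX_RETRY_DELAY_SECONDS is set to a positive integer, use it
	// as the maximum delay instead of the default.
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
		if seconds, err := strconv.Atoi(val); err == nil && seconds > 0 {
			maxRetryDelay = time.Duration(seconds) * time.Second
		}
	}
}
```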

If you think that makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.

/kind enhancement

Release Note

You can now set a PROBE_MAX_RETRY_DELAY_SECONDS environment variable on your networking component to define a custom maximum retry delay for the prober, which is used instead of the default of 30 seconds.

@knative-prow knative-prow bot requested review from ReToCode and skonto August 20, 2024 11:39
@knative-prow knative-prow bot added the size/S (changes 10-29 lines, ignoring generated files) and needs-ok-to-test (requires an org member to verify it is safe to test) labels, Aug 20, 2024

knative-prow bot commented Aug 20, 2024

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

```go
)

func init() {
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
```
@skonto skonto (Contributor) commented Sep 5, 2024

Should we document this? We could add it to the controller deployment config (net-istio repo) as well, after:

        - name: CONFIG_LOGGING_NAME
          value: config-logging
        - name: CONFIG_OBSERVABILITY_NAME
          value: config-observability
        - name: ENABLE_SECRET_INFORMER_FILTERING_BY_CERT_UID
          value: "false"

What about other ingresses? Are they equally affected? 🤔

skonto (Contributor):

Do you have any numbers about the correct value per number of services? That would also be interesting to document as a recommendation.

SaschaSchwarze0 (Author):

I am not sure this number should be related to the number of services. The number of services, in my opinion, has an impact on how long Istio will need, in our case, to get the config synced to all ingress gateways. That in turn also depends on things like how Istio is tuned (e.g. CPU for istiod), whether delta pushes are enabled for some parts or not, how large the mesh is, whether ambient mode is used, and so on.

The number we look at here is the maximum back-off delay in net-istio. As with exponential back-offs in the context of retries, I assume this was chosen with overload protection in mind.

And this is where we, at least, do not see any overload in our case. The two involved components are the net-istio-controller itself, which originates the requests, and, on the receiving side, the Istio ingress gateway that receives the probe requests. If for whatever reason there were many KIngresses that the controller reconciles and then must probe, then yes, reducing the maximum probing delay increases the load. But in our experience that is rarely the case and has never been anywhere near a bottleneck. On the gateway side, the percentage of requests that are actually probing requests is something like 0.0...1 % (I have not looked up an exact number, but it is so low that it has no relevance at all).

SaschaSchwarze0 (Author):

> What about other ingresses? Are they equally affected? 🤔

I have no experience with them.

And yes, adding it to the deployment YAML can be done as a follow-on PR.

skonto (Contributor) commented Sep 5, 2024

/ok-to-test

@knative-prow knative-prow bot added the ok-to-test label (non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label, Sep 5, 2024
skonto (Contributor) commented Sep 5, 2024

@SaschaSchwarze0 hi, thank you for the PR.

> If you think that makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.

@dprotaso any objections to this one? From a technical point of view, if it solves a problem I am fine with it; at the end of the day the default remains the same.

> We think this has to do with the prober, which uses an exponential back-off with a maximum delay of 30 seconds.

It would be nice to know how different values affect the large deployment to have at least some evidence.

SaschaSchwarze0 (Author) commented Sep 5, 2024

> It would be nice to know how different values affect the large deployment to have at least some evidence.

This is a screenshot from one of our larger clusters with several thousand KServices. It plots the duration between the creationTimestamp and the ready condition of a KService. It is always the same KService with the same image.

At 10:04, PROBE_MAX_RETRY_DELAY_SECONDS was set to 5 (before that it was not set, which means it was 30).

[Screenshot: time from KService creation to readiness, before and after the change at 10:04]

The change obviously does not fix the spikes, because those happen when, for example, the Pod landed on a Node that did not have the image present. But it does bring down the time in the cases where things were actually ready but the prober previously waited a long while before checking again.

SaschaSchwarze0 (Author):

@dprotaso @skonto just noticed this did not move forward. Anything else you need from me?

skonto (Contributor) commented Nov 15, 2024

/lgtm
/approve

/hold for @ReToCode

@knative-prow knative-prow bot added the do-not-merge/hold label (the PR should not merge because someone issued a /hold command), Nov 15, 2024
@knative-prow knative-prow bot added the lgtm label (the PR is ready to be merged), Nov 15, 2024

knative-prow bot commented Nov 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SaschaSchwarze0, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved label (the PR has been approved by an approver from all required OWNERS files), Nov 15, 2024
Labels: approved, do-not-merge/hold, kind/enhancement, lgtm, ok-to-test, size/S