Make maximum delay of prober in its backoff configurable #1001
Conversation
Hi @SaschaSchwarze0. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The diff hunk under review:

func init() {
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
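A minimal sketch of what such an override could look like; the package name, variable name, and the wiring into the prober's back-off are illustrative assumptions, not the actual diff:

```go
package status

import (
	"os"
	"strconv"
	"time"
)

// probeMaxRetryDelay caps the exponential back-off between probe retries;
// it keeps the previously hard-coded default of 30 seconds.
var probeMaxRetryDelay = 30 * time.Second

func init() {
	// Let operators lower (or raise) the cap without rebuilding the controller.
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
		if seconds, err := strconv.Atoi(val); err == nil && seconds > 0 {
			probeMaxRetryDelay = time.Duration(seconds) * time.Second
		}
	}
}
```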
Should we document this? We could also add it to the controller deployment config (in the net-istio repo), after:
- name: CONFIG_LOGGING_NAME
  value: config-logging
- name: CONFIG_OBSERVABILITY_NAME
  value: config-observability
- name: ENABLE_SECRET_INFORMER_FILTERING_BY_CERT_UID
  value: "false"
What about other ingresses? Are they equally affected? 🤔
Do you have any numbers on the correct value relative to the number of services? That would also be interesting to document as a recommendation.
I am not sure this number should be tied to the number of services. The number of services imo has an impact on how long - in our case - Istio needs to sync the config to all ingress gateways, and that duration also depends on things like how Istio is tuned (e.g. CPU for istiod), whether delta pushes are enabled for some parts or not, and how large the mesh is overall, for example whether ambient mode is used.
The number we are looking at here is the maximum delay of the back-off in net-istio. As with exponential back-offs in the context of retries, I assume this was chosen to avoid overloading things.
And here, at least, we do not see an overload in our case. Two components are involved. The first is the net-istio-controller itself, which originates the requests. If, for whatever reason, there were many KIngresses that it reconciles and then must probe, then yes, reducing the maximum probing delay would increase the load. But in our experience that is rarely the case and has never come close to being a bottleneck. The second, on the receiving side, is the Istio ingress gateway that the probe requests go to. There, the percentage of requests that are actually probe requests is something like 0.0...1 % (I did not look up an exact number, but it is so low that it has no relevance at all).
> What about other ingresses? Are they equally affected? 🤔

I have no experience with them.

And yes, adding it to the deployment YAML can be done as a follow-on PR.
/ok-to-test
@SaschaSchwarze0 hi, thank you for the PR.
@dprotaso any objections on this one? From a tech PoV, if it solves a problem I am fine; at the end of the day the default remains the same.
It would be nice to know how different values affect a large deployment, to have at least some evidence.
This is a screenshot from one of our larger clusters with several thousand KServices. It shows the duration between the creationTimestamp and the Ready condition of a KService; it is always the same KService with the same image. The change was rolled out at 10:04. It obviously does not fix the spikes, because those are the cases where, for example, the Pod landed on a Node that did not have the image present. But we can see how it brings down the time whenever things are actually ready but the prober previously waited a longer time before checking again.
/lgtm
/hold for @ReToCode
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: SaschaSchwarze0, skonto
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Changes
We are running a larger installation with Knative Serving + Istio + NetIstio. What we observe is that the more services there are, the longer it takes to provision a new KService. This is fine, but at some point we observe that the times increase a lot and contain jumps: e.g. we measure many values around 21 seconds and around 35 seconds, but rarely in between. The image is always available, so the time it takes to get the Configuration ready is quite stable. The differences we see are in the Route.
We think this has to do with the prober, which uses an exponential back-off with a maximum delay of 30 seconds.
The following shows how long the prober waits in total depending on how many tries it needs, ignoring the time it takes to actually perform a probe, i.e. just summing up the delays:
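The delays appear to start at 50 ms and double on each retry until they hit the 30 second cap; that is an inference that matches the 6.35 s, 12.75 s, and 25.55 s figures mentioned below. A short illustrative Go sketch (not code from this PR) that prints the cumulative wait per number of tries:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed back-off: 50 ms initial delay, doubling per retry, capped at 30 s.
	delay := 50 * time.Millisecond
	maxDelay := 30 * time.Second

	var total time.Duration
	for try := 1; try <= 11; try++ {
		total += delay
		fmt.Printf("tries: %2d  last delay: %-8v  total wait: %v\n", try, delay, total)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

Under these assumptions, the delay before the ninth try alone is already 12.8 s, which is the kind of gap a lower maximum delay is meant to shrink.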
Those numbers match very well what we observe. For example, for those provisions that took roughly 21 seconds, the prober needed eight probes (12.75 s of delays), and when it needed nine probes (25.55 s), the overall provisioning took around 35 seconds. And yes, we also see provisions that take a little over a minute, where it probably needed ten tries to probe successfully.
--
While we obviously spend time improving this with Istio tuning, we would also like to experiment with lower values for the maximum delay, as we think that the higher delays (5 seconds and more) are reached too early, namely already after 6.35 seconds. Compared to the overall load on our system, the number of probes is so small that we think more failed probes would not cause any harm.
This PR therefore introduces a PROBE_MAX_RETRY_DELAY_SECONDS environment variable that, if set, will be used as the maximum delay. If you think this makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.
/kind enhancement
Release Note