Make maximum delay of prober in its backoff configurable #1001
Conversation
Hi @SaschaSchwarze0. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The diff hunk under review:

func init() {
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
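A minimal sketch of what such an override could look like; the package name, variable name, and the wiring into the prober's back-off are illustrative assumptions, not the actual diff:

```go
package status

import (
	"os"
	"strconv"
	"time"
)

// probeMaxRetryDelay caps the exponential back-off between probe retries;
// it keeps the previously hard-coded default of 30 seconds.
var probeMaxRetryDelay = 30 * time.Second

func init() {
	// Let operators lower (or raise) the cap without rebuilding the controller.
	if val, ok := os.LookupEnv("PROBE_MAX_RETRY_DELAY_SECONDS"); ok {
		if seconds, err := strconv.Atoi(val); err == nil && seconds > 0 {
			probeMaxRetryDelay = time.Duration(seconds) * time.Second
		}
	}
}
```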
Should we document this? We could also add it to the controller deployment config (in the net-istio repo), after:
- name: CONFIG_LOGGING_NAME
  value: config-logging
- name: CONFIG_OBSERVABILITY_NAME
  value: config-observability
- name: ENABLE_SECRET_INFORMER_FILTERING_BY_CERT_UID
  value: "false"
What about other ingresses? Are they equally affected? 🤔
Do you have any numbers on the correct value relative to the number of services? That would also be interesting to document as a recommendation.
I am not sure this number should be tied to the number of services. The number of services imo has an impact on how long - in our case - Istio needs to sync the config to all ingress gateways, and that duration also depends on things like how Istio is tuned (e.g. CPU for istiod), whether delta pushes are enabled for some parts or not, and how large the mesh is overall, for example whether ambient mode is used.
The number we are looking at here is the maximum delay of the back-off in net-istio. As with exponential back-offs in the context of retries, I assume this was chosen to avoid overloading things.
And here, at least, we do not see an overload in our case. Two components are involved. The first is the net-istio-controller itself, which originates the requests. If, for whatever reason, there were many KIngresses that it reconciles and then must probe, then yes, reducing the maximum probing delay would increase the load. But in our experience that is rarely the case and has never come close to being a bottleneck. The second, on the receiving side, is the Istio ingress gateway that the probe requests go to. There, the percentage of requests that are actually probe requests is something like 0.0...1 % (I did not look up an exact number, but it is so low that it has no relevance at all).
> What about other ingresses? Are they equally affected? 🤔

I have no experience with them.

And yes, adding it to the deployment YAML can be done as a follow-on PR.
/ok-to-test
@SaschaSchwarze0 hi, thank you for the PR.
@dprotaso any objections on this one? From a tech PoV, if it solves a problem I am fine; at the end of the day the default remains the same.
It would be nice to know how different values affect a large deployment, to have at least some evidence.
This is a screenshot from one of our larger clusters with several thousand KServices. It shows the duration between the creationTimestamp and the Ready condition of a KService; it is always the same KService with the same image. The change was rolled out at 10:04. It obviously does not fix the spikes, because those are the cases where, for example, the Pod landed on a Node that did not have the image present. But we can see how it brings down the time whenever things are actually ready but the prober previously waited a longer time before checking again.
/lgtm
/hold for @ReToCode
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: SaschaSchwarze0, skonto
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Changes
We are running a larger installation with Knative Serving + Istio + NetIstio. What we observe is that the more services there are, the longer it takes to provision a new KService. This is fine, but at some point we observe that the times increase a lot and contain jumps: e.g. we measure many values around 21 seconds and around 35 seconds, but rarely in between. The image is always available, so the time it takes to get the Configuration ready is quite stable. The differences we see are in the Route.
We think this has to do with the prober, which uses an exponential back-off with a maximum delay of 30 seconds.
The following shows how long the prober waits in total depending on how many tries it needs, ignoring the time it takes to actually perform a probe, i.e. just summing up the delays:
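The delays appear to start at 50 ms and double on each retry until they hit the 30 second cap; that is an inference that matches the 6.35 s, 12.75 s, and 25.55 s figures mentioned below. A short illustrative Go sketch (not code from this PR) that prints the cumulative wait per number of tries:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed back-off: 50 ms initial delay, doubling per retry, capped at 30 s.
	delay := 50 * time.Millisecond
	maxDelay := 30 * time.Second

	var total time.Duration
	for try := 1; try <= 11; try++ {
		total += delay
		fmt.Printf("tries: %2d  last delay: %-8v  total wait: %v\n", try, delay, total)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```

Under these assumptions, the delay before the ninth try alone is already 12.8 s, which is the kind of gap a lower maximum delay is meant to shrink.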
Those numbers match very well what we observe. For example, for those provisions that took roughly 21 seconds, the prober needed eight probes (12.75 s of delays), and when it needed nine probes (25.55 s), the overall provisioning took around 35 seconds. And yes, we also see provisions that take a little over a minute, where it probably needed ten tries to probe successfully.
--
While we obviously spend time improving this with Istio tuning, we would also like to experiment with lower values for the maximum delay, as we think that the higher delays (5 seconds and more) are reached too early, namely already after 6.35 seconds. Compared to the overall load on our system, the number of probes is so small that we think more failed probes would not cause any harm.
This PR therefore introduces a PROBE_MAX_RETRY_DELAY_SECONDS environment variable that, if set, will be used as the maximum delay. If you think this makes no sense as a general change, then just close it. If you think it should be exposed in another way, let me know.
/kind enhancement
Release Note