Added support for reusing the webhook TLS certificate across different deployments to prevent cases where operator takes too long to start up #560
Description
Before this PR, the intents operator recreated its webhook TLS certificate on every startup. During a rollout, this meant that the old and new instances of the operator would "fight", each trying to keep its own certificate as the one configured on the cluster. This resolved itself after a short while, once the old instance went down. However, it resulted in many errors in the log and, in rare cases, a startup time of over 200 seconds and multiple restarts before the operator was finally healthy.
This PR makes the operator reuse the certificate, so the old and new instances both use the same one. Importantly, this means that the new operator can start WATCHing resources immediately. Previously, when the new operator switched the webhook certificate, the webhook would break temporarily, and because ClientIntents have a webhook set up on them, the operator could not watch ClientIntents in the meantime. This was the root cause of the slow startup: the operator had to retry watching until it was set up as the webhook, and because this is a race condition with retries, in rare situations it could take a long time. Reusing the certificate prevents this situation altogether. In addition, the operator now becomes Ready only once it has been able to sync its cache, indicating that reconciliation will be functional immediately.
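The reuse-or-create decision described above can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the code from this PR: the `certPair` type, the `generateSelfSigned`, `canReuse`, and `ensureCert` helpers, the one-year validity, and the 24-hour minimum remaining lifetime are all hypothetical. Loading and storing the pair (for example in a Kubernetes Secret) is assumed to happen elsewhere.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// certPair is a hypothetical container for a PEM-encoded certificate and key.
type certPair struct {
	CertPEM []byte
	KeyPEM  []byte
}

// generateSelfSigned creates a fresh self-signed serving certificate for dnsName.
func generateSelfSigned(dnsName string, validFor time.Duration) (certPair, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return certPair{}, fmt.Errorf("generate key: %w", err)
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 62))
	if err != nil {
		return certPair{}, fmt.Errorf("generate serial: %w", err)
	}
	serial.Add(serial, big.NewInt(1)) // keep the serial number strictly positive
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: dnsName},
		DNSNames:              []string{dnsName},
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.Add(validFor),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return certPair{}, fmt.Errorf("create certificate: %w", err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return certPair{}, fmt.Errorf("marshal key: %w", err)
	}
	return certPair{
		CertPEM: pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}),
		KeyPEM:  pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER}),
	}, nil
}

// canReuse reports whether a stored pair is still usable for dnsName: the key
// matches the certificate, the certificate is valid now and for at least
// minRemaining longer, and it covers the expected DNS name.
func canReuse(stored certPair, dnsName string, now time.Time, minRemaining time.Duration) bool {
	pair, err := tls.X509KeyPair(stored.CertPEM, stored.KeyPEM)
	if err != nil {
		return false
	}
	cert, err := x509.ParseCertificate(pair.Certificate[0])
	if err != nil {
		return false
	}
	if now.Before(cert.NotBefore) || now.Add(minRemaining).After(cert.NotAfter) {
		return false
	}
	return cert.VerifyHostname(dnsName) == nil
}

// ensureCert returns the stored pair when it can be reused, so that old and new
// instances keep serving the same certificate; otherwise it generates a new one.
// The boolean result is true when the stored pair was reused.
func ensureCert(stored *certPair, dnsName string) (certPair, bool, error) {
	if stored != nil && canReuse(*stored, dnsName, time.Now(), 24*time.Hour) {
		return *stored, true, nil
	}
	fresh, err := generateSelfSigned(dnsName, 365*24*time.Hour)
	return fresh, false, err
}
```

In this sketch, a second call to `ensureCert` with the pair returned by the first call (and the same DNS name) hands back the same certificate, which is what lets two instances coexist during a rollout without overwriting each other's webhook configuration.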
References
otterize/helm-charts#282