Added support for reusing the webhook TLS certificate across different deployments to prevent cases where operator takes too long to start up #560

Merged
merged 9 commits into from
Feb 13, 2025


orishoshan
Collaborator

@orishoshan orishoshan commented Feb 12, 2025

Description

Before this PR, the intents operator recreated its webhook TLS certificate on every startup. During a rollout, the old and new operator instances would therefore "fight" over whose certificate was configured on the cluster. This resolved itself shortly afterwards, once the old instance went down. However, it produced many errors in the log and, in rare cases, a startup time of over 200 seconds and multiple restarts before the operator was finally healthy.

This PR makes the operator reuse the certificate, so the old and new instances both use the same one. Importantly, this means the new operator can start WATCHing resources immediately. Previously, when the new operator switched the webhook certificate, it temporarily broke the webhook, and because ClientIntents have a webhook set up on them, the operator could not watch ClientIntents in the meantime. This was the root cause of the slow startup: the operator had to retry the watch until it was established as the webhook, and since this is a race condition with retries, in rare situations it could take a long time. This PR prevents that situation altogether. In addition, the operator now reports Ready only once it has synced its cache, indicating that reconciliation will be functional immediately.

References

otterize/helm-charts#282

…t deployments to prevent cases where operator takes too long to start up
@orishoshan orishoshan enabled auto-merge (squash) February 13, 2025 08:15
@orishoshan orishoshan merged commit 3b82951 into main Feb 13, 2025
20 checks passed
@orishoshan orishoshan deleted the orisho/intents_operator_shared_webhook_secret branch February 13, 2025 18:41
@github-actions github-actions bot locked and limited conversation to collaborators Feb 13, 2025