Added support for reusing the webhook TLS certificate across different deployments to prevent cases where operator takes too long to start up #560

orishoshan · 2025-02-12T15:26:23Z

Description

Before this PR, the intents operator recreated its webhook TLS certificate on every startup. During a rollout, the old and new operator instances would therefore "fight" over which certificate was configured on the cluster. This resolved itself after a short while, once the old instance went down. However, it produced many errors in the log and, in rare cases, a startup time of over 200 seconds and multiple restarts before the operator was finally healthy.

This PR makes the operator reuse the certificate, so the old and new instances both use the same one. Importantly, this means the new operator can start WATCHing resources immediately. Previously, when the new operator switched the webhook certificate, it could not watch ClientIntents: ClientIntents have a webhook set up on them, and switching the certificate temporarily broke that webhook. This was the root cause of the slow startup: the operator had to retry watching until its own certificate was the one configured for the webhook, and because this was a race condition with retries, in rare situations it could take a long time. Reusing the certificate prevents this situation altogether. In addition, the operator now reports Ready only once it has synced its cache, indicating that reconciliation will be functional immediately.
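The reuse behavior described above can be illustrated with a minimal sketch. This is not the code from this PR: `certStore`, `getOrCreate`, and `generateSelfSigned` are hypothetical names, and the in-memory store is only a stand-in for the shared Secret that both operator instances would read. The point is the "read first, generate only if missing" order, which lets a second instance pick up the certificate the first one already installed.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"sync"
	"time"
)

// certStore is a hypothetical in-memory stand-in for the shared Secret
// that holds the webhook certificate and key.
type certStore struct {
	mu      sync.Mutex
	certPEM []byte
	keyPEM  []byte
}

// getOrCreate returns the stored certificate if one exists and only
// generates a new one when the store is empty. created reports whether
// this call generated the certificate.
func (s *certStore) getOrCreate(dnsName string) (certPEM, keyPEM []byte, created bool, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Reuse path: an earlier instance already stored a certificate.
	if len(s.certPEM) > 0 && len(s.keyPEM) > 0 {
		return s.certPEM, s.keyPEM, false, nil
	}

	// First start: generate and store a certificate for later instances.
	certPEM, keyPEM, err = generateSelfSigned(dnsName)
	if err != nil {
		return nil, nil, false, err
	}
	s.certPEM, s.keyPEM = certPEM, keyPEM
	return certPEM, keyPEM, true, nil
}

// generateSelfSigned creates a self-signed serving certificate for dnsName
// and returns the PEM-encoded certificate and private key.
func generateSelfSigned(dnsName string) ([]byte, []byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: dnsName},
		DNSNames:              []string{dnsName},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(365 * 24 * time.Hour), // validity period is an arbitrary choice for this sketch
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}
```

Calling `getOrCreate` twice on the same store models an old and a new instance starting against the same Secret: the second call returns the same bytes with `created == false`. Against a real cluster the mutex would not be enough, since two instances could both find the Secret empty; a real implementation would also need to handle the case where creation fails because the other instance got there first.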

References

otterize/helm-charts#282

Added support for reusing the webhook TLS certificate across differen…

2731a96

…t deployments to prevent cases where operator takes too long to start up

orishoshan requested a review from omris94 February 12, 2025 15:26

orishoshan mentioned this pull request Feb 12, 2025

Added support for reusing the webhook TLS certificate across different deployments to prevent cases where operator takes too long to start up otterize/helm-charts#282

Merged

omris94 approved these changes Feb 12, 2025

orishoshan added 2 commits February 12, 2025 18:06

update

aa2b644

update

8440f5f

orishoshan enabled auto-merge (squash) February 13, 2025 08:15

omris94 added 4 commits February 13, 2025 17:24

fixup

4ada6d4

fixup

c62b656

fixup

37c4603

fixup

3726764

omris94 mentioned this pull request Feb 13, 2025

Move intents-operator-webhook-secret.yaml to the release namespace otterize/helm-charts#283

Merged

1 task

orishoshan added 2 commits February 13, 2025 20:24

update

44a700a

Merge branch 'orisho/intents_operator_shared_webhook_secret' of https…

f53697e

…://github.com/otterize/intents-operator into orisho/intents_operator_shared_webhook_secret

orishoshan merged commit 3b82951 into main Feb 13, 2025
20 checks passed

orishoshan deleted the orisho/intents_operator_shared_webhook_secret branch February 13, 2025 18:41

github-actions bot locked and limited conversation to collaborators Feb 13, 2025
