Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: tls policy status to state of the world tasks #885

Merged
merged 16 commits into from
Oct 9, 2024

Conversation

KevFan
Copy link
Contributor

@KevFan KevFan commented Sep 27, 2024

Description

Closes: #824

  • Refactor the TLSPolicy status code to workflow tasks
  • Added a new MissingDependency status of TLS Policy for if CertManager is not installed

Verification

  • Passing integration tests should be sufficient to ensure current functionality around TLSPolicy Status still works as intended
  • To verify new MissingDependency TLSPolicy status
    • Create cluster without CertManager and run operator
    make local-env-setup
    # Delete CertManager CRDs to minic cert manager is uninstalled
    kubectl delete crd certificaterequests.cert-manager.io 
    kubectl delete crd certificates.cert-manager.io
    kubectl delete crd challenges.acme.cert-manager.io
    kubectl delete crd clusterissuers.cert-manager.io
    kubectl delete crd issuers.cert-manager.io
    kubectl delete crd orders.acme.cert-manager.io
    # Run operator
    make run
    
    • Create kuadrant-system namespace and tls policiy
    # Create TLSPolicy
    kubectl apply -f - <<EOF
    apiVersion: kuadrant.io/v1alpha1
    kind: TLSPolicy
    metadata:
      name: gw-tls
    spec:
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: istio-ingressgateway
      issuerRef:
        group: cert-manager.io
        kind: ClusterIssuer
        name: selfsigned-issuer
    EOF
    • Get TLSPolicy status
    kubectl get tlspolicy -A -o yaml | yq '.items[].status'
    • TLSPolicy status should be the following:
    conditions:
    - lastTransitionTime: "2024-10-02T13:34:59Z"
      message: Cert Manager is not installed, please restart pod once dependency has been installed
      reason: MissingDependency
      status: "False"
      type: Accepted
    observedGeneration: 1

Comment on lines 194 to 298
if errors.IsAlreadyExists(err) {
// This error can happen when the operator is starting, and the create event for the topology has not being processed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... We may expect more of these, and not only with this particular resource.

This is because, as a pattern, we check the presence of the resource in the topology, and not with the API server. But more importantly, because, even though it's "state of the world" reconciliation and therefore the controller already knows about all resources that exist (because it lists them all), the library watching for changes to the cache writes to the subscription channel resource by resource.

I think we should consider fixing this in the Policy Machinery instead, because now it may be defeating the purpose of "state of the world". We may be able to make the call to cache.Replace truly atomic – assuming it won't break with the granularity of multiple event types. I have a couple of ideas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe Kuadrant/policy-machinery#35 may have fixed this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still happens with the same error:

{"level":"error","ts":"2024-10-02T08:56:41+01:00","logger":"kuadrant-operator","msg":"reconciliation error","error":"configmaps \"topology\" already exists","stacktrace":"github.com/kuadrant/policy-machinery/controller.(*Controller).propagate\n\t/Users/chfan/dev/kuadrant-operator/vendor/github.com/kuadrant/policy-machinery/controller/controller.go:262\ngithub.com/kuadrant/policy-machinery/controller.(*Controller).subscribe.func2\n\t/Users/chfan/dev/kuadrant-operator/vendor/github.com/kuadrant/policy-machinery/controller/controller.go:312"}
{"level":"info","ts":"2024-10-02T08:56:41+01:00","logger":"kuadrant-operator.controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2024-10-02T08:56:41+01:00","logger":"kuadrant-operator","msg":"starting server","name":"health probe","addr":"[::]:8081"}

What looks to happens from debugging is that this code is ran on boot even when there has been no events and when the topology has been loaded with the cluster data yet 🤔

My guess is that since this is a task with no event matchers, the initial creation of the empty cache store is an event that causes these task to run 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the initial creation of the empty cache store is an event that causes these task to run

Oh! This makes total sense and explains why it's triggering before the controller finishes starting. By simply creating the cache we may be generating events already. Nice catch, @KevFan!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that somewhere here, when the snapshot contains zero updates, then we just continue?

Another option is, if len(events) == 0 { continue } here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KevFan KevFan force-pushed the issues/824 branch 4 times, most recently from 4c5cef5 to 84f8a86 Compare October 2, 2024 09:16
Comment on lines +231 to +257
controller.WithPredicates(ctrlruntimepredicate.TypedFuncs[*certmanagerv1.Certificate]{
CreateFunc: func(e event.TypedCreateEvent[*certmanagerv1.Certificate]) bool {
return isCertificateOwnedByTLSPolicy(e.Object)
},
UpdateFunc: func(e event.TypedUpdateEvent[*certmanagerv1.Certificate]) bool {
return isCertificateOwnedByTLSPolicy(e.ObjectNew)
},
DeleteFunc: func(e event.TypedDeleteEvent[*certmanagerv1.Certificate]) bool {
return isCertificateOwnedByTLSPolicy(e.Object)
},
GenericFunc: func(e event.TypedGenericEvent[*certmanagerv1.Certificate]) bool {
return isCertificateOwnedByTLSPolicy(e.Object)
},
})),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting approach. I wonder how it performs compared to labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I went with ownerReference instead since we already set the ownerRef to point to TLSPolicy since we want the associated Certificates deleted when the TLSPolicy is deleted.

Also we don't currently set a common label across all created Certificate's

func commonTLSCertificateLabels(gwKey client.ObjectKey, p *v1alpha1.TLSPolicy) map[string]string {
common := map[string]string{}
for k, v := range policyTLSCertificateLabels(p) {
common[k] = v
}
for k, v := range gatewayTLSCertificateLabels(gwKey) {
common[k] = v
}
return common
}
func policyTLSCertificateLabels(p *v1alpha1.TLSPolicy) map[string]string {
return map[string]string{
p.DirectReferenceAnnotationName(): p.Name,
fmt.Sprintf("%s-namespace", p.DirectReferenceAnnotationName()): p.Namespace,
}
}
func gatewayTLSCertificateLabels(gwKey client.ObjectKey) map[string]string {
return map[string]string{
"gateway-namespace": gwKey.Namespace,
"gateway": gwKey.Name,
}
}

@KevFan KevFan self-assigned this Oct 2, 2024
@KevFan KevFan added kind/enhancement New feature or request size/medium labels Oct 2, 2024
@KevFan KevFan marked this pull request as ready for review October 2, 2024 13:47
@KevFan KevFan force-pushed the issues/824 branch 3 times, most recently from c83f916 to 5e2d529 Compare October 4, 2024 08:20
Comment on lines 72 to 69
// TODO: This should be only one target ref for now, but what should happen if multiple target refs is supported in the future?
targetRefs := policy.GetTargetRefs()
for _, targetRef := range targetRefs {
// Find gateway defined by target ref
_, ok := lo.Find(gws, func(item *machinery.Gateway) bool {
if item.GetName() == targetRef.GetName() && item.GetNamespace() == targetRef.GetNamespace() {
return true
}
return false
})

// Can't find gateway target ref
if !ok {
logger.V(1).Info("tls policy cannot find target ref", "name", policy.Name, "namespace", policy.Namespace)
s.Store(TLSPolicyAcceptedKey(policy.GetUID()), kuadrant.NewErrTargetNotFound(policy.Kind(), policy.GetTargetRef(), apierrors.NewNotFound(kuadrantv1alpha1.TLSPoliciesResource.GroupResource(), policy.GetName())))
continue
}

logger.V(1).Info("tls policy found target ref", "name", policy.Name, "namespace", policy.Namespace)
s.Store(TLSPolicyAcceptedKey(policy.GetUID()), nil)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, we may not need to find the object the policy points to.

Because policies are linked to their targets already, a policy p where len(p.GetTargetRefs()) > 0 && len(topology.Targetables().Children(p)) == 0 cannot be valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, what I'm suggesting won't give enough details to report in the message which target cannot be found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool, yeah that's a nicer way of knowing whether a policy can't find their targetRef, I'll update 👍

Copy link
Contributor Author

@KevFan KevFan Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(p.GetTargetRefs()) != len(topology.Targetables().Children(p)) is probably what we expect instead? 🤔
I think this is fine as we currently don't support multiple target refs but this may not work if we want to report which is not found as you've said, and also not sure what the policy status should be in the case where at least one target ref in the list is not found 🤔

if err := t.isCertificatesReady(ctx, tlsPolicy, topology); err != nil {
return kuadrant.EnforcedCondition(tlsPolicy, kuadrant.NewErrUnknown(tlsPolicy.Kind(), err), false)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLS is an interesting case. If the gateway is not Programmed, it doesn't mean we cannot provision the certificates for its listeners. Is that why you didn't include the status of the gateway here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, a gateway can transition to a Programmed state from the certs created by the current tls policy controller.


store, ok := s.Load(TLSPolicyAcceptedKey)
if !ok {
logger.Error(errors.New("TLSPolicyAcceptedKey not found, skipping update of tls policy statuses"), "sync map error")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok == false should never happen, given both reconcilers subscribe to the exact same event matchers, unless the validator panics half-way. Good call logging the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should never happen. This can also happen if there is a difference in the subscriptions between the validator and status updater in the future

Copy link
Contributor

@guicassolato guicassolato Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's how I'm solving for RateLimitPolicy the issue of updating the status after having previously validated the policy vs. when the policy validator is skipped but the status updater still has to run (e.g. due to different event matchers):

policyAcceptedFunc := rateLimitPolicyAcceptedStatusFunc(state)

}

// Nothing to do
equalStatus := equality.Semantic.DeepEqual(newStatus, policy.Status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be careful to never allow unexported fields in the TLSPolicyStatus type, otherwise this line will start panicking.

Copy link
Contributor

@guicassolato guicassolato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it, but the code LGTM.

I've left a comment about a pattern for picking up state from previous tasks vs falling back to the topology state, but I'd recommend considering another PR for it.

@KevFan KevFan merged commit adb1362 into Kuadrant:main Oct 9, 2024
24 checks passed
@mikenairn mikenairn mentioned this pull request Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement New feature or request size/medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[state-of-the-world reconciler] TLSPolicy status
3 participants