The operator will wait forever for kube-apiserver to update in case the domain or apicert in the CR is replaced #100

Open
eranco74 opened this issue Sep 12, 2023 · 2 comments

@eranco74
Contributor

I updated the domain and apiCert of an existing clusterrelocation CR.

The operator keeps logging this:

2023-09-12T15:17:00Z	INFO	controllers/clusterrelocation_controller.go:398	Waiting for kube-apiserver to update	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "daa97d73-2824-4080-8eea-c8431e4c665a"}
2023-09-12T15:17:10Z	INFO	controllers/clusterrelocation_controller.go:398	Waiting for kube-apiserver to update	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "daa97d73-2824-4080-8eea-c8431e4c665a"}

It seems that the operator will not move to progressing because there's no need to update the kube-apiserver deployment (the secret name stays the same...),
so it just hangs here:

func WaitForCO(ctx context.Context, c client.Client, logger logr.Logger, operator string) error {

Expected Behavior

I expected the update to work. I guess the check we have here is good enough; after that we can just wait for the operator status to be available. Why do we need to wait for progressing?

Current Behavior

The operator is stuck waiting for the apiserver to update (move to progressing=true) although it's already updated...

Possible Solution

Steps to Reproduce (for bugs)

Context

Regression

Your Environment

  • Version used (cluster-relocation-operator):
  • Environment name and version (e.g. OCP v1.12.20):
  • Server type and version:
  • Operating System and version (uname -a):
  • Link to your deployment file:
@loganmc10
Contributor

I'm not sure how it could get stuck. This is the code for that function:

func WaitForCO(ctx context.Context, c client.Client, logger logr.Logger, operator string) error {
	logger.Info(fmt.Sprintf("Waiting for %s Progressing to be %s", operator, configv1.ConditionFalse))
	if err := waitStatus(ctx, c, logger, operator, configv1.OperatorProgressing, configv1.ConditionFalse); err != nil {
		return err
	}

	logger.Info(fmt.Sprintf("Waiting for %s Available to be %s", operator, configv1.ConditionTrue))
	if err := waitStatus(ctx, c, logger, operator, configv1.OperatorAvailable, configv1.ConditionTrue); err != nil {
		return err
	}
	return nil
}

It waits for OperatorProgressing to be False and for OperatorAvailable to be True. It doesn't wait for Progressing to become True, so if the operator is good, it should return right away. Are you sure that the kube-apiserver operator was reporting the desired status?

@loganmc10
Contributor

I think it is actually getting stuck here:

	for _, v := range urls {
		updated := false
		for {
			conn, err := tls.Dial("tcp", v["url"], &tls.Config{InsecureSkipVerify: true})
			if err != nil {
				return err
			}
			certs := conn.ConnectionState().PeerCertificates
			conn.Close()
			for _, cert := range certs {
				if cert.Subject.CommonName == v["commonName"] {
					updated = true
				}
			}
			if updated {
				// ensure that ClusterOperator has settled
				if err := util.WaitForCO(ctx, r.Client, logger, v["type"]); err != nil {
					return err
				}
				break
			} else {
				logger.Info(fmt.Sprintf("Waiting for %s to update", v["type"]))
				time.Sleep(time.Second * 10)
			}
		}
	}

It is waiting for a certificate with a commonName of api.newDomain; whatever certificate you're using doesn't have a commonName that matches that name. The default API cert that comes with the cluster has this
