K8SPXC-1152: restore stucks on operator restart #1610

pooknull · 2024-01-30T13:01:09Z

https://perconadev.atlassian.net/browse/K8SPXC-1152

CHANGE DESCRIPTION

Problem:
If operator pod is restarted during a restore, it can't continue to the restore process.

Cause:
The current design of the restore process is not designed to continue on operator restart.

Solution:
We should refactor the restore code so that the operator can catch up with the current state of the restore and continue.

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?
Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are the manifests (crd/bundle) regenerated if needed?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PXC version?
Does the change support oldest and newest supported Kubernetes version?

inelpandzic · 2024-02-06T06:50:42Z

pkg/controller/pxcrestore/controller.go

 		if err != nil {
-			return rr, errors.Wrap(err, "run pitr")
+			switch err {
+			case errWaitingPods, errWaitingPVC:


Could we check these errors on the if line so we can avoid this inner switch?

egegunes · 2024-02-06T07:28:46Z

pkg/controller/pxcrestore/controller.go

+				}
+				return rr, nil
+			} else {
+				if cluster.Status.ObservedGeneration == cluster.Generation && cluster.Status.PXC.Status == api.AppStateReady {


not sure about this condition. why do we say waiting for cluster to start only if cluster.Status.PXC.Status is ready?

egegunes · 2024-02-06T07:29:50Z

pkg/controller/pxcrestore/controller.go

+	rr := reconcile.Result{
+		RequeueAfter: time.Second * 5,
+	}


honestly I'm not happy to depend on RequeueAfter but I guess there's no other way

There is a way to not depend on RequeueAfter, but it will take more time to implement. I would like to do it in a separate PR.

egegunes · 2024-02-06T07:32:43Z

pkg/controller/pxcrestore/restorer.go

+	if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {
+		if k8serrors.IsNotFound(err) {
+			initInProcess = false
+		}
 	}


Suggested change

if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {

if k8serrors.IsNotFound(err) {

initInProcess = false

}

}

if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); k8serrors.IsNotFound(err) {

initInProcess = false

}

https://perconadev.atlassian.net/browse/K8SPXC-1152 fix unit test fix for pvc restores refactor fix tests fix `security-context` test add unit-test improvements add TODO comment

JNKPercona · 2025-01-20T11:55:47Z

Test name	Status
affinity-8-0	passed
auto-tuning-8-0	passed
cross-site-8-0	passed
custom-users-8-0	passed
demand-backup-cloud-8-0	passed
demand-backup-encrypted-with-tls-8-0	passed
demand-backup-8-0	passed
haproxy-5-7	passed
haproxy-8-0	passed
init-deploy-5-7	passed
init-deploy-8-0	passed
limits-8-0	passed
monitoring-2-0-8-0	passed
one-pod-5-7	passed
one-pod-8-0	passed
pitr-8-0	passed
pitr-gap-errors-8-0	passed
proxy-protocol-8-0	passed
proxysql-sidecar-res-limits-8-0	passed
pvc-resize-5-7	passed
pvc-resize-8-0	passed
recreate-8-0	passed
restore-to-encrypted-cluster-8-0	passed
scaling-proxysql-8-0	passed
scaling-8-0	passed
scheduled-backup-5-7	passed
scheduled-backup-8-0	passed
security-context-8-0	passed
smart-update1-8-0	passed
smart-update2-8-0	passed
storage-8-0	passed
tls-issue-cert-manager-ref-8-0	passed
tls-issue-cert-manager-8-0	passed
tls-issue-self-8-0	passed
upgrade-consistency-8-0	passed
upgrade-haproxy-5-7	passed
upgrade-haproxy-8-0	passed
upgrade-proxysql-5-7	passed
upgrade-proxysql-8-0	passed
users-5-7	passed
users-8-0	passed
validation-hook-8-0	passed
We run 42 out of 42

commit: 190e857
image: perconalab/percona-xtradb-cluster-operator:PR-1610-190e8579

egegunes · 2025-01-21T10:30:09Z

pkg/controller/pxcrestore/controller.go

+				paused, err := k8s.PauseCluster(ctx, r.client, cluster)
+				if err != nil {
+					return rr, errors.Wrapf(err, "stop cluster %s", cluster.Name)
+				}
+				if !paused {
+					log.Info("waiting for cluster to stop", "cluster", cr.Spec.PXCCluster, "msg", err.Error())
+					return rr, nil
+				}
+			*/


do we need this commented code?

egegunes · 2025-01-21T10:30:49Z

pkg/controller/pxcrestore/controller.go

-		var oldHAProxySize int32
-		if cluster.Spec.HAProxy != nil {
-			oldHAProxySize = cluster.Spec.HAProxy.Size
+	switch statusState {


maybe we should move code for each case to a separate function? this would make code more readable

gkech · 2025-01-21T08:23:00Z

pkg/controller/pxcrestore/controller.go

-	if err != nil {
-		err = errors.Wrapf(err, "get cluster %s", cr.Spec.PXCCluster)
+	if otherRestore != nil {
+		err = errors.Errorf("unable to continue, concurent restore job %s running now", otherRestore.Name)


typo: concurent -> concurrent

gkech · 2025-01-21T11:29:44Z

pkg/controller/pxcrestore/util.go

+	rJobsList := &api.PerconaXtraDBClusterRestoreList{}
+	err := cl.List(
+		ctx,
+		rJobsList,
+		&client.ListOptions{
+			Namespace: cr.Namespace,
+		},
+	)
+	if err != nil {
+		return nil, errors.Wrap(err, "get restore jobs list")
+	}


We can write this like that:

rJobsList := &api.PerconaXtraDBClusterRestoreList{} if err := cl.List( ctx, rJobsList, &client.ListOptions{ Namespace: cr.Namespace, }, ); err != nil { return nil, errors.Wrap(err, "get restore jobs list") }

Since the same error is not used after that context

gkech · 2025-01-21T11:31:10Z

pkg/controller/pxcrestore/util.go

+
+func setStatus(ctx context.Context, cl client.Client, cr *api.PerconaXtraDBClusterRestore, state api.BcpRestoreStates, comments string) error {
+	cr.Status.State = state
+	switch state {


Is this swtich here really needed since are handling only one case?

gkech · 2025-01-21T11:34:00Z

pkg/controller/pxcrestore/util.go

+	bcp := &api.PerconaXtraDBClusterBackup{}
+	err := cl.Get(ctx, types.NamespacedName{Name: cr.Spec.BackupName, Namespace: cr.Namespace}, bcp)
+	if err != nil {
+		err = errors.Wrapf(err, "get backup %s", cr.Spec.BackupName)
+		return bcp, err
+	}


Similar to other cases, we can use short statement if style we use throughout the codebase since this error here is not and should not be used in another part of the function.

pull-request-size bot added the size/XXL 1000+ lines label Jan 30, 2024

pooknull marked this pull request as ready for review February 5, 2024 13:56

pooknull requested review from tplavcic, nmarukovich, ptankov, hors, egegunes and inelpandzic as code owners February 5, 2024 13:56

inelpandzic previously approved these changes Feb 6, 2024

View reviewed changes

egegunes reviewed Feb 6, 2024

View reviewed changes

pooknull dismissed inelpandzic’s stale review via 7b5a420 February 7, 2024 12:21

pooknull requested review from inelpandzic and egegunes February 7, 2024 12:36

inelpandzic previously approved these changes Feb 13, 2024

View reviewed changes

K8SPXC-1152: restore stucks on operator restart

df69c14

https://perconadev.atlassian.net/browse/K8SPXC-1152 fix unit test fix for pvc restores refactor fix tests fix `security-context` test add unit-test improvements add TODO comment

pooknull dismissed inelpandzic’s stale review via df69c14 January 9, 2025 10:58

pooknull force-pushed the dev/K8SPXC-1152 branch from f86ff81 to df69c14 Compare January 9, 2025 10:58

pooknull requested review from jvpasinatto and eleo007 as code owners January 9, 2025 10:58

small fixes

2b735d7

pooknull marked this pull request as draft January 9, 2025 13:02

pooknull added 4 commits January 13, 2025 16:07

fix

f1a7a1d

fix

673e00b

fix unit-test

44f96ce

Merge branch 'main' into dev/K8SPXC-1152

190e857

pooknull marked this pull request as ready for review January 20, 2025 09:21

pooknull requested a review from gkech as a code owner January 20, 2025 09:21

egegunes added this to the v1.17.0 milestone Jan 21, 2025

egegunes requested changes Jan 21, 2025

View reviewed changes

gkech reviewed Jan 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

K8SPXC-1152: restore stucks on operator restart #1610

K8SPXC-1152: restore stucks on operator restart #1610

pooknull commented Jan 30, 2024 •

edited

Loading

inelpandzic Feb 6, 2024

pooknull Feb 7, 2024

egegunes Feb 6, 2024

pooknull Feb 7, 2024

egegunes Feb 6, 2024

pooknull Feb 7, 2024

egegunes Feb 6, 2024

pooknull Feb 7, 2024

JNKPercona commented Jan 20, 2025

egegunes Jan 21, 2025

egegunes Jan 21, 2025

gkech Jan 21, 2025

gkech Jan 21, 2025

gkech Jan 21, 2025

gkech Jan 21, 2025 •

edited

Loading

K8SPXC-1152: restore stucks on operator restart #1610

Are you sure you want to change the base?

K8SPXC-1152: restore stucks on operator restart #1610

Conversation

pooknull commented Jan 30, 2024 • edited Loading

CHANGE DESCRIPTION

CHECKLIST

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JNKPercona commented Jan 20, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gkech Jan 21, 2025 • edited Loading

Choose a reason for hiding this comment

pooknull commented Jan 30, 2024 •

edited

Loading

gkech Jan 21, 2025 •

edited

Loading