K8SPXC-1152: restore stucks on operator restart #1610

Open
wants to merge 6 commits into
base: main
from

Conversation

Contributor

@pooknull pooknull commented Jan 30, 2024

https://perconadev.atlassian.net/browse/K8SPXC-1152

CHANGE DESCRIPTION

Problem:
If the operator pod is restarted during a restore, it can't continue the restore process.

Cause:
The restore process is not currently designed to resume after an operator restart.

Solution:
We should refactor the restore code so that the operator can catch up with the current state of the restore and continue.
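
As a rough illustration of that idea (not the code in this PR), the sketch below drives a restore purely from a state persisted on the object, so a freshly started process can pick up where the previous one stopped. The `Restore` type, the state names and the `reconcile` function are all hypothetical stand-ins.

```go
package main

import "fmt"

// RestoreState is a hypothetical stand-in for the state stored in the
// restore object's status; the names are illustrative only.
type RestoreState string

const (
	StateNew          RestoreState = ""
	StateStopCluster  RestoreState = "StoppingCluster"
	StateRestore      RestoreState = "Restoring"
	StateStartCluster RestoreState = "StartingCluster"
	StateSucceeded    RestoreState = "Succeeded"
)

// Restore models the custom resource: only Status survives a restart.
type Restore struct {
	Status RestoreState
}

// reconcile performs at most one step, chosen solely from the persisted
// state, and records the next state. Nothing is kept in memory between calls.
func reconcile(r *Restore) (done bool) {
	switch r.Status {
	case StateNew:
		r.Status = StateStopCluster
	case StateStopCluster:
		r.Status = StateRestore
	case StateRestore:
		r.Status = StateStartCluster
	case StateStartCluster:
		r.Status = StateSucceeded
	case StateSucceeded:
		return true
	}
	return false
}

func main() {
	r := &Restore{}

	// The first "operator process" handles two reconcile calls, then exits.
	reconcile(r)
	reconcile(r)
	fmt.Println("state at restart:", r.Status)

	// A fresh process sees only r.Status and carries on from there.
	for !reconcile(r) {
	}
	fmt.Println("final state:", r.Status)
}
```

Because each call decides what to do from the stored state alone, it does not matter which process makes the next call.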

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are the manifests (crd/bundle) regenerated if needed?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label Jan 30, 2024
@pooknull pooknull marked this pull request as ready for review February 5, 2024 13:56
inelpandzic previously approved these changes Feb 6, 2024
if err != nil {
return rr, errors.Wrap(err, "run pitr")
switch err {
case errWaitingPods, errWaitingPVC:
Contributor

Could we check these errors on the if line so we can avoid this inner switch?

}
return rr, nil
} else {
if cluster.Status.ObservedGeneration == cluster.Generation && cluster.Status.PXC.Status == api.AppStateReady {
Contributor

not sure about this condition. why do we say waiting for cluster to start only if cluster.Status.PXC.Status is ready?

Comment on lines 92 to 96
rr := reconcile.Result{
RequeueAfter: time.Second * 5,
}
Contributor

honestly I'm not happy to depend on RequeueAfter but I guess there's no other way

Contributor Author

There is a way to not depend on RequeueAfter, but it will take more time to implement. I would like to do it in a separate PR.

Comment on lines 144 to 148
if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {
if k8serrors.IsNotFound(err) {
initInProcess = false
}
}
Contributor

Suggested change
- if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {
- 	if k8serrors.IsNotFound(err) {
- 		initInProcess = false
- 	}
- }
+ if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); k8serrors.IsNotFound(err) {
+ 	initInProcess = false
+ }

inelpandzic previously approved these changes Feb 13, 2024
https://perconadev.atlassian.net/browse/K8SPXC-1152

fix unit test

fix for pvc restores

refactor

fix tests

fix `security-context` test

add unit-test

improvements

add TODO comment
@pooknull pooknull marked this pull request as draft January 9, 2025 13:02
@pooknull pooknull marked this pull request as ready for review January 20, 2025 09:21
@pooknull pooknull requested a review from gkech as a code owner January 20, 2025 09:21
@JNKPercona
Collaborator

Test name Status
affinity-8-0 passed
auto-tuning-8-0 passed
cross-site-8-0 passed
custom-users-8-0 passed
demand-backup-cloud-8-0 passed
demand-backup-encrypted-with-tls-8-0 passed
demand-backup-8-0 passed
haproxy-5-7 passed
haproxy-8-0 passed
init-deploy-5-7 passed
init-deploy-8-0 passed
limits-8-0 passed
monitoring-2-0-8-0 passed
one-pod-5-7 passed
one-pod-8-0 passed
pitr-8-0 passed
pitr-gap-errors-8-0 passed
proxy-protocol-8-0 passed
proxysql-sidecar-res-limits-8-0 passed
pvc-resize-5-7 passed
pvc-resize-8-0 passed
recreate-8-0 passed
restore-to-encrypted-cluster-8-0 passed
scaling-proxysql-8-0 passed
scaling-8-0 passed
scheduled-backup-5-7 passed
scheduled-backup-8-0 passed
security-context-8-0 passed
smart-update1-8-0 passed
smart-update2-8-0 passed
storage-8-0 passed
tls-issue-cert-manager-ref-8-0 passed
tls-issue-cert-manager-8-0 passed
tls-issue-self-8-0 passed
upgrade-consistency-8-0 passed
upgrade-haproxy-5-7 passed
upgrade-haproxy-8-0 passed
upgrade-proxysql-5-7 passed
upgrade-proxysql-8-0 passed
users-5-7 passed
users-8-0 passed
validation-hook-8-0 passed
We run 42 out of 42

commit: 190e857
image: perconalab/percona-xtradb-cluster-operator:PR-1610-190e8579

@egegunes egegunes added this to the v1.17.0 milestone Jan 21, 2025
Comment on lines +213 to +221
paused, err := k8s.PauseCluster(ctx, r.client, cluster)
if err != nil {
return rr, errors.Wrapf(err, "stop cluster %s", cluster.Name)
}
if !paused {
log.Info("waiting for cluster to stop", "cluster", cr.Spec.PXCCluster, "msg", err.Error())
return rr, nil
}
*/
Contributor

do we need this commented code?

var oldHAProxySize int32
if cluster.Spec.HAProxy != nil {
oldHAProxySize = cluster.Spec.HAProxy.Size
switch statusState {
Contributor

maybe we should move code for each case to a separate function? this would make code more readable

if err != nil {
err = errors.Wrapf(err, "get cluster %s", cr.Spec.PXCCluster)
if otherRestore != nil {
err = errors.Errorf("unable to continue, concurent restore job %s running now", otherRestore.Name)
Contributor

typo: concurent -> concurrent

Comment on lines +69 to +79
rJobsList := &api.PerconaXtraDBClusterRestoreList{}
err := cl.List(
ctx,
rJobsList,
&client.ListOptions{
Namespace: cr.Namespace,
},
)
if err != nil {
return nil, errors.Wrap(err, "get restore jobs list")
}
Contributor

We can write it like this:

	rJobsList := &api.PerconaXtraDBClusterRestoreList{}
	if err := cl.List(
		ctx,
		rJobsList,
		&client.ListOptions{
			Namespace: cr.Namespace,
		},
	); err != nil {
		return nil, errors.Wrap(err, "get restore jobs list")
	}

since the error is not used outside that scope.


func setStatus(ctx context.Context, cl client.Client, cr *api.PerconaXtraDBClusterRestore, state api.BcpRestoreStates, comments string) error {
cr.Status.State = state
switch state {
Contributor

Is this switch here really needed, since we are handling only one case?

Comment on lines +36 to +41
bcp := &api.PerconaXtraDBClusterBackup{}
err := cl.Get(ctx, types.NamespacedName{Name: cr.Spec.BackupName, Namespace: cr.Namespace}, bcp)
if err != nil {
err = errors.Wrapf(err, "get backup %s", cr.Spec.BackupName)
return bcp, err
}
Contributor

@gkech gkech Jan 21, 2025

Similar to the other cases, we can use the short-statement `if` style that we use throughout the codebase, since this error is not, and should not be, used in another part of the function.

5 participants