
Implementation for the restore queue #128

Open

wants to merge 3 commits into master

Conversation

mpryc
Collaborator

@mpryc mpryc commented Dec 6, 2024

Adds the estimated queue to the NAB restore object

Why the changes were made

To include an estimated queue position for the restore object.

How to test the changes made

  1. Create multiple NAR objects and observe whether the queue positions are correct.
  2. Run the tests from the source: make simulation-test
$ make simulation-test
[...]
Ran 19 of 19 Specs in 23.446 seconds
SUCCESS! -- 19 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestControllers (23.45s)
coverage: 64.6% of statements
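
For context, a minimal sketch of the status shape this PR adds to the NonAdminRestore object, inferred from the tests further down this page (the exact API definitions live in the repo and may differ, including the json tags):

// Sketch only: field and type names follow the usage in the tests below.
package v1alpha1

// QueueInfo reports where the underlying Velero restore sits in the processing queue.
type QueueInfo struct {
	// EstimatedQueuePosition is the estimated position of the Velero restore in the
	// OADP namespace queue; 0 means the restore is no longer waiting.
	EstimatedQueuePosition int `json:"estimatedQueuePosition"`
}

// NonAdminRestoreStatus excerpt: only the field discussed in this PR is shown here.
type NonAdminRestoreStatus struct {
	QueueInfo *QueueInfo `json:"queueInfo,omitempty"`
}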


openshift-ci bot commented Dec 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Dec 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mpryc
Collaborator Author

mpryc commented Dec 6, 2024

@mateusoliveira43 note the file where I've moved the common types. Let me know if that is fine; currently we were mixing backup types inside the restore object: https://github.com/migtools/oadp-non-admin/pull/128/files#diff-bce3cc24a9a3be49f46a28f3a11baff2a1f07f5e735a1dbc683e535e1aaf625d

@mateusoliveira43
Contributor

@mpryc it is fine

I would change the name to something like common.go to avoid any possible conflicts with kubebuilder in the future

Adds the estimated queue to the NAB restore object

Signed-off-by: Michal Pryc <[email protected]>
Test coverage for the status queue within restore operation.

Signed-off-by: Michal Pryc <[email protected]>
mpryc added a commit to mpryc/oadp-operator that referenced this pull request Dec 10, 2024
CRD updates for the following non-admin PR:
  migtools/oadp-non-admin#128

Signed-off-by: Michal Pryc <[email protected]>
@mpryc mpryc marked this pull request as ready for review December 10, 2024 20:01
@openshift-ci openshift-ci bot requested a review from mrnold December 10, 2024 20:01
@mpryc
Collaborator Author

mpryc commented Dec 10, 2024

Blocked by: openshift/oadp-operator#1607

@mpryc mpryc changed the title from "Initial implementation for the restore queue" to "Implementation for the restore queue" Dec 10, 2024
@mateusoliveira43
Contributor

@mpryc avoid using make test; use make simulation-test instead, as described in our documentation: https://github.com/migtools/oadp-non-admin/blob/master/docs/CONTRIBUTING.md#code-quality-and-standardization

var veleroRestoreList velerov1.RestoreList
labelSelector := client.MatchingLabels{labelKey: labelValue}

if err := clientInstance.List(ctx, &veleroRestoreList, client.InNamespace(namespace), labelSelector); err != nil {
Contributor

What you want here is to get all the NAC "owned" Velero restores, right?

I suggest doing something like this:

func GetActiveOwnedVeleroRestores(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Restore, error) {
	var veleroRestoreList velerov1.RestoreList
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, &veleroRestoreList); err != nil {
		return nil, err
	}

	// Keep only restores that have not completed yet and carry valid NAC metadata
	var activeRestores []velerov1.Restore
	for _, restore := range veleroRestoreList.Items {
		if restore.Status.CompletionTimestamp == nil && CheckVeleroRestoreMetadata(restore) {
			activeRestores = append(activeRestores, restore)
		}
	}

	if len(activeRestores) == 0 {
		return nil, nil
	}

	return activeRestores, nil
}
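
For reference, the ListObjectsByLabel helper used in this suggestion is presumably a thin wrapper around the same label-selector List call quoted at the top of this thread; a minimal sketch under that assumption (the real helper in the repo may differ):

package function

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ListObjectsByLabel lists objects of the given type in a namespace that carry the
// provided label key/value pair. Sketch only; the signature is inferred from the call
// sites in this conversation.
func ListObjectsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string, objectList client.ObjectList) error {
	return clientInstance.List(ctx, objectList,
		client.InNamespace(namespace),
		client.MatchingLabels{labelKey: labelValue},
	)
}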

Collaborator Author

I want all the NAC owned restores that do not have restore.Status.CompletionTimestamp, which means they are still waiting to be handled. We compute the queue number from the list of objects that still require work from Velero; the ones that are completed (have a CompletionTimestamp) are no longer relevant.

I will change the implementation to use ListObjectsByLabel.
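
To illustrate how the queue number could be derived from that list, a rough sketch that counts the active restores created before the target one; the names here are illustrative, not the PR's exact code:

package function

import (
	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// estimateQueuePosition returns a 1-based position for the target restore among the
// active (not yet completed) Velero restores, assuming restores created earlier are
// processed earlier. Sketch only.
func estimateQueuePosition(activeRestores []velerov1.Restore, target *velerov1.Restore) int {
	position := 1
	for _, restore := range activeRestores {
		if restore.Name == target.Name && restore.Namespace == target.Namespace {
			continue // do not count the target itself
		}
		if restore.CreationTimestamp.Before(&target.CreationTimestamp) {
			position++
		}
	}
	return position
}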

Collaborator Author

Should this also be changed? I don't want this one to be any different. In this PR or a separate one?

func GetActiveVeleroBackupsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string) ([]velerov1.Backup, error) {
	var veleroBackupList velerov1.BackupList
	labelSelector := client.MatchingLabels{labelKey: labelValue}

	if err := clientInstance.List(ctx, &veleroBackupList, client.InNamespace(namespace), labelSelector); err != nil {
		return nil, err
	}

	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}

	if len(activeBackups) == 0 {
		return nil, nil
	}

	return activeBackups, nil
}

Contributor

I am OK doing that change in this PR as well

Collaborator Author

I will leave it as it is in this PR, if you agree: I changed the backup function and now the tests are failing, always showing queue 0. I don't want to spend time in this PR debugging why; the current implementation works as expected.

Collaborator Author

Here is what I checked (even without checking metadata):

func GetActiveOwnedVeleroBackups(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Backup, error) {
	veleroBackupList := &velerov1.BackupList{}

	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, veleroBackupList); err != nil {
		return nil, err
	}

	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}

	if len(activeBackups) == 0 {
		return nil, nil
	}

	return activeBackups, nil
}

Contributor

I do not think we need a new predicate and handler for the queue.

Doesn't making these changes to the current ones give the same result?

diff --git a/internal/handler/velerorestore_handler.go b/internal/handler/velerorestore_handler.go
index 515de66..ff2b23c 100644
--- a/internal/handler/velerorestore_handler.go
+++ b/internal/handler/velerorestore_handler.go
@@ -21,6 +21,7 @@ import (
 
 	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/util/workqueue"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sigs.k8s.io/controller-runtime/pkg/event"
 	"sigs.k8s.io/controller-runtime/pkg/reconcile"
 
@@ -29,7 +30,10 @@ import (
 )
 
 // VeleroRestoreHandler contains event handlers for Velero Restore objects
-type VeleroRestoreHandler struct{}
+type VeleroRestoreHandler struct {
+	Client        client.Client
+	OADPNamespace string
+}
 
 // Create event handler
 func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ workqueue.RateLimitingInterface) {
@@ -37,18 +41,42 @@ func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ wor
 }
 
 // Update event handler adds Velero Restore's NonAdminRestore to controller queue
-func (VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
+func (h VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
 	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreHandler")
 
 	annotations := evt.ObjectNew.GetAnnotations()
 	nonAdminRestoreName := annotations[constant.NarOriginNameAnnotation]
 	nonAdminRestoreNamespace := annotations[constant.NarOriginNamespaceAnnotation]
 
-	q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
-		Name:      nonAdminRestoreName,
-		Namespace: nonAdminRestoreNamespace,
-	}})
-	logger.V(1).Info("Handled Update event")
+	if function.CheckVeleroRestoreMetadata(evt.ObjectNew) {
+		q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+			Name:      nonAdminRestoreName,
+			Namespace: nonAdminRestoreNamespace,
+		}})
+		logger.V(1).Info("Handled Update event")
+	}
+
+	restores, err := function.GetActiveVeleroRestoresByLabel(ctx, h.Client, h.OADPNamespace)
+	if err != nil {
+		logger.Error(err, "Failed to get Velero Restores by label")
+		return
+	}
+
+	if restores != nil {
+		for _, restore := range restores {
+			annotations := restore.GetAnnotations()
+			originName := annotations[constant.NarOriginNameAnnotation]
+			originNamespace := annotations[constant.NarOriginNamespaceAnnotation]
+
+			if originName != nonAdminRestoreName || originNamespace != nonAdminRestoreNamespace {
+				logger.V(1).Info("Processing Queue update for the NonAdmin Restore referenced by Velero Restore", "Name", restore.Name, constant.NamespaceString, restore.Namespace, "CreatedAt", restore.CreationTimestamp)
+				q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+					Name:      originName,
+					Namespace: originNamespace,
+				}})
+			}
+		}
+	}
 }
 
 // Delete event handler
diff --git a/internal/predicate/velerorestore_predicate.go b/internal/predicate/velerorestore_predicate.go
index 8dfdae8..e63e75b 100644
--- a/internal/predicate/velerorestore_predicate.go
+++ b/internal/predicate/velerorestore_predicate.go
@@ -19,6 +19,7 @@ package predicate
 import (
 	"context"
 
+	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
 	"sigs.k8s.io/controller-runtime/pkg/event"
 
 	"github.com/migtools/oadp-non-admin/internal/common/function"
@@ -40,6 +41,12 @@ func (p VeleroRestorePredicate) Update(ctx context.Context, evt event.UpdateEven
 			logger.V(1).Info("Accepted Update event")
 			return true
 		}
+		newRestore, _ := evt.ObjectNew.(*velerov1.Restore)
+		oldRestore, _ := evt.ObjectOld.(*velerov1.Restore)
+		if oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
+			logger.V(1).Info("Accepted Update event: Restore completion timestamp")
+			return true
+		}
 	}
 
 	logger.V(1).Info("Rejected Update event")

Collaborator Author

Here I think we should stick with the handler/predicate approach, for two reasons:

  1. Be consistent with how backup works:

func (p VeleroBackupQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroBackupQueuePredicate")

	// Ensure the new and old objects are of the expected type
	newBackup, okNew := evt.ObjectNew.(*velerov1.Backup)
	oldBackup, okOld := evt.ObjectOld.(*velerov1.Backup)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Backup Update event: invalid object type")
		return false
	}

	namespace := newBackup.GetNamespace()
	if namespace == p.OADPNamespace {
		if oldBackup.Status.CompletionTimestamp == nil && newBackup.Status.CompletionTimestamp != nil {
			logger.V(1).Info("Accepted Backup Update event: new completion timestamp")
			return true
		}
	}

	logger.V(1).Info("Rejected Backup Update event: no changes to the CompletionTimestamp in the VeleroBackup object")
	return false
}

func (h VeleroBackupQueueHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {

  2. In k8s the predicate runs before the handler, so we avoid invoking reconcile in cases where we can save some calls. In the code above we only accept the Update event when the CompletionTimestamp changes from nil to a value, meaning the object is "done". Only those events matter for the queue: completed objects may have entered the queue before the ones whose queue info needs updating, while anything that completes later sits behind them in the queue and cannot affect the objects in front of it. This saves quite a few calls at the predicate level. Secondly, you are removing the annotations check from the handler, which makes reconcile responsible for it. Reconcile runs after the handler, so that results in more calls instead of fewer.
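
For reference, the restore-side analog implied by point 1 would presumably look like the following; this mirrors the backup predicate quoted above and is an assumed sketch, not necessarily the PR's exact code:

package predicate

import (
	"context"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"

	"github.com/migtools/oadp-non-admin/internal/common/function"
)

// VeleroRestoreQueuePredicate accepts only Update events where a Velero Restore in the
// OADP namespace gains a CompletionTimestamp, i.e. leaves the queue. Assumed analog of
// VeleroBackupQueuePredicate.
type VeleroRestoreQueuePredicate struct {
	OADPNamespace string
}

func (p VeleroRestoreQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreQueuePredicate")

	newRestore, okNew := evt.ObjectNew.(*velerov1.Restore)
	oldRestore, okOld := evt.ObjectOld.(*velerov1.Restore)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Restore Update event: invalid object type")
		return false
	}

	if newRestore.GetNamespace() == p.OADPNamespace &&
		oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
		logger.V(1).Info("Accepted Restore Update event: new completion timestamp")
		return true
	}

	logger.V(1).Info("Rejected Restore Update event")
	return false
}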

Contributor

I ran some tests out of a cluster, and I am afraid we need to merge them together.

If the queue predicate is triggered, both handlers are triggered, right?

If an admin backup triggers this, the Velero backup predicate will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.

Collaborator Author

I don't follow this use case.

The following handlers are triggered after the predicates:

Watches(&velerov1.Backup{}, &handler.VeleroBackupHandler{}).
Watches(&velerov1.Backup{}, &handler.VeleroBackupQueueHandler{

So both handlers are triggered based on the composite predicate, and each handler does a different job.

This one handles only the object which actually triggered the event:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53

And this one handles all the other objects whose statuses need updating:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53

See the namespace check there.

Contributor

If a backup not owned by NAC updates to completed, does that trigger the queue predicate? Yes, right?

Then both handlers are called, and velerobackup_handler.go will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.
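
A minimal guard of the kind being discussed might look like this; it is hypothetical (not the follow-up PR), assumes the NAC origin-annotation constants shown in the diff above, and would let the handler skip Velero objects that NAC does not own instead of enqueueing an empty name/namespace:

package handler

// Import path assumed from the repo layout; the backup handler would use the
// backup-side origin annotations instead of the restore-side ones shown here.
import "github.com/migtools/oadp-non-admin/internal/common/constant"

// hasNonAdminOrigin reports whether a Velero object carries the NAC origin annotations.
func hasNonAdminOrigin(annotations map[string]string) bool {
	return annotations[constant.NarOriginNameAnnotation] != "" &&
		annotations[constant.NarOriginNamespaceAnnotation] != ""
}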

Contributor

Will open a PR for that, then.

Can you wait for that one before continuing on this?

Collaborator Author

I really do prefer to have a working and similar implementation on both parts. With the test day having proven this works, I don't want to modify too much in this area at the moment, as I want to focus on #36 and then improve things if we find issues. For me this is not an issue, just a small improvement to the implementation. Relying on the reconcile is fine, IMO, at this moment.

Collaborator Author

A bit more on why I think it's really not that important to make this effort at the moment:

  1. We already have a tested and working implementation; there are possibly areas to improve, but we need to focus on other parts (sync controller, NABSL).
  2. The current flow is pretty clean. We have:
    a) predicates for each type of interesting event (the first defence/filter against unnecessary reconciles);
    b) separate handlers for Velero events that work with the current NAB object:
    q.Add(reconcile.Request{NamespacedName: types.NamespacedName{

    and with the other interested NAB objects: https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_queue_handler.go#L74-L81

The two above do not cause the same NAB to be reconciled twice.

If we want to check that a Velero object event does not trigger a reconcile on a non-existing NAB, we would need to move the check for whether the NAB really exists into the handler, which is not the way it should work. The Reconcile function is the proper place for this check, as it centralizes the logic and ensures there is only one point where the NAB's existence is verified, rather than duplicating this check in the handler.

Aren't we rewriting reconcile yet one more time?
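
As a side note, the reconcile-time existence check described above is a standard controller-runtime pattern; a sketch follows (illustrative only, the actual NonAdminRestore reconciler may differ, and the api import path is assumed):

package controller

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nacv1alpha1 "github.com/migtools/oadp-non-admin/api/v1alpha1"
)

type NonAdminRestoreReconciler struct {
	client.Client
}

func (r *NonAdminRestoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	nonAdminRestore := &nacv1alpha1.NonAdminRestore{}
	if err := r.Get(ctx, req.NamespacedName, nonAdminRestore); err != nil {
		if apierrors.IsNotFound(err) {
			// The NAR referenced by the queued request no longer exists; drop it silently.
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	// ... queue info update, status conditions, etc.
	return ctrl.Result{}, nil
}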


@mateusoliveira43 @mpryc Let's keep the restore queue work similar to what we had in the backup queue. If there are cases that are not covered by the current approach, let's create issues for those, and they can be fixed in follow-on PRs.

Collaborator Author

@mateusoliveira43 based on what @shubham-pampattiwar wrote, is there anything left for me to do to get this merged? Please let me know so I can work on it if needed.

// TODO(migi): do we need estimatedQueuePosition in VeleroRestoreStatus?
updatedQueueInfo := false

// Determine how many Backups are scheduled before the given VeleroRestore in the OADP namespace.
Contributor

typo Backups

Collaborator Author

Fixed

@@ -100,6 +100,13 @@ func checkTestNonAdminRestoreStatus(nonAdminRestore *nacv1alpha1.NonAdminRestore
return fmt.Errorf("NonAdminRestore Status Conditions [%v] Message %v does not contain expected message %v", index, nonAdminRestore.Status.Conditions[index].Message, expectedStatus.Conditions[index].Message)
}
}

Contributor

This is wrong.

If nonAdminRestore.Status.QueueInfo is nil and expectedStatus.QueueInfo is not nil, no error is raised.

This happens in the test case "Should update NonAdminRestore until it invalidates and then delete it". The QueueInfo of that NAR will be nil after reconciliation, but the expected value is QueueInfo: &nacv1alpha1.QueueInfo{EstimatedQueuePosition: 0}.

To fully compare nonAdminRestore.Status.QueueInfo and expectedStatus.QueueInfo, the code needs to be updated to something like this:

	if nonAdminRestore.Status.QueueInfo != nil {
		if expectedStatus.QueueInfo == nil {
			return fmt.Errorf("message")
		}
		if nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition != expectedStatus.QueueInfo.EstimatedQueuePosition {
			return fmt.Errorf("NonAdminRestore Status QueueInfo EstimatedQueuePosition %v is not equal to expected %v", nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition, expectedStatus.QueueInfo.EstimatedQueuePosition)
		}
	} else {
		if expectedStatus.QueueInfo != nil {
			return fmt.Errorf("message")
		}
	}

@@ -357,6 +364,9 @@ var _ = ginkgo.Describe("Test full reconcile loop of NonAdminRestore Controller"
LastTransitionTime: metav1.NewTime(time.Now()),
},
},
QueueInfo: &nacv1alpha1.QueueInfo{
Contributor

I do not know what happens if we tell Velero to restore a backup that is still in progress. I suspect it fails.

To make this test scenario more realistic, I would change VeleroBackup.Status to phase Completed, give it a CompletionTimestamp, and expect QueueInfo.EstimatedQueuePosition to be zero.
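
A sketch of that fixture change (illustrative only; the field names follow the Velero API, the helper name is made up):

package controller

import (
	"time"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markBackupCompleted flips the referenced VeleroBackup to a finished state so the
// NonAdminRestore under test restores an already completed backup.
func markBackupCompleted(backup *velerov1.Backup) {
	now := metav1.NewTime(time.Now())
	backup.Status.Phase = velerov1.BackupPhaseCompleted
	backup.Status.CompletionTimestamp = &now
}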

Contributor

There is no test for the GetRestoreQueueInfo function, right? It would be interesting to add one.

openshift-merge-bot bot pushed a commit to openshift/oadp-operator that referenced this pull request Dec 20, 2024
CRD updates for the following non-admin PR:
  migtools/oadp-non-admin#128

Signed-off-by: Michal Pryc <[email protected]>