
Implementation for the restore queue #128

Open

wants to merge 3 commits into master

Conversation

mpryc
Collaborator

@mpryc mpryc commented Dec 6, 2024

Adds the estimated queue to the NAB restore object

Why the changes were made

To include an estimated queue position for the restore object.

How to test the changes made

  1. Create multiple NAR objects and observe whether the queue positions are correct.
  2. Run the tests from the source: make simulation-test
$ make simulation-test
[...]
Ran 19 of 19 Specs in 23.446 seconds
SUCCESS! -- 19 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestControllers (23.45s)
coverage: 64.6% of statements
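
For context, a minimal sketch of the status shape this PR adds to the NonAdminRestore object, inferred from the tests further down this page (the exact API definitions live in the repo and may differ, including the json tags):

// Sketch only: field and type names follow the usage in the tests below.
package v1alpha1

// QueueInfo reports where the underlying Velero restore sits in the processing queue.
type QueueInfo struct {
	// EstimatedQueuePosition is the estimated position of the Velero restore in the
	// OADP namespace queue; 0 means the restore is no longer waiting.
	EstimatedQueuePosition int `json:"estimatedQueuePosition"`
}

// NonAdminRestoreStatus excerpt: only the field discussed in this PR is shown here.
type NonAdminRestoreStatus struct {
	QueueInfo *QueueInfo `json:"queueInfo,omitempty"`
}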


openshift-ci bot commented Dec 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Dec 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mpryc
Collaborator Author

mpryc commented Dec 6, 2024

@mateusoliveira43 note the file where I've moved the common types. Let me know if that is fine; currently we were mixing backup types inside the restore object: https://github.com/migtools/oadp-non-admin/pull/128/files#diff-bce3cc24a9a3be49f46a28f3a11baff2a1f07f5e735a1dbc683e535e1aaf625d

@mateusoliveira43
Contributor

@mpryc it is fine

I would change the name to something like common.go to avoid any possible conflicts with kubebuilder in the future

Adds the estimated queue to the NAB restore object

Signed-off-by: Michal Pryc <[email protected]>
Test coverage for the status queue within restore operation.

Signed-off-by: Michal Pryc <[email protected]>
mpryc added a commit to mpryc/oadp-operator that referenced this pull request Dec 10, 2024
CRD updates for the following non-admin PR:
  migtools/oadp-non-admin#128

Signed-off-by: Michal Pryc <[email protected]>
@mpryc mpryc marked this pull request as ready for review December 10, 2024 20:01
@openshift-ci openshift-ci bot requested a review from mrnold December 10, 2024 20:01
@mpryc
Collaborator Author

mpryc commented Dec 10, 2024

Blocked by: openshift/oadp-operator#1607

@mpryc mpryc changed the title from "Initial implementation for the restore queue" to "Implementation for the restore queue" Dec 10, 2024
@mateusoliveira43
Contributor

@mpryc avoid using make test; use make simulation-test instead, as described in our documentation: https://github.com/migtools/oadp-non-admin/blob/master/docs/CONTRIBUTING.md#code-quality-and-standardization

var veleroRestoreList velerov1.RestoreList
labelSelector := client.MatchingLabels{labelKey: labelValue}

if err := clientInstance.List(ctx, &veleroRestoreList, client.InNamespace(namespace), labelSelector); err != nil {
Contributor

What you want here is to get all the NAC "owned" Velero restores, right?

I suggest doing something like this:

func GetActiveOwnedVeleroRestores(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Restore, error) {
	var veleroRestoreList velerov1.RestoreList
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, &veleroRestoreList); err != nil {
		return nil, err
	}

	// Keep only restores that have not completed yet and carry valid NAC metadata
	var activeRestores []velerov1.Restore
	for _, restore := range veleroRestoreList.Items {
		if restore.Status.CompletionTimestamp == nil && CheckVeleroRestoreMetadata(restore) {
			activeRestores = append(activeRestores, restore)
		}
	}

	if len(activeRestores) == 0 {
		return nil, nil
	}

	return activeRestores, nil
}
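
For reference, the ListObjectsByLabel helper used in this suggestion is presumably a thin wrapper around the same label-selector List call quoted at the top of this thread; a minimal sketch under that assumption (the real helper in the repo may differ):

package function

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ListObjectsByLabel lists objects of the given type in a namespace that carry the
// provided label key/value pair. Sketch only; the signature is inferred from the call
// sites in this conversation.
func ListObjectsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string, objectList client.ObjectList) error {
	return clientInstance.List(ctx, objectList,
		client.InNamespace(namespace),
		client.MatchingLabels{labelKey: labelValue},
	)
}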

Collaborator Author

I want all the NAC owned restores that do not have restore.Status.CompletionTimestamp, which means they are still waiting to be handled. We compute the queue number from the list of objects that still require work from Velero; the ones that are completed (have a CompletionTimestamp) are no longer relevant.

I will change the implementation to use ListObjectsByLabel.
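
To illustrate how the queue number could be derived from that list, a rough sketch that counts the active restores created before the target one; the names here are illustrative, not the PR's exact code:

package function

import (
	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// estimateQueuePosition returns a 1-based position for the target restore among the
// active (not yet completed) Velero restores, assuming restores created earlier are
// processed earlier. Sketch only.
func estimateQueuePosition(activeRestores []velerov1.Restore, target *velerov1.Restore) int {
	position := 1
	for _, restore := range activeRestores {
		if restore.Name == target.Name && restore.Namespace == target.Namespace {
			continue // do not count the target itself
		}
		if restore.CreationTimestamp.Before(&target.CreationTimestamp) {
			position++
		}
	}
	return position
}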

Collaborator Author

Should this also be changed? I don't want this one to be any different. In this PR or a separate one?

func GetActiveVeleroBackupsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string) ([]velerov1.Backup, error) {
	var veleroBackupList velerov1.BackupList
	labelSelector := client.MatchingLabels{labelKey: labelValue}

	if err := clientInstance.List(ctx, &veleroBackupList, client.InNamespace(namespace), labelSelector); err != nil {
		return nil, err
	}

	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}

	if len(activeBackups) == 0 {
		return nil, nil
	}

	return activeBackups, nil
}

Contributor

I am OK doing that change in this PR as well

Collaborator Author

I will leave it as it is in this PR, if you agree: I changed the backup function and now the tests are failing, always showing queue 0. I don't want to spend time in this PR debugging why; the current implementation works as expected.

Collaborator Author

Here is what I checked (even without checking metadata):

func GetActiveOwnedVeleroBackups(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Backup, error) {
	veleroBackupList := &velerov1.BackupList{}

	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, veleroBackupList); err != nil {
		return nil, err
	}

	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}

	if len(activeBackups) == 0 {
		return nil, nil
	}

	return activeBackups, nil
}

Contributor

I do not think we need a new predicate and handler for the queue.

Doesn't making these changes to the current ones give the same result?

diff --git a/internal/handler/velerorestore_handler.go b/internal/handler/velerorestore_handler.go
index 515de66..ff2b23c 100644
--- a/internal/handler/velerorestore_handler.go
+++ b/internal/handler/velerorestore_handler.go
@@ -21,6 +21,7 @@ import (
 
 	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/util/workqueue"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sigs.k8s.io/controller-runtime/pkg/event"
 	"sigs.k8s.io/controller-runtime/pkg/reconcile"
 
@@ -29,7 +30,10 @@ import (
 )
 
 // VeleroRestoreHandler contains event handlers for Velero Restore objects
-type VeleroRestoreHandler struct{}
+type VeleroRestoreHandler struct {
+	Client        client.Client
+	OADPNamespace string
+}
 
 // Create event handler
 func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ workqueue.RateLimitingInterface) {
@@ -37,18 +41,42 @@ func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ wor
 }
 
 // Update event handler adds Velero Restore's NonAdminRestore to controller queue
-func (VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
+func (h VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
 	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreHandler")
 
 	annotations := evt.ObjectNew.GetAnnotations()
 	nonAdminRestoreName := annotations[constant.NarOriginNameAnnotation]
 	nonAdminRestoreNamespace := annotations[constant.NarOriginNamespaceAnnotation]
 
-	q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
-		Name:      nonAdminRestoreName,
-		Namespace: nonAdminRestoreNamespace,
-	}})
-	logger.V(1).Info("Handled Update event")
+	if function.CheckVeleroRestoreMetadata(evt.ObjectNew) {
+		q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+			Name:      nonAdminRestoreName,
+			Namespace: nonAdminRestoreNamespace,
+		}})
+		logger.V(1).Info("Handled Update event")
+	}
+
+	restores, err := function.GetActiveVeleroRestoresByLabel(ctx, h.Client, h.OADPNamespace)
+	if err != nil {
+		logger.Error(err, "Failed to get Velero Restores by label")
+		return
+	}
+
+	if restores != nil {
+		for _, restore := range restores {
+			annotations := restore.GetAnnotations()
+			originName := annotations[constant.NarOriginNameAnnotation]
+			originNamespace := annotations[constant.NarOriginNamespaceAnnotation]
+
+			if originName != nonAdminRestoreName || originNamespace != nonAdminRestoreNamespace {
+				logger.V(1).Info("Processing Queue update for the NonAdmin Restore referenced by Velero Restore", "Name", restore.Name, constant.NamespaceString, restore.Namespace, "CreatedAt", restore.CreationTimestamp)
+				q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+					Name:      originName,
+					Namespace: originNamespace,
+				}})
+			}
+		}
+	}
 }
 
 // Delete event handler
diff --git a/internal/predicate/velerorestore_predicate.go b/internal/predicate/velerorestore_predicate.go
index 8dfdae8..e63e75b 100644
--- a/internal/predicate/velerorestore_predicate.go
+++ b/internal/predicate/velerorestore_predicate.go
@@ -19,6 +19,7 @@ package predicate
 import (
 	"context"
 
+	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
 	"sigs.k8s.io/controller-runtime/pkg/event"
 
 	"github.com/migtools/oadp-non-admin/internal/common/function"
@@ -40,6 +41,12 @@ func (p VeleroRestorePredicate) Update(ctx context.Context, evt event.UpdateEven
 			logger.V(1).Info("Accepted Update event")
 			return true
 		}
+		newRestore, _ := evt.ObjectNew.(*velerov1.Restore)
+		oldRestore, _ := evt.ObjectOld.(*velerov1.Restore)
+		if oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
+			logger.V(1).Info("Accepted Update event: Restore completion timestamp")
+			return true
+		}
 	}
 
 	logger.V(1).Info("Rejected Update event")

Collaborator Author

Here I think we should stick with the handler/predicate approach, for two reasons:

  1. Be consistent with how backup works:

func (p VeleroBackupQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroBackupQueuePredicate")

	// Ensure the new and old objects are of the expected type
	newBackup, okNew := evt.ObjectNew.(*velerov1.Backup)
	oldBackup, okOld := evt.ObjectOld.(*velerov1.Backup)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Backup Update event: invalid object type")
		return false
	}

	namespace := newBackup.GetNamespace()
	if namespace == p.OADPNamespace {
		if oldBackup.Status.CompletionTimestamp == nil && newBackup.Status.CompletionTimestamp != nil {
			logger.V(1).Info("Accepted Backup Update event: new completion timestamp")
			return true
		}
	}

	logger.V(1).Info("Rejected Backup Update event: no changes to the CompletionTimestamp in the VeleroBackup object")
	return false
}

func (h VeleroBackupQueueHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {

  2. In k8s the predicate runs before the handler, so we avoid invoking reconcile in cases where we can save some calls. In the code above we only accept the Update event when the CompletionTimestamp changes from nil to a value, meaning the object is "done". Only those events matter for the queue: completed objects may have entered the queue before the ones whose queue info needs updating, while anything that completes later sits behind them in the queue and cannot affect the objects in front of it. This saves quite a few calls at the predicate level. Secondly, you are removing the annotations check from the handler, which makes reconcile responsible for it. Reconcile runs after the handler, so that results in more calls instead of fewer.
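
For reference, the restore-side analog implied by point 1 would presumably look like the following; this mirrors the backup predicate quoted above and is an assumed sketch, not necessarily the PR's exact code:

package predicate

import (
	"context"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"

	"github.com/migtools/oadp-non-admin/internal/common/function"
)

// VeleroRestoreQueuePredicate accepts only Update events where a Velero Restore in the
// OADP namespace gains a CompletionTimestamp, i.e. leaves the queue. Assumed analog of
// VeleroBackupQueuePredicate.
type VeleroRestoreQueuePredicate struct {
	OADPNamespace string
}

func (p VeleroRestoreQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreQueuePredicate")

	newRestore, okNew := evt.ObjectNew.(*velerov1.Restore)
	oldRestore, okOld := evt.ObjectOld.(*velerov1.Restore)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Restore Update event: invalid object type")
		return false
	}

	if newRestore.GetNamespace() == p.OADPNamespace &&
		oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
		logger.V(1).Info("Accepted Restore Update event: new completion timestamp")
		return true
	}

	logger.V(1).Info("Rejected Restore Update event")
	return false
}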

Contributor

I ran some tests out of a cluster, and I am afraid we need to merge them together.

If the queue predicate is triggered, both handlers are triggered, right?

If an admin backup triggers this, the Velero backup predicate will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.

Collaborator Author

I don't follow this use case.

The following handlers are triggered after the predicates:

Watches(&velerov1.Backup{}, &handler.VeleroBackupHandler{}).
Watches(&velerov1.Backup{}, &handler.VeleroBackupQueueHandler{

So both handlers are triggered based on the composite predicate, and each handler does a different job.

This one handles only the object which actually triggered the event:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53

And this one handles all the other objects whose statuses need updating:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53

See the namespace check there.

Contributor

If a backup not owned by NAC updates to completed, does that trigger the queue predicate? Yes, right?

Then both handlers are called, and velerobackup_handler.go will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.
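
A minimal guard of the kind being discussed might look like this; it is hypothetical (not the follow-up PR), assumes the NAC origin-annotation constants shown in the diff above, and would let the handler skip Velero objects that NAC does not own instead of enqueueing an empty name/namespace:

package handler

// Import path assumed from the repo layout; the backup handler would use the
// backup-side origin annotations instead of the restore-side ones shown here.
import "github.com/migtools/oadp-non-admin/internal/common/constant"

// hasNonAdminOrigin reports whether a Velero object carries the NAC origin annotations.
func hasNonAdminOrigin(annotations map[string]string) bool {
	return annotations[constant.NarOriginNameAnnotation] != "" &&
		annotations[constant.NarOriginNamespaceAnnotation] != ""
}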

Contributor

Will open a PR for that, then.

Can you wait for that one before continuing on this?

Collaborator Author

I really do prefer to have a working and similar implementation on both parts. With the test day having proven this works, I don't want to modify too much in this area at the moment, as I want to focus on #36 and then improve things if we find issues. For me this is not an issue, just a small improvement to the implementation. Relying on the reconcile is fine, IMO, at this moment.

Collaborator Author

A bit more on why I think it's really not that important to make this effort at the moment:

  1. We already have a tested and working implementation; there are possibly areas to improve, but we need to focus on other parts (sync controller, NABSL).
  2. The current flow is pretty clean. We have:
    a) predicates for each type of interesting event (the first defence/filter against unnecessary reconciles);
    b) separate handlers for Velero events that work with the current NAB object:
    q.Add(reconcile.Request{NamespacedName: types.NamespacedName{

    and with the other interested NAB objects: https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_queue_handler.go#L74-L81

The two above do not cause the same NAB to be reconciled twice.

If we want to check that a Velero object event does not trigger a reconcile on a non-existing NAB, we would need to move the check for whether the NAB really exists into the handler, which is not the way it should work. The Reconcile function is the proper place for this check, as it centralizes the logic and ensures there is only one point where the NAB's existence is verified, rather than duplicating this check in the handler.

Aren't we rewriting reconcile yet one more time?
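
As a side note, the reconcile-time existence check described above is a standard controller-runtime pattern; a sketch follows (illustrative only, the actual NonAdminRestore reconciler may differ, and the api import path is assumed):

package controller

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nacv1alpha1 "github.com/migtools/oadp-non-admin/api/v1alpha1"
)

type NonAdminRestoreReconciler struct {
	client.Client
}

func (r *NonAdminRestoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	nonAdminRestore := &nacv1alpha1.NonAdminRestore{}
	if err := r.Get(ctx, req.NamespacedName, nonAdminRestore); err != nil {
		if apierrors.IsNotFound(err) {
			// The NAR referenced by the queued request no longer exists; drop it silently.
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	// ... queue info update, status conditions, etc.
	return ctrl.Result{}, nil
}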


@mateusoliveira43 @mpryc Let's keep the restore queue work similar to what we had in the backup queue. If there are cases that are not covered by the current approach, let's create issues for those, and they can be fixed in follow-on PRs.

Collaborator Author

@mateusoliveira43 based on what @shubham-pampattiwar wrote, is there anything left for me to do to get this merged? Please let me know so I can work on it if needed.

// TODO(migi): do we need estimatedQueuePosition in VeleroRestoreStatus?
updatedQueueInfo := false

// Determine how many Backups are scheduled before the given VeleroRestore in the OADP namespace.
Contributor

typo Backups

Collaborator Author

Fixed

@@ -100,6 +100,13 @@ func checkTestNonAdminRestoreStatus(nonAdminRestore *nacv1alpha1.NonAdminRestore
return fmt.Errorf("NonAdminRestore Status Conditions [%v] Message %v does not contain expected message %v", index, nonAdminRestore.Status.Conditions[index].Message, expectedStatus.Conditions[index].Message)
}
}

Contributor

This is wrong.

If nonAdminRestore.Status.QueueInfo is nil and expectedStatus.QueueInfo is not nil, no error is raised.

This happens in the test case "Should update NonAdminRestore until it invalidates and then delete it". The QueueInfo of that NAR will be nil after reconciliation, but the expected value is QueueInfo: &nacv1alpha1.QueueInfo{EstimatedQueuePosition: 0}.

To fully compare nonAdminRestore.Status.QueueInfo and expectedStatus.QueueInfo, the code needs to be updated to something like this:

	if nonAdminRestore.Status.QueueInfo != nil {
		if expectedStatus.QueueInfo == nil {
			return fmt.Errorf("message")
		}
		if nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition != expectedStatus.QueueInfo.EstimatedQueuePosition {
			return fmt.Errorf("NonAdminRestore Status QueueInfo EstimatedQueuePosition %v is not equal to expected %v", nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition, expectedStatus.QueueInfo.EstimatedQueuePosition)
		}
	} else {
		if expectedStatus.QueueInfo != nil {
			return fmt.Errorf("message")
		}
	}

@@ -357,6 +364,9 @@ var _ = ginkgo.Describe("Test full reconcile loop of NonAdminRestore Controller"
LastTransitionTime: metav1.NewTime(time.Now()),
},
},
QueueInfo: &nacv1alpha1.QueueInfo{
Contributor

I do not know what happens if we tell Velero to restore a backup that is still in progress. I suspect it fails.

To make this test scenario more realistic, I would change VeleroBackup.Status to phase Completed, give it a CompletionTimestamp, and expect QueueInfo.EstimatedQueuePosition to be zero.
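
A sketch of that fixture change (illustrative only; the field names follow the Velero API, the helper name is made up):

package controller

import (
	"time"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markBackupCompleted flips the referenced VeleroBackup to a finished state so the
// NonAdminRestore under test restores an already completed backup.
func markBackupCompleted(backup *velerov1.Backup) {
	now := metav1.NewTime(time.Now())
	backup.Status.Phase = velerov1.BackupPhaseCompleted
	backup.Status.CompletionTimestamp = &now
}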

Contributor

There is no test for the GetRestoreQueueInfo function, right? It would be interesting to add one.

openshift-merge-bot bot pushed a commit to openshift/oadp-operator that referenced this pull request Dec 20, 2024
CRD updates for the following non-admin PR:
  migtools/oadp-non-admin#128

Signed-off-by: Michal Pryc <[email protected]>