Implementation for the restore queue #128
base: master
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mpryc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@mateusoliveira43 note the file where I've moved the common types. Let me know if that is fine; currently we were mixing backup types inside the restore object: https://github.com/migtools/oadp-non-admin/pull/128/files#diff-bce3cc24a9a3be49f46a28f3a11baff2a1f07f5e735a1dbc683e535e1aaf625d
@mpryc it is fine. I would change the name to something like
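If the shared type in question is the queue status used in this PR, a minimal sketch could look like this (the field name comes from the test snippets later in this thread; the final type name and package are exactly what is being discussed above):

// Sketch only: a queue-status type shared by NonAdminBackup and NonAdminRestore statuses.
// The actual name and location are whatever the rename above settles on.
type QueueInfo struct {
	// EstimatedQueuePosition is the object's estimated position in the Velero work queue.
	EstimatedQueuePosition int `json:"estimatedQueuePosition"`
}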
Adds the estimated queue to the NAB restore object Signed-off-by: Michal Pryc <[email protected]>
Test coverage for the status queue within restore operation. Signed-off-by: Michal Pryc <[email protected]>
CRD updates for the following non-admin PR: migtools/oadp-non-admin#128 Signed-off-by: Michal Pryc <[email protected]>
Blocked by: openshift/oadp-operator#1607
@mpryc avoid using
var veleroRestoreList velerov1.RestoreList
labelSelector := client.MatchingLabels{labelKey: labelValue}

if err := clientInstance.List(ctx, &veleroRestoreList, client.InNamespace(namespace), labelSelector); err != nil {
What you want here is to get all the NAC-"owned" Velero restores?
I suggest doing it like this:
func GetActiveOwnedVeleroRestores(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Restore, error) {
	var veleroRestoreList velerov1.RestoreList
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, &veleroRestoreList); err != nil {
		return nil, err
	}
	var activeRestores []velerov1.Restore
	for _, restore := range veleroRestoreList.Items {
		if restore.Status.CompletionTimestamp == nil && CheckVeleroRestoreMetadata(restore) {
			activeRestores = append(activeRestores, restore)
		}
	}
	if len(activeRestores) == 0 {
		return nil, nil
	}
	return activeRestores, nil
}
I want all the NAC-owned restores that do not have restore.Status.CompletionTimestamp, which means they are still waiting to be handled. We compute the queue number from the list of objects that are still subject to some work by Velero; the ones that are completed (have the CompletionTimestamp) are no longer relevant.
Will change the implementation to use ListObjectsByLabel.
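For illustration, a minimal sketch of what that change could look like (assuming ListObjectsByLabel accepts a client.ObjectList pointer, as in the suggestion above; the function name matches the one referenced in the diff further down):

func GetActiveVeleroRestoresByLabel(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Restore, error) {
	veleroRestoreList := &velerov1.RestoreList{}
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, veleroRestoreList); err != nil {
		return nil, err
	}
	// Keep only restores Velero has not completed yet; these still occupy a queue position.
	var activeRestores []velerov1.Restore
	for _, restore := range veleroRestoreList.Items {
		if restore.Status.CompletionTimestamp == nil {
			activeRestores = append(activeRestores, restore)
		}
	}
	if len(activeRestores) == 0 {
		return nil, nil
	}
	return activeRestores, nil
}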
Should this also be changed? I don't want this one to be any different. In this PR or a separate one?
oadp-non-admin/internal/common/function/function.go
Lines 246 to 267 in 56afada
func GetActiveVeleroBackupsByLabel(ctx context.Context, clientInstance client.Client, namespace, labelKey, labelValue string) ([]velerov1.Backup, error) {
	var veleroBackupList velerov1.BackupList
	labelSelector := client.MatchingLabels{labelKey: labelValue}
	if err := clientInstance.List(ctx, &veleroBackupList, client.InNamespace(namespace), labelSelector); err != nil {
		return nil, err
	}
	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}
	if len(activeBackups) == 0 {
		return nil, nil
	}
	return activeBackups, nil
}
I am OK with doing that change in this PR as well.
I will leave it as it is in this PR, if you agree - I changed the backup function and now the tests are failing, always showing queue 0. I don't want to spend time in this PR debugging why. The current implementation works as expected.
Here is what I checked (even without checking metadata):
func GetActiveOwnedVeleroBackups(ctx context.Context, clientInstance client.Client, namespace string) ([]velerov1.Backup, error) {
	veleroBackupList := &velerov1.BackupList{}
	if err := ListObjectsByLabel(ctx, clientInstance, namespace, constant.ManagedByLabel, constant.ManagedByLabelValue, veleroBackupList); err != nil {
		return nil, err
	}
	// Filter out backups with a CompletionTimestamp
	var activeBackups []velerov1.Backup
	for _, backup := range veleroBackupList.Items {
		if backup.Status.CompletionTimestamp == nil {
			activeBackups = append(activeBackups, backup)
		}
	}
	if len(activeBackups) == 0 {
		return nil, nil
	}
	return activeBackups, nil
}
I do not think we need a new predicate and handler for the queue.
Doesn't making these changes on the current ones give the same result?
diff --git a/internal/handler/velerorestore_handler.go b/internal/handler/velerorestore_handler.go
index 515de66..ff2b23c 100644
--- a/internal/handler/velerorestore_handler.go
+++ b/internal/handler/velerorestore_handler.go
@@ -21,6 +21,7 @@ import (
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/workqueue"
+ "sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/reconcile"
@@ -29,7 +30,10 @@ import (
)
// VeleroRestoreHandler contains event handlers for Velero Restore objects
-type VeleroRestoreHandler struct{}
+type VeleroRestoreHandler struct {
+ Client client.Client
+ OADPNamespace string
+}
// Create event handler
func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ workqueue.RateLimitingInterface) {
@@ -37,18 +41,42 @@ func (VeleroRestoreHandler) Create(_ context.Context, _ event.CreateEvent, _ wor
}
// Update event handler adds Velero Restore's NonAdminRestore to controller queue
-func (VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
+func (h VeleroRestoreHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroRestoreHandler")
annotations := evt.ObjectNew.GetAnnotations()
nonAdminRestoreName := annotations[constant.NarOriginNameAnnotation]
nonAdminRestoreNamespace := annotations[constant.NarOriginNamespaceAnnotation]
- q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
- Name: nonAdminRestoreName,
- Namespace: nonAdminRestoreNamespace,
- }})
- logger.V(1).Info("Handled Update event")
+ if function.CheckVeleroRestoreMetadata(evt.ObjectNew) {
+ q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+ Name: nonAdminRestoreName,
+ Namespace: nonAdminRestoreNamespace,
+ }})
+ logger.V(1).Info("Handled Update event")
+ }
+
+ restores, err := function.GetActiveVeleroRestoresByLabel(ctx, h.Client, h.OADPNamespace)
+ if err != nil {
+ logger.Error(err, "Failed to get Velero Restores by label")
+ return
+ }
+
+ if restores != nil {
+ for _, restore := range restores {
+ annotations := restore.GetAnnotations()
+ originName := annotations[constant.NarOriginNameAnnotation]
+ originNamespace := annotations[constant.NarOriginNamespaceAnnotation]
+
+ if originName != nonAdminRestoreName || originNamespace != nonAdminRestoreNamespace {
+ logger.V(1).Info("Processing Queue update for the NonAdmin Restore referenced by Velero Restore", "Name", restore.Name, constant.NamespaceString, restore.Namespace, "CreatedAt", restore.CreationTimestamp)
+ q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
+ Name: originName,
+ Namespace: originNamespace,
+ }})
+ }
+ }
+ }
}
// Delete event handler
diff --git a/internal/predicate/velerorestore_predicate.go b/internal/predicate/velerorestore_predicate.go
index 8dfdae8..e63e75b 100644
--- a/internal/predicate/velerorestore_predicate.go
+++ b/internal/predicate/velerorestore_predicate.go
@@ -19,6 +19,7 @@ package predicate
import (
"context"
+ velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
"sigs.k8s.io/controller-runtime/pkg/event"
"github.com/migtools/oadp-non-admin/internal/common/function"
@@ -40,6 +41,12 @@ func (p VeleroRestorePredicate) Update(ctx context.Context, evt event.UpdateEven
logger.V(1).Info("Accepted Update event")
return true
}
+ newRestore, _ := evt.ObjectNew.(*velerov1.Restore)
+ oldRestore, _ := evt.ObjectOld.(*velerov1.Restore)
+ if oldRestore.Status.CompletionTimestamp == nil && newRestore.Status.CompletionTimestamp != nil {
+ logger.V(1).Info("Accepted Update event: Restore completion timestamp")
+ return true
+ }
}
logger.V(1).Info("Rejected Update event")
Here I think we should stick with the handler/predicate approach, for two reasons:
- Be similar to how backup works:
oadp-non-admin/internal/predicate/velerobackup_queue_predicate.go
Lines 37 to 60 in 56afada
func (p VeleroBackupQueuePredicate) Update(ctx context.Context, evt event.UpdateEvent) bool {
	logger := function.GetLogger(ctx, evt.ObjectNew, "VeleroBackupQueuePredicate")
	// Ensure the new and old objects are of the expected type
	newBackup, okNew := evt.ObjectNew.(*velerov1.Backup)
	oldBackup, okOld := evt.ObjectOld.(*velerov1.Backup)
	if !okNew || !okOld {
		logger.V(1).Info("Rejected Backup Update event: invalid object type")
		return false
	}
	namespace := newBackup.GetNamespace()
	if namespace == p.OADPNamespace {
		if oldBackup.Status.CompletionTimestamp == nil && newBackup.Status.CompletionTimestamp != nil {
			logger.V(1).Info("Accepted Backup Update event: new completion timestamp")
			return true
		}
	}
	logger.V(1).Info("Rejected Backup Update event: no changes to the CompletionTimestamp in the VeleroBackup object")
	return false
}

func (h VeleroBackupQueueHandler) Update(ctx context.Context, evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
- In k8s the predicate runs before the handler, so we avoid invoking reconcile in cases where we can save calls. In the above we save them when the CompletionTimestamp is set from nil to something, meaning the object is "done". Only those transitions are interesting to us with regard to how the queue works: those objects may have entered the queue before the ones that need their queue info updated, while anything that happened later is placed later in the queue and will not affect the objects in front. This saves quite a few calls at the predicate level. Secondly, you are removing the annotations check from the handler, which makes the reconcile responsible for it. Reconcile comes after the handler, so that again means more calls instead of fewer.
I made some tests outside of a cluster, and I am afraid we need to merge them together.
If the queue predicate is triggered, both handlers are triggered, right?
If an admin backup triggers this, the Velero backup handler will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.
I don't follow this use case.
The following handlers are triggered after the predicates:
oadp-non-admin/internal/controller/nonadminbackup_controller.go
Lines 699 to 700 in 56afada
Watches(&velerov1.Backup{}, &handler.VeleroBackupHandler{}).
Watches(&velerov1.Backup{}, &handler.VeleroBackupQueueHandler{
So both handlers are triggered based on the composite predicate; each handler does a different job.
This one only for the object which actually triggered the event:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53
And this one for all other objects to update statuses:
https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_handler.go#L41-L53
See the namespace thing.
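For the restore side, the analogous registration would presumably mirror those backup lines (a sketch only; the restore-side queue handler name and fields are assumptions about this PR, not quoted code):

// Sketch: two watches on the same Velero Restore type; one handler reacts to the
// restore that triggered the event, the other re-queues the remaining NARs.
Watches(&velerov1.Restore{}, &handler.VeleroRestoreHandler{}).
Watches(&velerov1.Restore{}, &handler.VeleroRestoreQueueHandler{
	Client:        r.Client,
	OADPNamespace: r.OADPNamespace,
}).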
If a backup not owned by NAC updates to completed, does that trigger the queue predicate? Yes, right?
Then both handlers are called, and velerobackup_handler.go will try to add a NonAdminBackup with an empty name and namespace to the queue, right? And this creates an endless error.
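The same concern applies to the restore handler introduced here. One minimal way this could be guarded at the handler level (a sketch of the idea only, not code from this PR; the annotation constants are the ones used in the diff above, and whether the check belongs in the handler or in reconcile is exactly the discussion below):

// Sketch: skip enqueueing when the origin annotations are missing, i.e. the
// Velero restore that triggered the event is not owned by NAC.
nonAdminRestoreName := annotations[constant.NarOriginNameAnnotation]
nonAdminRestoreNamespace := annotations[constant.NarOriginNamespaceAnnotation]
if nonAdminRestoreName == "" || nonAdminRestoreNamespace == "" {
	logger.V(1).Info("Skipping Update event: restore has no NonAdmin origin annotations")
	return
}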
Will open a PR for that then.
Can you wait for that one before continuing on this?
I really do prefer to have a working and similar implementation on both parts. With test day having proven this works, I don't want to modify too much in this area at the moment, as I want to focus on #36 and then improve if we find issues. This for me is not an issue, just a small improvement in the implementation. Going off the reconcile is fine IMO at this moment.
A bit more on why I think it's really not that important to make this effort at the moment:
- We already have a tested and working implementation; possibly there are areas to improve, but we need to focus on other parts (sync controller, nabsl).
- The current flow is pretty clean. We have:
a) predicates for each type of interesting event (first defence/filter against unnecessary reconciles)
b) separate handlers for Velero events that work with:
the current NAB object: q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
other interested NAB objects: https://github.com/migtools/oadp-non-admin/blob/master/internal/handler/velerobackup_queue_handler.go#L74-L81
The two above do not cause the same NAB to be reconciled twice.
If we want to modify this and check that a Velero object event does not trigger a reconcile on a non-existing NAB, we would need to move the check that the NAB really exists into the handler, which is not the way it should work. The Reconcile function is the proper place for this check, as it centralizes the logic, ensuring there is only one point where the NAB's existence is verified, rather than duplicating this check in the handler.
Aren't we rewriting reconcile yet one more time?
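For context, the reconcile-side check being referred to is the usual controller-runtime pattern (a sketch of the pattern, not the exact code in this repo):

// Sketch: Reconcile fetches the NonAdminBackup first and stops quietly if it no
// longer exists, so stray queue entries coming from handlers are harmless.
nab := &nacv1alpha1.NonAdminBackup{}
if err := r.Get(ctx, req.NamespacedName, nab); err != nil {
	if apierrors.IsNotFound(err) {
		// The NAB referenced by the event no longer exists; nothing to do.
		return ctrl.Result{}, nil
	}
	return ctrl.Result{}, err
}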
@mateusoliveira43 @mpryc Let's keep the restore queue work similar to what we had in the backup queue. If there are cases that are not covered by the current approach, let's create issues for those, and they can be fixed in follow-on PRs.
@mateusoliveira43 based on what @shubham-pampattiwar wrote, is there anything left for me to do to get this merged? Please let me know so I can work on it if needed.
// TODO(migi): do we need estimatedQueuePosition in VeleroRestoreStatus?
updatedQueueInfo := false

// Determine how many Backups are scheduled before the given VeleroRestore in the OADP namespace.
typo Backups
Fixed
Signed-off-by: Michal Pryc <[email protected]>
@@ -100,6 +100,13 @@ func checkTestNonAdminRestoreStatus(nonAdminRestore *nacv1alpha1.NonAdminRestore
		return fmt.Errorf("NonAdminRestore Status Conditions [%v] Message %v does not contain expected message %v", index, nonAdminRestore.Status.Conditions[index].Message, expectedStatus.Conditions[index].Message)
	}
}
This is wrong. If nonAdminRestore.Status.QueueInfo is nil and expectedStatus.QueueInfo is not nil, no error is raised.
This happens in the test case "Should update NonAdminRestore until it invalidates and then delete it": the QueueInfo of that NAR will be nil after reconciliation, but the expected value is QueueInfo: &nacv1alpha1.QueueInfo{EstimatedQueuePosition: 0}.
To fully compare nonAdminRestore.Status.QueueInfo and expectedStatus.QueueInfo, the code needs to be updated to something like this:
if nonAdminRestore.Status.QueueInfo != nil {
	if expectedStatus.QueueInfo == nil {
		return fmt.Errorf("message")
	}
	if nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition != expectedStatus.QueueInfo.EstimatedQueuePosition {
		return fmt.Errorf("NonAdminRestore Status QueueInfo EstimatedQueuePosition %v is not equal to expected %v", nonAdminRestore.Status.QueueInfo.EstimatedQueuePosition, expectedStatus.QueueInfo.EstimatedQueuePosition)
	}
} else {
	if expectedStatus.QueueInfo != nil {
		return fmt.Errorf("message")
	}
}
@@ -357,6 +364,9 @@ var _ = ginkgo.Describe("Test full reconcile loop of NonAdminRestore Controller"
			LastTransitionTime: metav1.NewTime(time.Now()),
		},
	},
	QueueInfo: &nacv1alpha1.QueueInfo{
I do not know what happens if we tell Velero to restore a backup that is in progress. I suspect it fails.
To make this test scenario more realistic, I would change VeleroBackup.Status to phase Completed, with a CompletionTimestamp, and have QueueInfo.EstimatedQueuePosition be zero.
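A minimal sketch of that suggested fixture change (field and constant names are from the upstream Velero API; where exactly it goes in the test setup depends on this PR), with the expected QueueInfo.EstimatedQueuePosition then being zero as suggested:

// Sketch: mark the VeleroBackup used by the test as already completed, so the
// restore scenario does not depend on restoring an in-progress backup.
Status: velerov1.BackupStatus{
	Phase:               velerov1.BackupPhaseCompleted,
	CompletionTimestamp: &metav1.Time{Time: time.Now()},
},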
There is no test for the GetRestoreQueueInfo function, right? Would it be interesting to add one?
CRD updates for the following non-admin PR: migtools/oadp-non-admin#128 Signed-off-by: Michal Pryc <[email protected]>
Adds the estimated queue to the NAB restore object
Why the changes were made
To include the estimated queue number for the restore object.
How to test the changes made
make simulation-test