fix: NAC install validation #1577
base: master
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mateusoliveira43
controllers/nonadmin_controller.go
Outdated
return false, err
}
dpaList := &oadpv1alpha1.DataProtectionApplicationList{}
err = r.List(r.Context, dpaList, &client.ListOptions{FieldSelector: selector})
A corner case here is when 2 (or more) DPAs are created at the same time. But since DPA reconciles every minute, that's not too bad? (The DPAs would have error messages in their statuses, but multiple NACs would also exist in the cluster. To really avoid this, we could delete the NAC deployment before erroring out.)
Another question here: since DPA reconciles every minute, is it too much to do a cluster-wide get call every minute?
I would compare creationTimestamp. When a second reconcile happens, if ever, the DPA with the later creationTimestamp would remove the NAC it created.
Or a more drastic change would be to move this controller out of the current ReconcileBatch into its own thing.
Thinking again, I think we do not need to worry about having multiple NAC deployments after validation:
- If 2 DPAs with NAC enabled are created at the same time, when they reach this part, neither of them will deploy NAC
- If one DPA is created and after some time another is created, only the first DPA's NAC will be deployed
Right?
This comment was marked as outdated.
@shubham-pampattiwar it's not clear to me we've come to a conclusion on 1 NAC per cluster vs. a NAC married to a particular OADP deployment. If this is a temporary measure until we can put in the code to pair an OADP deployment with a NAC, then I think this is fine.
This is not a temporary measure. The current PR implements the following behavior:
Upon installing OADP in another namespace with NAC enabled, a user gets the following error message.
/test ci/prow/4.18-dev-preview-e2e-test-aws
controllers/nonadmin_controller.go
Outdated
}
for _, dpa := range dpaList.Items {
if (&DPAReconciler{dpa: &dpa}).checkNonAdminEnabled() {
return false, fmt.Errorf("only a single instance of Non-Admin Controller can be installed across the entire cluster. Non-Admin controller is also configured to be installed in %s namespace", dpa.Namespace)
Let's re-frame the error text a little bit: "only a single instance of Non-Admin Controller can be installed across the entire cluster. Non-Admin controller is already configured and installed in %s namespace", dpa.Namespace
I would like to keep it as it is.
From what I tested, NAC will never be deployed 2 or more times. Either one is installed (one was installed and now a second is trying to be installed) or none (2 or more are trying to be installed at the same time).
So your suggestion would make sense for the DPA that tried to install NAC when it already exists in the cluster, but not for the DPA that already installed it, or for other cases (error messages will appear in all DPAs, because DPA reconciles every minute). Do you agree?
User does the following:
- Creates DPA 1
- Enables NAC for DPA 1
- In another namespace, OADP is installed and DPA 2 is created there
- Enables NAC for DPA 2
I want error message only on DPA 2.
To have error only on "wrong" DPAs, I would need to know if NAC is already deployed (do a GET call for deployments in OADP namespace) prior to erroring out. Would this not complicate validation too much? (and every minute, DPA would do GET call)
DPA 1 did nothing wrong, so why should DPA 1 be errored out?
Just because we cannot be sure that the error is not in DPA 1.
Examples:
- DPA 1 is created first with NAC enabled -> DPA 2 is created afterwards with NAC enabled -> by doing a deployment get call we can add the error only to DPA 2 ✅
- DPA 1 and DPA 2 are created (or edited) at the same time with NAC enabled -> in this case, which DPA do we error out? ❌
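The two examples above can be modelled with a small sketch. It is an assumption-laden illustration rather than the PR's implementation: `validateNAC` is a hypothetical helper, namespaces are plain strings, and `deployedIn` stands for the result of the deployment get call discussed in this thread.

```go
package main

import "fmt"

// validateNAC is a hypothetical helper. current is the namespace of the DPA being
// reconciled, enabled lists the namespaces of all DPAs with NAC enabled (including
// current), and deployedIn is the namespace where a NAC deployment already exists
// ("" when none is deployed yet).
func validateNAC(current string, enabled []string, deployedIn string) error {
	if deployedIn == current {
		// Example 1, DPA 1: this DPA already owns the running NAC, so no error.
		return nil
	}
	if deployedIn != "" {
		// Example 1, DPA 2: NAC already runs elsewhere, so only this DPA errors out.
		return fmt.Errorf("Non-Admin Controller is already installed in %s namespace", deployedIn)
	}
	// Example 2: nothing is deployed yet. If another DPA also asks for NAC there is
	// no way to tell which one is "wrong", so each of them reports the error.
	for _, ns := range enabled {
		if ns != current {
			return fmt.Errorf("Non-Admin Controller is also configured to be installed in %s namespace", ns)
		}
	}
	return nil
}
```

With this shape the first example produces an error only on DPA 2, while the second leaves both DPAs in error and deploys no NAC, which matches the behavior described earlier in the conversation.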
For case 2, we don't decide that; let the DPA controller do its work. Either of the DPAs will get the error for sure, right?
We could create a cluster-wide ConfigMap lock in the kube-system namespace to ensure one lease/lock is acquired by the running NAC, or use a Lease for that until NAC is gone: https://kubernetes.io/docs/concepts/architecture/leases/.
Then, during DPA reconcile, check whether this lease/lock is acquired and, if yes, ensure it's actually not orphaned, and then run the NAC.
Another common option is to use leader election to achieve this, but that would apply to the entire operator rather than one controller, unless there is a way to set it dynamically during DPA reconcile:
https://docs.redhat.com/en/documentation/openshift_container_platform/4.17/html/operators/developing-operators#osdk-leader-election
@shubham-pampattiwar that is my fear. I think with your suggested approach, both will deploy NAC
@mpryc NAC has leader election, I will test what happens
/test 4.18-dev-preview-e2e-test-aws
dpa.Spec.NonAdmin.Enable != nil {
return *dpa.Spec.NonAdmin.Enable
if r.dpa.Spec.NonAdmin != nil && r.dpa.Spec.NonAdmin.Enable != nil {
return *r.dpa.Spec.NonAdmin.Enable && r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal
Do we need to ensure the comparison is against "True", "TRUE" or "true", so regardless of capital/small letters?
import (
"strings"
[...]
return *r.dpa.Spec.NonAdmin.Enable && strings.EqualFold(r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck], TrueVal)
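To make the difference between the two comparisons concrete, here is a small standalone sketch. The constants are assumptions: the key name is taken from the error message quoted later in this thread, and the expected value is assumed to be "true". `ackGiven` is a hypothetical helper, not a function from the operator.

```go
package main

import "strings"

// Both constants are assumptions made for illustration.
const (
	techPreviewAck = "tech-preview-ack"
	trueVal        = "true"
)

// ackGiven reports whether the tech preview acknowledgement is set.
// With caseInsensitive=false it behaves like the == comparison in the diff;
// with caseInsensitive=true it behaves like the strings.EqualFold suggestion.
func ackGiven(overrides map[string]string, caseInsensitive bool) bool {
	v := overrides[techPreviewAck] // a missing key yields "", even for a nil map
	if caseInsensitive {
		return strings.EqualFold(v, trueVal)
	}
	return v == trueVal
}
```

With a value of "True", the strict comparison reports false while the case-insensitive one reports true; both reject a missing key.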
No need: if the user is not using the right value, the error message tells them how to set it up
oadp-operator/controllers/validator.go
Line 78 in 4b043ba
return false, errors.New("in order to enable/disable the non-admin feature please set dpa.spec.unsupportedOverrides[tech-preview-ack]: 'true'")
if r.checkNonAdminEnabled() {
if !(dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal) {
if r.dpa.Spec.NonAdmin != nil && r.dpa.Spec.NonAdmin.Enable != nil && *r.dpa.Spec.NonAdmin.Enable {
if !(r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal) {
Same ask as above (case-insensitive comparison)?
/hold
As discussed offline, the NAC controller check should be moved to validations, to ensure there won't be a situation where the "wrong" DPA allows the Velero pod to run and then fails DPA reconcile with an error.
Signed-off-by: Mateus Oliveira <[email protected]>
improve error messages Signed-off-by: Mateus Oliveira <[email protected]>
add tests Signed-off-by: Mateus Oliveira <[email protected]>
working example Signed-off-by: Mateus Oliveira <[email protected]>
use another client Signed-off-by: Mateus Oliveira <[email protected]>
validate first Signed-off-by: Mateus Oliveira <[email protected]>
6a598cc to 8bc854d
/unhold
@mateusoliveira43: The following tests failed, say
Why the changes were made
Only allow one NAC install per cluster.
Related to migtools/oadp-non-admin#107
Also, add validation for only allowing one DPA per OADP installation namespace.
How to test the changes made
Deploy 2 (or more) OADP operators (make deploy-olm) in different namespaces. Create 2 (or more) DPAs in those namespaces with NAC enabled. Only one NAC should be deployed and the DPAs should have errors in their statuses.