
fix: NAC install validation #1577

Open · wants to merge 6 commits into base: master
Conversation

@mateusoliveira43 (Contributor) commented Oct 25, 2024

Why the changes were made

Only allow one NAC install per cluster.

Related to migtools/oadp-non-admin#107

Also, add validation for only allowing one DPA per OADP installation namespace.

How to test the changes made

Deploy 2 (or more) OADP operators (make deploy-olm) in different namespaces. Create 2 (or more) DPAs in those namespaces with NAC enabled. Only one NAC should be deployed, and the remaining DPAs should report an error in their statuses.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2024

openshift-ci bot commented Oct 25, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Oct 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mateusoliveira43

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2024
```go
	return false, err
}
dpaList := &oadpv1alpha1.DataProtectionApplicationList{}
err = r.List(r.Context, dpaList, &client.ListOptions{FieldSelector: selector})
```
@mateusoliveira43 (author) replied:

A corner case here would be if 2 (or more) DPAs are created at the same time. But since the DPA reconciles every minute, that's not too bad? (The DPAs would have an error message in their statuses, but multiple NACs would also exist in the cluster. To really avoid this, we could delete the NAC deployment before erroring out.)

Another question here: since the DPA reconciles every minute, is it too much to do a cluster-wide get call every minute?

@kaovilai (Member) commented Oct 25, 2024:

I would compare creationTimestamp: when a second reconcile happens, if ever, the DPA with the later creationTimestamp would remove the NAC it created.

A Member replied:

Or a more drastic change would be to move this controller out of the current ReconcileBatch into its own thing.

@mateusoliveira43 (author) replied:

Thinking again, I don't think we need to worry about having multiple NAC deployments after validation:

  • If 2 DPAs with NAC enabled are created at the same time, when they reach this part, neither of them will deploy NAC

  • If one DPA is created and another is created some time later, only the first DPA's NAC will be deployed

Right?

Review threads (resolved): controllers/validator.go, controllers/nonadmin_controller.go
@mateusoliveira43 mateusoliveira43 marked this pull request as ready for review November 5, 2024 19:52
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2024

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2024

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2024
@weshayutin (Contributor):

@shubham-pampattiwar it's not clear to me that we've come to a conclusion on 1 NAC per cluster vs. a NAC married to a particular OADP deployment. If this is a temporary measure until we can put in the code to pair an OADP deployment with a NAC, then I think this is fine.

@shubham-pampattiwar (Member):

This is not a temporary measure. The current PR implements the behavior that:

  • Only one NAC exists per cluster
  • You can install multiple OADP Operators in the cluster, but only one of them can deploy/enable the NAC feature

@weshayutin (Contributor):

Upon installing OADP in another namespace with NAC enabled, a user gets the following error message:

```yaml
lastTransitionTime: '2024-11-13T01:54:18Z'
message: only a single instance of Non-Admin Controller can be installed across the entire cluster. Non-Admin controller is also configured to be installed in openshift-adp namespace
reason: Error
```

@weshayutin (Contributor):

/test ci/prow/4.18-dev-preview-e2e-test-aws


openshift-ci bot commented Nov 13, 2024

@weshayutin: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.14-ci-index
  • /test 4.14-e2e-test-aws
  • /test 4.14-e2e-test-kubevirt-aws
  • /test 4.14-images
  • /test 4.15-ci-index
  • /test 4.15-e2e-test-aws
  • /test 4.15-e2e-test-kubevirt-aws
  • /test 4.15-images
  • /test 4.16-ci-index
  • /test 4.16-e2e-test-aws
  • /test 4.16-e2e-test-kubevirt-aws
  • /test 4.16-images
  • /test 4.17-ci-index
  • /test 4.17-e2e-test-aws
  • /test 4.17-e2e-test-kubevirt-aws
  • /test 4.17-images
  • /test 4.18-dev-preview-ci-index
  • /test 4.18-dev-preview-images
  • /test images
  • /test unit-test

The following commands are available to trigger optional jobs:

  • /test 4.18-dev-preview-e2e-test-aws

Use /test all to run all jobs.

In response to this:

/test ci/prow/4.18-dev-preview-e2e-test-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

```go
}
for _, dpa := range dpaList.Items {
	if (&DPAReconciler{dpa: &dpa}).checkNonAdminEnabled() {
		return false, fmt.Errorf("only a single instance of Non-Admin Controller can be installed across the entire cluster. Non-Admin controller is also configured to be installed in %s namespace", dpa.Namespace)
```


Let's re-frame the error text a little bit: only a single instance of Non-Admin Controller can be installed across the entire cluster. Non-Admin controller is already configured and installed in %s namespace", dpa.Namespace

@mateusoliveira43 (author) replied:

I would like to keep it as it is.

From what I tested, NAC will never be deployed 2 or more times. Either one is installed (one was installed and now a second is trying to be installed) or none is (2 or more are trying to be installed at the same time).

So your suggestion would make sense for the case of a DPA that tried to install NAC when it already exists in the cluster, but not for the DPA that already installed it, or for other cases (error messages will appear in all DPAs, because each DPA reconciles every minute). Do you agree?

@shubham-pampattiwar (Member) commented Nov 13, 2024:

User does the following:

  • Creates DPA 1
  • Enables NAC for DPA 1
  • Now, in another namespace, OADP is installed and DPA 2 is created there
  • User enables NAC for DPA 2

I want the error message only on DPA 2.

@mateusoliveira43 (author) replied:

To have the error only on "wrong" DPAs, I would need to know whether NAC is already deployed (do a GET call for deployments in the OADP namespace) before erroring out. Would this not complicate validation too much? (And the DPA would make that GET call every minute.)


DPA 1 did nothing wrong, why should DPA 1 be errored out?

@mateusoliveira43 (author) replied:

Just because we cannot be sure that the error is not in DPA 1.

Examples:

  • DPA 1 is created first with NAC enabled -> DPA 2 is created after with NAC enabled -> by doing a deployment get call we can add the error only to DPA 2 ✅

  • DPA 1 and DPA 2 are created (or edited) at the same time with NAC enabled -> in this case, which DPA do we error out? ❌

@shubham-pampattiwar (Member) commented Nov 13, 2024:

For case 2, we don't decide that; let the DPA controller do its work. Either of the DPAs will get the error for sure, right?

A Contributor commented:

We could create a cluster-wide ConfigMap lock in the kube-system namespace to ensure one lease/lock is acquired by the running NAC, or use a Lease for that until the NAC is gone: https://kubernetes.io/docs/concepts/architecture/leases/.

Then, during DPA reconcile, check whether this lease/lock is acquired and, if so, ensure it is not actually orphaned before running the NAC.

Another common option is to use leader election to achieve this, but that would apply to the entire operator rather than one controller, unless there is a way to set it dynamically during DPA reconcile:
https://docs.redhat.com/en/documentation/openshift_container_platform/4.17/html/operators/developing-operators#osdk-leader-election

@mateusoliveira43 (author) commented Nov 14, 2024:

@shubham-pampattiwar that is my fear. I think that with your suggested approach, both will deploy NAC.

@mpryc NAC has leader election; I will test what happens.

@kaovilai (Member):

/test 4.18-dev-preview-e2e-test-aws

```diff
-dpa.Spec.NonAdmin.Enable != nil {
-	return *dpa.Spec.NonAdmin.Enable
+if r.dpa.Spec.NonAdmin != nil && r.dpa.Spec.NonAdmin.Enable != nil {
+	return *r.dpa.Spec.NonAdmin.Enable && r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal
```
A Contributor commented:

Do we need to ensure the comparison accepts "True", "TRUE", or "true", i.e. regardless of capitalization?

```go
import (
	"strings"
)
// [...]

return *r.dpa.Spec.NonAdmin.Enable && strings.EqualFold(r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck], TrueVal)
```

@mateusoliveira43 (author) replied:

No need; if the user is not using the right value, the error message tells them how to set it up:

```go
return false, errors.New("in order to enable/disable the non-admin feature please set dpa.spec.unsupportedOverrides[tech-preview-ack]: 'true'")
```

```diff
-if r.checkNonAdminEnabled() {
-	if !(dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal) {
+if r.dpa.Spec.NonAdmin != nil && r.dpa.Spec.NonAdmin.Enable != nil && *r.dpa.Spec.NonAdmin.Enable {
+	if !(r.dpa.Spec.UnsupportedOverrides[oadpv1alpha1.TechPreviewAck] == TrueVal) {
```
A Contributor commented:

Same ask as above (case-insensitive comparison)?

@mpryc (Contributor) commented Nov 13, 2024:

/hold As discussed offline, the NAC controller check should be moved to validations, to ensure there won't be a situation where a "wrong" DPA allows the velero pod to run and then fails DPA reconcile with an error.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 13, 2024
Commits (each signed off by Mateus Oliveira <[email protected]>; the first commit's message is not shown above):

  • improve error messages
  • add tests
  • working example
  • use another client
  • validate first
@mateusoliveira43 (author):

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 13, 2024

openshift-ci bot commented Nov 13, 2024

@mateusoliveira43: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name · Commit · Required · Rerun command
ci/prow/4.17-e2e-test-aws · 8bc854d · true · /test 4.17-e2e-test-aws
ci/prow/4.17-e2e-test-kubevirt-aws · 8bc854d · true · /test 4.17-e2e-test-kubevirt-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
