Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure MariaDBAccount gets created with MariaDBDatabase #858

Closed

Conversation

stuggi
Copy link
Contributor

@stuggi stuggi commented Sep 13, 2024

Currently the MariaDBAccount gets created in an early step before the password secret gets validated to be there. In case the service password is missing the deployment stops after the MariaDBAccount is there.
If one deletes the ctlplane at this point, the nova-api MariaDBAccount won't be deleted because the loadDatabaseAndAccountCRs() will not return the account because the MariaDBDatabase object was not created. With this the nova-api MariaDBAccount remains with a finalizer.

When the password secret now is created with a new ctlplane, the old nova-api MariaDBAccount conficts with the new deployment because it will not be created in the db instance and all nova tasks to initialize its DB fail with an access error.

This change moves creating the nova-api MariaDBAccount right before creating the MariaDBDatabase. This reduces the situation that there will be a MariaDBAccount for nova-api without its MariaDBDatabase.

Currently this situation could also happen when the service password is there, but galera is not created properly, like DB root pwd missing.

Jira: OSPRH-10167

Currently the MariaDBAccount gets created in an early step before
the password secret gets validated to be there. In case the service
password is missing the deployment stops after the MariaDBAccount
is there.
If one deletes the ctlplane at this point, the nova-api MariaDBAccount
won't be deleted because the loadDatabaseAndAccountCRs() will not
return the account because the MariaDBDatabase object was not created.
With this the nova-api MariaDBAccount remains with a finalizer.

When the password secret now is created with a new ctlplane, the
old nova-api MariaDBAccount conficts with the new deployment because
it will not be created in the db instance and all nova tasks to
initialize its DB fail with an access error.

This change moves creating the nova-api MariaDBAccount right before
creating the MariaDBDatabase. This reduces the situation that there
will be a MariaDBAccount for nova-api without its MariaDBDatabase.

Currently this situation could also happen when the service password
is there, but galera is not created properly, like DB root pwd missing.

Jira: OSPRH-10167

Signed-off-by: Martin Schuppert <[email protected]>
Copy link
Contributor

openshift-ci bot commented Sep 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stuggi stuggi requested review from zzzeek, gibizer and mrkisaolamb and removed request for frenzyfriday September 13, 2024 12:55
@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

so here's the thing (well two things).

  • Issue one

    There was a notion that the MariaDBAccount creation would eventually move out of the operators entirely and be part of openstack-operator, where the accounts would all be created before the CRs for each sub-operator were even invoked.

    Having the MariaDBAccount creation / validation "early" in the operators here was meant as a kind of dress rehearsal for this flow.

    So if there is a fundamental problem with this flow (I find it a bit hard to follow so far), that's a bigger issue.

    What happens if openstack operator creates all the mariadbaccounts ahead of time?

  • Issue two

    As implied by the above, all of the operators use this flow right now. Does this issue apply to all operators? Can we have a kuttl test that exercises the flow that you are indicating is failing right now?

@stuggi
Copy link
Contributor Author

stuggi commented Sep 13, 2024

so here's the thing (well two things).

  • Issue one
    There was a notion that the MariaDBAccount creation would eventually move out of the operators entirely and be part of openstack-operator, where the accounts would all be created before the CRs for each sub-operator were even invoked.
    Having the MariaDBAccount creation / validation "early" in the operators here was meant as a kind of dress rehearsal for this flow.
    So if there is a fundamental problem with this flow (I find it a bit hard to follow so far), that's a bigger issue.
    What happens if openstack operator creates all the mariadbaccounts ahead of time?
  • Issue two
    As implied by the above, all of the operators use this flow right now. Does this issue apply to all operators? Can we have a kuttl test that exercises the flow that you are indicating is failing right now?

afaik this is currently specific to nova because:

@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

I guess im not following "In case the service password is missing" , what's "the service password" here? EnsureMariaDBAccount generates a password and a secret so that part is not missing

@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

or otherwise I'm just curious to know how to reproduce the problem. the PR is fine

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/76adc48b245a4e0d92e1bb9310369723

✔️ openstack-meta-content-provider SUCCESS in 1h 57m 56s
✔️ nova-operator-kuttl SUCCESS in 42m 47s
nova-operator-tempest-multinode FAILURE in 1h 26m 44s
nova-operator-tempest-multinode-ceph FAILURE in 1h 22m 11s

@stuggi
Copy link
Contributor Author

stuggi commented Sep 13, 2024

I guess im not following "In case the service password is missing" , what's "the service password" here? EnsureMariaDBAccount generates a password and a secret so that part is not missing

sorry with service password I meant the OSP user/password you provide in the osp secret. you can easy reproduce it as I mentioned in the jira. just

  • don't create the osp-secret
  • create a ctlplane
  • nova will create the mariadb account
  • delete the ctlplane
  • the mariadb account remains

@SeanMooney
Copy link
Contributor

so the keystone password is ment to come form the secrete but the database passowrkd will eventually not use that at all and be passed into us

@SeanMooney
Copy link
Contributor

SeanMooney commented Sep 13, 2024

by the way, as @zzzeek noted this code should not be required in the ga release
it was here to aid in the transition to the MariaDB instance being passed into the nova-operator by the OpenStack operator. we probably should have deleted this already prior to ga.
@gibizer @mrkisaolamb do we have everything in place to remove the account creation within the nova operator at this point? i don't recall if we have completed that refactor or not.

assuming we have I'm inliced to just delete this code entirely as non of our docs should be using this and we did not intent support creating them this way in 18.0 ga

@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

we put MariaDBAccount creation in openstack-operator? I missed that

@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

you cant remove this code entirely as it handles the case where the MariaDBAccount CR attribute changes to a new name, this routine then generates a new username/password for that new name. this is in the documentation

@zzzeek
Copy link
Contributor

zzzeek commented Sep 13, 2024

unless we move that to openstack-operator also...

@stuggi
Copy link
Contributor Author

stuggi commented Sep 16, 2024

don't think any code landed in the openstack-operator to pre-create the accounts.

@gibizer
Copy link
Contributor

gibizer commented Sep 16, 2024

Yeah I don't think either that the MariaDBAccount creation is moved to the openstack-operator.

@gibizer
Copy link
Contributor

gibizer commented Sep 16, 2024

Looking into the problem I think the issue is that loadDatabaseAndAccountCRs() assumes that both the MariaDBDatabase and the MariaDBAccount objects exists at the same time (or not present at the same time). But as these two are two separate CRs they can exists independently. One example of such situation is described by @stuggi in this PR above, but there could be other cases, like any failure that prevents MariaDBDatabase creation followed by a ctrlplane deletion will lead to stuck MariaDBAccounts.

Therefore this change is just a band aid but does not remove the root cause of the issue. I think we should separate the handling of the two CRs into two set of functions to avoid that if one is missing then the other is ignored. I think we can keep the Database struct if we want but we should at least needs to make GetAccount() and GetDatabase() to load the CRs independently and replace the usage of GetDatabaseByNameAndAccount() in the service operators with those independent calls.

@stuggi
Copy link
Contributor Author

stuggi commented Sep 16, 2024

Looking into the problem I think the issue is that loadDatabaseAndAccountCRs() assumes that both the MariaDBDatabase and the MariaDBAccount objects exists at the same time (or not present at the same time). But as these two are two separate CRs they can exists independently. One example of such situation is described by @stuggi in this PR above, but there could be other cases, like any failure that prevents MariaDBDatabase creation followed by a ctrlplane deletion will lead to stuck MariaDBAccounts.

yes, thats actually what I tried to describe in my description of the pr what the issue is with loadDatabaseAndAccountCRs().

Therefore this change is just a band aid but does not remove the root cause of the issue. I think we should separate the handling of the two CRs into two set of functions to avoid that if one is missing then the other is ignored. I think we can keep the Database struct if we want but we should at least needs to make GetAccount() and GetDatabase() to load the CRs independently and replace the usage of GetDatabaseByNameAndAccount() in the service operators with those independent calls.

That is correct, we have to fix the underlying issue, but this PR would at least make the nova-operator to behave like the others.

@stuggi
Copy link
Contributor Author

stuggi commented Sep 16, 2024

recheck

@gibizer
Copy link
Contributor

gibizer commented Sep 16, 2024

Therefore this change is just a band aid but does not remove the root cause of the issue. I think we should separate the handling of the two CRs into two set of functions to avoid that if one is missing then the other is ignored. I think we can keep the Database struct if we want but we should at least needs to make GetAccount() and GetDatabase() to load the CRs independently and replace the usage of GetDatabaseByNameAndAccount() in the service operators with those independent calls.

That is correct, we have to fix the underlying issue, but this PR would at least make the nova-operator to behave like the others.

If there is an agreement that we will fix the underlying issue then I'm fine merging this PR as a temporary measure.

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/29beb6e4303f4042b0ba810fee60b6d1

✔️ openstack-meta-content-provider SUCCESS in 1h 35m 53s
✔️ nova-operator-kuttl SUCCESS in 39m 27s
nova-operator-tempest-multinode FAILURE in 1h 11m 15s
nova-operator-tempest-multinode-ceph FAILURE in 1h 21m 16s

@zzzeek
Copy link
Contributor

zzzeek commented Sep 16, 2024

Therefore this change is just a band aid but does not remove the root cause of the issue. I think we should separate the handling of the two CRs into two set of functions to avoid that if one is missing then the other is ignored. I think we can keep the Database struct if we want but we should at least needs to make GetAccount() and GetDatabase() to load the CRs independently and replace the usage of GetDatabaseByNameAndAccount() in the service operators with those independent calls.

So good news it looks like GetDatabaseByNameAndAccount across various operators seems to only be in the delete flow now, which is a pretty limited spot.

If the actual issue here is simply that the deletion needs to locate the MariaDBDatabase and MariaDBAccount separately so that the MariaDBAccount gets deleted in the absense of MariaDBDatabase, why dont we change that directly here and leave the EnsureMariaDBAccount part of things where it is?

Basically I can undertake the task of splitting out GetDatabaseByNameAndAccount across all operators and we can deprecate that function call.

@zzzeek
Copy link
Contributor

zzzeek commented Sep 17, 2024

I've implemented this approach in #862 for nova, making use of a new DeleteDatabaseAndAccountFinalizers function added in openstack-k8s-operators/mariadb-operator#268 . I've confirmed both that I can reproduce the original issue as well as that this new approach solves the problem (and should work in all cases), without the need to change anything about how the mariadbaccount/mariadbdatabase are created up front.

it needs a new mariadb API version but once we confirm that's good to merge and get it in, it's a two line change for all operators that are using the old pattern.

@stuggi
Copy link
Contributor Author

stuggi commented Oct 7, 2024

closing in favor of #862

@stuggi stuggi closed this Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants