Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scenarios for 17.1 env for adoption #597

Merged
merged 1 commit into from
Oct 8, 2024

Conversation

cescgina
Copy link
Contributor

@cescgina cescgina commented Sep 10, 2024

Introduce a scenarios folder that will contain the needed input to
deploy a 17.1 environment using the cifmw role added in [1]. The
scenario is defined by a variable file, with undercloud specific
parameters, overcloud specific parameters, hooks that can be called
before or after both the undercloud and overcloud deployment, and two
maps that relate the groups in inventory produced by the playbook that
created the infra, to Roles and roles hostnames, to make it easier to
work with different roles in different scenarios.

[1] openstack-k8s-operators/ci-framework#2297

osp-computes: "overcloud-computehci"
osp-controllers: "overcloud-controller"
overcloud:
stackname: "overcloud"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great, and allows multi-cell / dcn deployments in the future!

@jistr
Copy link
Contributor

jistr commented Sep 13, 2024

👍

scenarios/hci.yaml Outdated Show resolved Hide resolved
@cescgina cescgina force-pushed the add_osp17_1_scenarios branch 3 times, most recently from ccaa33f to bc5992e Compare September 20, 2024 13:53
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/e04c2741d5604ca78867359472a4e21d

✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 33m 31s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 06m 45s

scenarios/hci.yaml Outdated Show resolved Hide resolved
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a9cbacb4f4ee4665a931c153b5d3d154

adoption-standalone-to-crc-ceph TIMED_OUT in 4h 16m 12s
adoption-standalone-to-crc-no-ceph TIMED_OUT in 4h 16m 42s

@cescgina cescgina force-pushed the add_osp17_1_scenarios branch 2 times, most recently from 1e2a587 to 589a436 Compare September 28, 2024 14:05
cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Sep 30, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Sep 30, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
@cescgina cescgina changed the title [POC] Add scenarios for 17.1 env for adoption Add scenarios for 17.1 env for adoption Sep 30, 2024
@cescgina cescgina marked this pull request as ready for review September 30, 2024 10:08
cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Sep 30, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4037787d89be4b339a782b7567e1ca68

adoption-standalone-to-crc-ceph FAILURE in 2h 34m 28s
✔️ adoption-standalone-to-crc-no-ceph SUCCESS in 2h 40m 10s

@cescgina
Copy link
Contributor Author

recheck

mtu: 1500
vip: true
name_lower: storage
dns_domain: storage.mydomain.tld.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the TLS support (eventually) we'll need to jinjafy this for the domain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working under this assumption that this files should be entirely static, i.e. there would be a 1:1 mapping of scenario and configuration. But your comment make we wonder, will we have a tlse and a non-tlse version of some scenario, or will we have two different scenarios, one for each case? I would go for the latter, if need to change the setup, we should create a new scenario, even if we incur in some duplication

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cescgina the problem I see is that TLS-e is enabled by default in the topology you land (unigamma). You can choose to disable it in the ctlplane/dataplane CRs, or make it configurable.
I don't think we want to duplicate the entire configuration to have something like hci-tls and hci-no-tls, but making it a j2 parameter here means we can process it at ci-framework level where we consume those variables and generated pass it as an extravar to the playbook.
It seems a little effort compared to the complexity of maintaining two scenarios for the same HCI topology.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed templating this is simple enough from the ci-framework side, though I'm wary of adding too much complexity here, to avoid ending up with something similar to what we had in install_yamls devsetup, I'll give it a try. The important question here I think is, will we have variations in the scenarios like this, i.e a hci-tls and hci-no-tls? I worked under the assumption that here we would target a small subset of well defined configurations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have too many variations: VAs/DTs are supposed to be already an "opinionated" combination of components that reflects what customers use, so I don't expect much difference other than things like tls. We can keep this bit also for a follow up, so we can do good choices when we already have a stable codebase to work on. My comment is more like: let's avoid duplication at this stage, and we can consolidate scenarios in a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, thanks for the clarification, as said, the change is simple enough, so I expect I'll be able to updated both PRs with it by the end of day, thanks for the review, it's really useful to clear this points now

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/86fec2e0b2f24a13b1234d474fd2cfd1

✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 33m 50s
adoption-standalone-to-crc-no-ceph RETRY_LIMIT in 13m 03s

@cescgina cescgina force-pushed the add_osp17_1_scenarios branch 3 times, most recently from 23d2024 to e908b21 Compare October 2, 2024 08:55
cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Oct 2, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/ee6667016a734dfa9a6fecc8961761fb

adoption-standalone-to-crc-ceph TIMED_OUT in 4h 13m 50s
adoption-standalone-to-crc-no-ceph NODE_FAILURE Node request 100-0007603188 failed in 0s
adoption-docs-preview POST_FAILURE in 1m 14s

cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Oct 3, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
@slawqo
Copy link
Contributor

slawqo commented Oct 3, 2024

I am definitely not Tripleo expert here but this looks good for me in general.

Copy link
Contributor

@jistr jistr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Joan for putting this together i think it is very good. Now that i was able to think through the whole thing in combo with the other patch, i put in some very "futursitic concerns" as comments but they are not a -1, just some food for thought. LGTM thanks :).

- `config_download_file`: path to the config-download file used to pass
environment variables to the overcloud, required.
- `ceph_osd_spec_file`: path to the osd spec file used to deploy ceph when
applicable, optional.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we'll have use cases with multiple overcloud stacks. IIRC the DCN topology is deployed in TripleO via multiple stacks. I'm not sure if it would be better to have overclouds or a single overcloud and we would later add extra_overclouds or something like that. Even in the DCN scenario i think one of the overcloud stacks is "primary" where the control plane is hosted, and the other stacks are additive with computes only. So a design with a single primary overcloud stack could perhaps be the right thing to start with anyway.

I should add a disclaimer that this area is somewhat outside my comfort zone so it's possible that i'm talking nonsense.

Will tag @bogdando and @gibizer in case they have a clearer idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I think it's "relatively" easy to change the overcloud definition to be a list of overclouds with the same parameters. I can give it a try locally, but waiting for more details on what it would be needed before pushing anything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, you can as well just call them stack or stacks, omitting overcloud terminology.
The env files used for each stack deployment, alongside the pre and post execution hooks, must be instantiated as well. Let them holding only static values for the beginning, no templating yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack thanks @bogdando I'm preparing an update in this line, I'll update the PR once I've tested it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the overcloud section to a stacks one, that should support deploying multiple stacks (each has its own network_data, config-download, etc.)

- "/usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml"
- "/usr/share/openstack-tripleo-heat-templates/environments/enable-legacy-telemetry.yaml"
- "/usr/share/openstack-tripleo-heat-templates/environments/cephadm/cephadm.yaml"
- "/usr/share/openstack-tripleo-heat-templates/environments/cephadm/ceph-mds.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if we should get rid of args and vars and just put the whole deploy command as a variable to set. I mean, will the deployment command always be in the format that is defined in the other patch:

https://github.com/openstack-k8s-operators/ci-framework/pull/2297/files#diff-9e5f7c63638bbac9bfa4df077fe45e16b3acb56feaf89599aacce7a307624498R64

Same for the ceph deploy command, should we just export that out in full as a variable rather than have it in the other patch:

https://github.com/openstack-k8s-operators/ci-framework/pull/2297/files#diff-9a8face42b21d9562691b08533c92de7db4832e6314cb0b9c5479f42dd3d448cR131

In general i tend to be more in favor of exporting larger command chunks as variables perhaps due to being burned by Infrared in the past, where it tried to be very smart and compose the commands for us from various pieces of info, but we almost always ended up fighting it, trying to "trick it" to produce the commands we needed. So its "smartness" quite often stood in the way rather than helped.

I realize there is more complexity to this though, e.g. some of the overcloud deploy input files are generated inside the deployment role rather than provided from the scenario here, so what i'm suggesting above is perhaps easier said then done.

Maybe it would be best to stay the course and just have this as food for thought when designing/amending the overcloud deploy, ceph deploy and similar commands, that when we hardcode the structure for them, we should be hardcoding something quite un-opinionated.

As a last resort, we can introduce an override mechanism later in the form "If this variable is specified, use it as overcloud deploy command. If it is not specified, use the command hardcoded in the deployment role."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, but as you mentioned dealing with the files generated dynamically will be tricky, I like your last suggestion though, to have an optional argument that'll allow the user to override the whole command, I'll add that and update both patches after testing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, i think it could be a separate PR later even, or done on a "when we need it" basis, since IIUC it wouldn't do breaking changes to the interface anyway. Up to you :).

mtu: 1500
vip: true
name_lower: storage
dns_domain: storage.{{ cloud_domain }}.tld.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the .tld bit too but leave the trailing '.'

storage.{{ cloud_domain }}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed, thanks!

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/82a14b9b5957476a9a6a99e0fd217aa6

✔️ adoption-standalone-to-crc-ceph SUCCESS in 2h 34m 23s
adoption-standalone-to-crc-no-ceph NODE_FAILURE Node request 100-0007607025 failed in 0s
✔️ adoption-docs-preview SUCCESS in 1m 25s

service_net_map_replace: external
subnets:
external_subnet:
gateway_ip: '172.21.0.1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just adding a note about what we discussed on friday; remove this one and add the External network to the controllers (otherwise tls setup fails during deployment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, will run a new test and upload the PR

scenarios/hci/roles.yaml Show resolved Hide resolved
scenarios/hci/roles.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@ciecierski ciecierski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Introduce an scenarios folder that will contain the needed input to
deploy a 17.1 environment using the cifmw role added in [1]. The
scenario is defined by a variable file, with undercloud specific
parameters, overcloud specific parameters, hooks that can be called
before or after both the undercloud and overcloud deployment, and two
maps that relate the groups in inventory produced by the playbook that
created the infra, to Roles and roles hostnames, to make it easier to
work with different roles in different scenarios.

[1] openstack-k8s-operators/ci-framework#2297
cescgina added a commit to openstack-k8s-operators/ci-framework that referenced this pull request Oct 7, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
@jistr
Copy link
Contributor

jistr commented Oct 8, 2024

@cescgina Ok to merge?

@cescgina
Copy link
Contributor Author

cescgina commented Oct 8, 2024

@cescgina Ok to merge?

I'd say yes, I think this is a good first version

@jistr jistr merged commit 1a2fb79 into openstack-k8s-operators:main Oct 8, 2024
2 checks passed
@cescgina cescgina deleted the add_osp17_1_scenarios branch October 8, 2024 16:14
openshift-merge-bot bot pushed a commit to openstack-k8s-operators/ci-framework that referenced this pull request Oct 14, 2024
Add a new role that will deploy a tripleo environment that will serve as
source for adoption. This role is expected to cosume the infra created
by [1], and a 17.1 scenario definition from the data-plane-adoption
repo, introduced by [2].

It also introduce a small fix to the deploy-ocp.yml so the resulting ocp
cluster is ready (the nodes needed to be uncordoned).

[1] #2285
[2] openstack-k8s-operators/data-plane-adoption#597
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants