Introduce DCN DT #392

sbekkerm · 2024-09-09T18:03:45Z

This PR introduces DCN VA, which builds upon the HCI VA architecture and is designed for multi-site deployment.

In addition to the regular configuration files, this PR includes Jinja templates for generating the values.yaml and service-values.yaml files. These templates are essential for Zuul job execution: the job can render them repeatedly, producing site-specific configuration files for each DCN site.
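For readers following along, here is a minimal sketch of the per-site rendering idea. It is not the playbook from this PR: it uses Python's standard-library `string.Template` as a stand-in for Jinja2, and the template text, the `render_site_values` helper, and the site name `az1` are assumptions made for illustration only.

```python
from string import Template

# Hypothetical template text. The real values.yaml.j2 in this PR is a Jinja2
# template and is not reproduced here; string.Template is only a stand-in.
VALUES_TEMPLATE = Template(
    "site: $site\n"
    "ceph_cluster: $ceph_cluster\n"
)


def render_site_values(sites):
    """Return a mapping of site name -> rendered values text, one per site."""
    rendered = {}
    for site in sites:
        # Assumption for the sketch: each site's Ceph cluster is named after the site.
        rendered[site] = VALUES_TEMPLATE.substitute(site=site, ceph_cluster=site)
    return rendered


if __name__ == "__main__":
    for name, text in render_site_values(["az0", "az1", "az2"]).items():
        print(f"--- values for {name} ---")
        print(text)
```

The point of the sketch is only that one template yields one file per site, so adding a site means adding an entry to the list rather than hand-editing another copy.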

openshift-ci · 2024-09-09T18:03:56Z

Hi @sbekkerm. Thanks for your PR.

I'm waiting for a openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

fultonj · 2024-09-16T12:41:25Z

@sbekkerm I see two changes needed up front.

Written Instructions

The README files are incomplete. Please see the four stage READMEs for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

They contain English instructions that someone can read to implement the VA without ci-framework, using only the produced k8s manifests. If there are external automations for them, that's fine, but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the VA1 directions are still there and not updated. In my early example, someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

sbekkerm · 2024-09-16T13:21:33Z

It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
Additionally, it mentions that Steps 3 and 4 and the Ceph installation need to be executed for each DCN site.
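A rough sketch of the ordering described above, for illustration only. It assumes that Steps 3 and 4 correspond to the pre-Ceph and post-Ceph data plane stages, that the Ceph installation sits between them, and that the control plane is deployed once; the `deployment_plan` helper and the stage labels are hypothetical.

```python
def deployment_plan(sites):
    """Return an ordered list of stage labels for a multi-site deployment.

    Assumption: the control plane is deployed once, then each site repeats
    the pre-Ceph stage, the Ceph installation, and the post-Ceph stage.
    """
    plan = ["control-plane"]
    for site in sites:
        plan.append(f"{site}: dataplane-pre-ceph")
        plan.append(f"{site}: ceph-install")
        plan.append(f"{site}: dataplane-post-ceph")
    return plan


if __name__ == "__main__":
    for step in deployment_plan(["az0", "az1"]):
        print(step)
```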

The main difference between the DCN VA and the HCI VA is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Why should we use DT instead of VA?

fultonj · 2024-09-16T13:27:59Z

Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility

Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT; it should be a new DT. Either way, it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI, but it is not yet ready to be a full-blown VA that we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have, I think it should be a DT.

sbekkerm · 2024-09-16T14:45:44Z

The README contains all the steps to reproduce the environment. Could you please clarify what specifically is unclear?

These are the steps for deploying the control plane:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane.md
These are the steps for preparing nodes for Ceph installation:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-pre-ceph.md
These are the steps for configuring nodes after Ceph installation:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md

The Jinja templates are not involved in the deployment process, as they are only used by CI to set the "CHANGEME" parameters in values.yaml and service-values.yaml.
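For anyone filling in the values files by hand instead of through CI, a small check like the following could flag parameters that are still unset. This is a sketch under assumptions: the `find_unset_parameters` helper is hypothetical, and it only does a plain text search for the "CHANGEME" marker rather than parsing YAML.

```python
def find_unset_parameters(text, marker="CHANGEME"):
    """Return (line_number, stripped_line) pairs for lines still containing the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


if __name__ == "__main__":
    # Hypothetical file content, not taken from the PR.
    sample = "a: 1\nb: CHANGEME\nc:\n  d: CHANGEME\n"
    for number, line in find_unset_parameters(sample):
        print(f"line {number}: {line}")
```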

fultonj · 2024-09-16T18:39:40Z

@sbekkerm

Please move this from the va directory to the dt directory.

I am putting it in my backlog to go through your README line by line and attempt to reproduce what you have deployed. I'll ask you clarifying questions along the way, which will point out what is incomplete in the READMEs.

sbekkerm · 2024-09-17T09:25:17Z

@fultonj

Moved from va to dt directory as requested.
Also added a high-level diagram to the README for more clarity. Let me know if you have any questions.

krcmarik

I am suggesting some changes based on what we had in the 17.1 OpenStack service configs and/or to make tempest tests pass. (With the proposed changes, which I applied manually, I managed to get all tests in the compute and volume tempest suites to pass.)

examples/dt/dcn/service-values.yaml

examples/dt/dcn/service-values.yaml.j2

examples/dt/dcn/service-values.yaml

examples/dt/dcn/values.yaml.j2

examples/dt/dcn/service-values.yaml.j2

fultonj · 2024-10-01T18:15:24Z

@sbekkerm Could this be rebased?

On Thursday Oct 3rd I'll do a deployment to test this proposed patch and leave feedback (sorry for the delay).

krcmarik

One more suggestion for the CR

examples/dt/dcn/service-values.yaml

examples/dt/dcn/service-values.yaml.j2

krcmarik · 2024-10-03T00:55:10Z

@sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.

sbekkerm · 2024-10-03T11:09:40Z

In my view, it's a great suggestion. This feature will be beneficial in cases where AZ sites contain many compute nodes. Additionally, we can use this when we spread the control plane across AZs.

fultonj · 2024-10-03T12:44:40Z

I agree.

My understanding is that there is agreement and that we can go ahead with using multicell in this DT.

That said, there might be people who don't use multicell, but I still think we should test this with multicell. If there were a way to run the resultant job with or without multicell by flipping one switch, that would be really nice, but that might be too much to ask in this first edition.

fultonj · 2024-10-03T12:50:06Z

@krcmarik do you want to try that in a stacked patch?

That is, create a new PR which has been rebased on this one so that we can merge this one sooner.

hjensas · 2024-10-08T21:13:54Z

I am not a fan of the introduction of jinja2 templates - and the custom playbook to execute them.
Can we not instead define multiple nodesets, and use the "standard" kustomize_deploy role to apply this?
There are at least two other PRs proposed that use the multiple nodeset approach.

sbekkerm · 2024-10-09T08:40:17Z

> I am not a fan of the introduction of jinja2 templates - and the custom playbook to execute them. Can we not instead define multiple nodesets, and use the "standard" kustomize_deploy role to apply this? There are at least two other PRs proposed that use the multiple nodeset approach.

Add multi-cell DT #401

DT - BMO deploy with preprovisioningNetworkData #413

@hjensas
I get what you mean about preferring multiple nodesets with the "standard" role, but the DCN setup is a bit more complex than that.

It’s not just about handling multiple nodesets with spine and leaf—it’s about managing multiple sites, each with its own configuration for services like Glance and Cinder. Plus, each site needs its own Ceph cluster, which adds another layer of complexity. That’s why we went with Jinja2 templates and a custom playbook for this.

But I do agree with you that eventually integrating this into the ci-framework would be ideal.

fultonj · 2024-10-22T14:01:36Z

> @sbekkerm I see two changes needed up front.
>
> Written Instructions
>
> VA vs DT

These changes have been addressed.

fmount · 2024-10-23T10:32:48Z

examples/dt/dcn/service-values.yaml

+        rbd_cluster_name = az2
+        backend_availability_zone = az2
+  glance:
+    customServiceConfig: |


This customServiceConfig at glance level seems redundant to me.
Given you're defining the backend for each API under glanceAPIs (including az0), you can basically remove from L69 to L79 [1].
You inherit the top-level customServiceConfig only if it's not defined for that specific instance, and this does not seem to be the case here.
@fultonj do you want to update the glance config?

[1] https://github.com/openstack-k8s-operators/glance-operator/blob/main/controllers/glance_controller.go#L890
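To make the inheritance rule in this comment concrete, here is a small sketch. It models only the rule as stated above (the top-level customServiceConfig applies to an instance only when that instance defines none); it is not the operator's Go code. The `effective_custom_service_config` helper is hypothetical, and treating an empty string the same as a missing value is an assumption.

```python
def effective_custom_service_config(top_level, per_api):
    """Return the config each Glance API instance would end up with.

    per_api maps an instance name to its own customServiceConfig, or to
    None/"" when the instance does not define one (assumption: both mean
    "not defined").
    """
    return {
        name: (own if own else top_level)
        for name, own in per_api.items()
    }


if __name__ == "__main__":
    # Hypothetical config snippets, not taken from service-values.yaml.
    top = "[DEFAULT]\nenabled_backends = az0:rbd"
    apis = {
        "az0": "[DEFAULT]\nenabled_backends = az0:rbd",
        "az2": "[DEFAULT]\nenabled_backends = az0:rbd,az2:rbd",
    }
    # Every instance defines its own config, so the top-level one is never used.
    result = effective_custom_service_config(top, apis)
    print(all(result[name] == apis[name] for name in apis))
```

When every instance under glanceAPIs carries its own block, as in this PR, the top-level value has no effect on the result, which is why it can be dropped without changing the deployment.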

fmount · 2024-10-23

Technically it doesn't block your deployment because, as per the link I mentioned earlier, the right config ends up on the right instance, but I think this is simply redundant and should be patched.

fultonj · 2024-10-23

I agree, but per our discussion I will submit a follow-up patch to address this.

fmount · 2024-10-23

Then I'm ok to move forward with this.

fultonj

I'm +2 on merging this.

fmount

/lgtm

abays

/lgtm
/approve

fultonj

/approve
/lgtm

openshift-ci · 2024-10-24T18:43:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays, fmount, fultonj, sbekkerm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [abays,fultonj]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

softwarefactory-project-zuul · 2024-10-24T18:47:29Z

Build succeeded (gate pipeline).
https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d2ccfc4d99f44bfe825442753996e878

✔️ noop SUCCESS in 0s
✔️ rhoso-architecture-validate-dcn SUCCESS in 3m 28s

fultonj · 2024-10-25T12:30:37Z

/cherrypick 18.0-fr1

openshift-cherrypick-robot · 2024-10-25T12:31:15Z

@fultonj: new pull request created: #426

fultonj · 2024-10-25T13:02:42Z

> > @sbekkerm @fultonj One more question, I am not sure what the final decision was but do we want to include multicell (cell per site) into this DCN DT? It's just adding some extra configuration for the control plane which I can do.
>
> @krcmarik do you want to try that in a stacked patch.
> i.e. create a new PR which has been rebased on this one so that we can merge this one sooner.

We are using cells now. We can do follow ups for further changes.

To clarify, we have cells, but we're not using cells per site. I think we should address that in a follow up patch.

openshift-ci bot requested review from leifmadsen and raukadah September 9, 2024 18:03

openshift-ci bot added the needs-ok-to-test label Sep 9, 2024

fultonj self-requested a review September 11, 2024 12:15

abays reviewed Sep 16, 2024
examples/va/dcn/README.md (outdated, resolved)

abays reviewed Sep 16, 2024
automation/vars/dcn.yaml (outdated, resolved)

sbekkerm changed the title ~~Introduce DCN VA~~ Introduce DCN DT Sep 17, 2024

krcmarik reviewed Sep 21, 2024

krcmarik reviewed Oct 2, 2024
examples/dt/dcn/service-values.yaml (resolved)
examples/dt/dcn/service-values.yaml.j2 (outdated, resolved)

sbekkerm force-pushed the dcn branch from b469f9a to 1953534 October 2, 2024 08:00

krcmarik reviewed Oct 8, 2024
examples/dt/dcn/service-values.yaml (outdated, resolved)
examples/dt/dcn/service-values.yaml (outdated, resolved)

krcmarik reviewed Oct 8, 2024
examples/dt/dcn/service-values.yaml.j2 (outdated, resolved)
examples/dt/dcn/service-values.yaml.j2 (outdated, resolved)

fmount requested changes Oct 9, 2024
examples/dt/dcn/service-values.yaml (outdated, resolved)
examples/dt/dcn/service-values.yaml (outdated, resolved)
examples/dt/dcn/service-values.yaml (outdated, resolved)
examples/dt/dcn/service-values.yaml (outdated, resolved)

openshift-ci bot assigned fmount Oct 9, 2024

fmount reviewed Oct 9, 2024
examples/dt/dcn/service-values.yaml (resolved)
examples/dt/dcn/service-values.yaml (resolved)

fultonj reviewed Oct 9, 2024
examples/dt/dcn/README.md (outdated, resolved)

sbekkerm force-pushed the dcn branch from 18e2036 to 8f2f12a October 22, 2024 13:56

fmount reviewed Oct 23, 2024

fmount self-requested a review October 23, 2024 10:33

fmount approved these changes Oct 23, 2024

openshift-ci bot added the lgtm label Oct 23, 2024

fmount removed the lgtm label Oct 23, 2024

fmount self-requested a review October 23, 2024 11:08

fultonj mentioned this pull request Oct 23, 2024
Introduce ci_dcn_site role openstack-k8s-operators/ci-framework#2458 (Merged)

fultonj reviewed Oct 24, 2024

fmount approved these changes Oct 24, 2024

openshift-ci bot added the lgtm label Oct 24, 2024

Introduce support for DCN DT (710f7ad)

fultonj force-pushed the dcn branch from 8f2f12a to 710f7ad October 24, 2024 15:44

openshift-ci bot removed the lgtm label Oct 24, 2024

abays approved these changes Oct 24, 2024

openshift-ci bot assigned abays Oct 24, 2024

openshift-ci bot added lgtm approved labels Oct 24, 2024

fultonj approved these changes Oct 24, 2024

openshift-ci bot assigned fultonj Oct 24, 2024

fultonj removed the needs-ok-to-test label Oct 24, 2024

softwarefactory-project-zuul bot merged commit 2b53bd2 into openstack-k8s-operators:main Oct 24, 2024
9 checks passed

openshift-cherrypick-robot mentioned this pull request Oct 25, 2024
[18.0-fr1] Introduce DCN DT #426 (Merged)

sbekkerm deleted the dcn branch October 25, 2024 14:36
