Merge pull request #6 from rfisher001/lab-4.12To4.14

Updated OCP upgrade prep page

rfisher001 authored Feb 9, 2024
2 parents 7a345b7 + 902b8ac commit 22b8f63

Showing 3 changed files with 188 additions and 28 deletions.
6 changes: 3 additions & 3 deletions documentation/modules/ROOT/pages/CNF-Upgrade-Prep.adoc
[#cnf-req-doc]
== CNF Requirements Document

Before you go any further, please read through the https://test-network-function.github.io/cnf-best-practices-guide/[CNF requirements document].
This section discusses a few of the most important points, but the CNF Requirements Document contains additional
detail and other important topics.

True high availability requires a duplication of a process to be running on separate nodes.

During an upgrade, anti-affinity is important so that there aren’t too many pods on a node when it is time for it to reboot. For example: if there are 4 pods from a single deployment on a node, and the PDB is set to only allow 1 pod to be deleted at a time, then it will take 4 times as long for that node to reboot because it will be waiting on all 4 pods to be deleted.
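
To make this concrete, here is a minimal sketch of a PodDisruptionBudget paired with pod anti-affinity. The deployment name `example-cnf`, its labels and the image are illustrative assumptions only, not values taken from this lab.

[source, yaml]
----
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-cnf-pdb
spec:
  maxUnavailable: 1              # only one pod of this app may be drained at a time
  selector:
    matchLabels:
      app: example-cnf
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cnf
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-cnf
  template:
    metadata:
      labels:
        app: example-cnf
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-cnf
            topologyKey: kubernetes.io/hostname   # spread the replicas across nodes
      containers:
      - name: example-cnf
        image: registry.example.com/example-cnf:latest
----

With `maxUnavailable: 1` and the anti-affinity rule spreading the 4 replicas across nodes, a node drain only ever has to wait on a single pod of this application.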

== Liveness / Readiness / Startup Probes

OpenShift and Kubernetes have some built-in features that not everyone takes advantage of called
https://docs.openshift.com/container-platform/4.12/applications/application-health.html[liveness, readiness and startup probes].
These are very important when pod deployments are dependent upon keeping state for their application. This document
won’t go into detail regarding these probes but please review the https://docs.openshift.com/container-platform/4.12/applications/application-health.html[OpenShift documentation]
on how to implement their use.
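
For reference, here is a minimal sketch of what the three probes can look like on a container spec. The endpoints, port and timing values are assumptions for illustration; consult the OpenShift documentation linked above for the full set of options.

[source, yaml]
----
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cnf
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-cnf
  template:
    metadata:
      labels:
        app: example-cnf
    spec:
      containers:
      - name: example-cnf
        image: registry.example.com/example-cnf:latest
        ports:
        - containerPort: 8080
        startupProbe:                 # gives a slow-starting application time before the other probes run
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
        readinessProbe:               # removes the pod from service endpoints until it reports ready
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 10
        livenessProbe:                # restarts the container if it stops responding
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
----
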
206 changes: 183 additions & 23 deletions documentation/modules/ROOT/pages/OCP-upgrade-prep.adoc
include::_attributes.adoc[]
[#firmware-compatibility]
== Firmware compatibility

Firmware is NOT part of the OpenShift upgrade and cannot be controlled through OpenShift. However, in telco we run most of our clusters on bare metal, which means that we need to be very mindful of the firmware level on all of our hardware.

All hardware vendors will advise that it is always best to be on their latest certified version of firmware for their
hardware. In the telco world this comes with a trust-but-verify approach due to the high-throughput nature of telco
CNFs. Therefore, it is important to have a regimented group who can test the current and latest firmware from any vendor
to make sure that all components will work with both. It is not always recommended to upgrade firmware in conjunction
with an OCP upgrade; however, if it is possible to test the latest release of firmware, that will improve the odds that
you won’t run into issues down the road.

Whether to upgrade firmware is an important decision because the process can be very intrusive and has the potential to
leave a node requiring manual intervention before it will come back online. On the other hand, it may be imperative to
upgrade the firmware due to security fixes, new required functionality, or compatibility with the new release of OCP.
[source, bash]
----
chapter2                                 gitlab-operator-kubernetes.v0.17.2   GitL
openshift-operator-lifecycle-manager     packageserver                        Package Server   0.19.0   Succeeded
----

=== OLM Operator compatibility
There is a set of Red Hat Operators that are NOT part of the cluster operators, otherwise known as the OLM-installed Operators. To determine the compatibility of these OLM-installed Operators, there is a web-based tool that can be used to determine which versions of OCP are compatible with specific releases of an Operator. This tool is meant to tell you whether you need to upgrade an Operator after each Y-Stream upgrade or whether you can wait until you have fully upgraded to the next EUS release.
In Step 9 under the “Upgrade Process Flow” section you will find additional information regarding what you need to do if an Operator needs to be upgraded after performing the first Y-Stream control plane upgrade.

NOTE: Some Operators are compatible with several releases of OCP, so you may not need to upgrade until you complete the cluster upgrade. This is shown in Step 13 of the Upgrade Process Flow.
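
To gather the list of OLM-installed Operators and the channels they follow before checking them against that tool, commands along these lines can be used (the `custom-columns` selection is simply one convenient view, not a required step):

[source, bash]
----
# List the installed Operators (CSVs) in all namespaces
oc get csv -A

# List the Subscriptions and the channel each Operator follows
oc get subscriptions -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CHANNEL:.spec.channel
----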

=== Non-Red Hat OLM Operators
For all OLM-installed Operators that are NOT directly supported by Red Hat, please contact the supporting vendor to confirm release compatibility.


[#prepare-mcp]
== Prepare MCPs

Prepare your Machine Config Pool (MCP) labels by grouping your nodes, depending on the number of nodes in your cluster. MCPs should be split up into 8 to 10 nodes per group. However, there is no hard-and-fast rule as to how many nodes need to be in each MCP. The purpose of these MCPs is to group nodes together so that the upgrade and reboot of a group of nodes can be controlled independently of the rest. Additional information and examples can be found here, under the https://docs.openshift.com/container-platform/4.14/updating/updating_a_cluster/eus-eus-update.html#updating-eus-to-eus-upgrade-cli_eus-to-eus-update[EUS to EUS upgrade documentation].
These MCPs will be used to un-pause a set of nodes during the upgrade process, thus allowing them to be upgraded and rebooted at a determined time instead of at the pleasure of the scheduler. Please review the upgrade process flow section, below, for more details on the pause/un-pause process.

[#what-purpose-upgrade-some]
=== What is the purpose of only upgrading some nodes?

During an upgrade there is always a chance that there will be a problem. Most often the problem is related to hardware failing or needing to be reset. If a problem occurs, having a set of MCPs in a paused state allows the cluster administrator to make sure there are enough nodes running at all times to keep all applications running. The most important thing is to make sure there are enough resources for all application pods.

=== How should worker nodes be divided into MCPs?

This can vary depending on how many nodes are in the cluster or how many nodes are in a node role. By default the 2 roles in a cluster are master and worker. However, in telco clusters we quite often split the worker nodes into 2 separate roles: control plane and data plane. The most important thing is to add MCPs that split out the nodes in each of these two groups.

For example: +
We have 15 worker nodes in a cluster +
10 worker nodes are in the control plane role +
5 worker nodes are in the data plane role +

In this cluster you should split both the CNF control plane and data plane worker node roles into at least 2 MCPs each. +
The advantage of having 2 MCPs is that one set of nodes remains unaffected by the upgrade if you need a stopping point.

An example of a larger cluster might have as many as 100 worker nodes in the control plane role. In this case you would want to make sure that you keep each MCP to around 10 nodes. This will allow you to either un-pause multiple MCPs at a time to speed up the upgrade or to spread the upgrade over at least 2 maintenance windows.
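
If it helps with planning the split, you can count the worker nodes in each role with something like the following; the `worker-cp` and `worker-dp` role labels are assumptions used for illustration and will differ in your cluster:

[source, bash]
----
# Count the worker nodes in each custom role (role names are examples)
oc get nodes -l node-role.kubernetes.io/worker-cp --no-headers | wc -l
oc get nodes -l node-role.kubernetes.io/worker-dp --no-headers | wc -l
----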

Here is a depiction of what this would look like:

// insert image for MCP
.Worker node MCPs in a 5 rack cluster
image::5Rack-MCP.jpg[]

The division and size of these MCPs can vary depending on many factors. In general the standard division in large clusters is between 8 and 10 nodes per MCP, to allow the operations team to control how many nodes are taken down at a time.

Here is another example but with a medium cluster. +
There are a total of 6 nodes in this role. The MCPs inside of this role are split into 2-node MCPs.
The smaller MCPs allow 2 nodes to be upgraded, which can then sit for a day to allow for verification of CNF compatibility before completing the upgrade on the other 4 nodes.

.Separate MCPs in a medium cluster into 3 MCPs
image::LBorHT-MCP.jpg[]

// insert image for MCP
.Small cluster worker MCPs
image::Worker-MCP.jpg[]

This is an example of a smaller cluster with 1 rack.
The process and pace at which you un-pause the MCPs is determined by your CNFs and their configuration. Please review the sections on PDB and anti-affinity for CNFs. If your CNF can properly handle scheduling within an OpenShift cluster, you can un-pause several MCPs at a time and set MaxUnavailable to as high as 50%. This will allow as many as half of the nodes in those MCPs to restart and upgrade, which reduces the amount of time needed for a specific maintenance window and allows your cluster to upgrade quickly.
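
As a sketch of what that looks like in practice, the pool can be patched as shown below. The pool name `mcp-1` matches the examples later on this page and the 50% value is only an illustration:

[source, bash]
----
# Allow up to half of the nodes in this pool to update at once, then un-pause it
oc patch mcp mcp-1 --type merge --patch '{"spec":{"maxUnavailable":"50%"}}'
oc patch mcp mcp-1 --type merge --patch '{"spec":{"paused":false}}'
----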

=== Review your cluster for available MCPs and nodes

First you can run “oc get mcp” to show your current list of MCPs:

[source, bash]
----
# oc get mcp
NAME     CONFIG                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b…e83    True      False      False      3              3                   3                     0                      25d
worker   rendered-worker-2….c4f   True      False      False      2              2                   2                     0                      25d
----

Now list out all of the nodes in your cluster:

NOTE: Review what is listed in the “ROLES” column; this will be updated as we move through this process.

[source, bash]
----
# oc get no
NAME STATUS ROLES AGE VERSION
euschannel-ctlplane-0.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-ctlplane-1.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-ctlplane-2.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-worker-0.test.corp Ready worker 25d v1.23.12+8a6bfe4
euschannel-worker-1.test.corp Ready worker 25d v1.23.12+8a6bfe4
----

Determine, from the above suggestions, how you would like to separate out your worker nodes into machine config pools (MCP).

=== Create your MCPs

This is a 2-step process:

. We add an MCP label to each node
. We apply the MCPs to the cluster, which will organize the nodes based on their labels

NOTE: In the following example there are only 2 nodes and 2 MCPs. Therefore, each MCP contains only 1 node.

==== Labeling nodes

We first need to label the nodes so that they can be put into MCPs. We will do this with the following commands:

[source, bash]
----
oc label node euschannel-worker-0.test.corp node-role.kubernetes.io/mcp-1=
oc label node euschannel-worker-1.test.corp node-role.kubernetes.io/mcp-2=
----

NOTE: The labels will show up when you run the “oc get node” command:

[source, bash]
----
# oc get no
NAME STATUS ROLES AGE VERSION
euschannel-ctlplane-0.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-ctlplane-1.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-ctlplane-2.test.corp Ready master 25d v1.23.12+8a6bfe4
euschannel-worker-0.test.corp Ready mcp-1,worker 25d v1.23.12+8a6bfe4
euschannel-worker-1.test.corp Ready mcp-2,worker 25d v1.23.12+8a6bfe4
----

==== Applying MCPs according to label

Now you need to create YAML files that define an MCP for each label in your cluster. Each MCP can be in a separate file or a separate section of the same file (as shown below).

Here is one example:

[source, yaml]
----
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-2
spec:
  machineConfigSelector:
    matchExpressions:
      - {
          key: machineconfiguration.openshift.io/role,
          operator: In,
          values: [worker, mcp-2]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-2: ""
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-1
spec:
  machineConfigSelector:
    matchExpressions:
      - {
          key: machineconfiguration.openshift.io/role,
          operator: In,
          values: [worker, mcp-1]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-1: ""
----

Apply or create the MCPs with:

[source, bash]
----
# oc apply -f mcps.yaml
machineconfigpool.machineconfiguration.openshift.io/mcp-2 created
----

==== Monitor MCP formation
Now you can run “oc get mcp” again and your new MCPs will show. It will take a few minutes for your nodes to move into the new MCPs that they are assigned to. However, the nodes will NOT reboot during this time.

NOTE: You will still see the original worker and master MCPs that are part of the cluster.

This is what it will look like right after you apply the MCPs:
[source, bash]
----
# oc get mcp
NAME     CONFIG                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b…e83   True      False      False      3              3                   3                     0                      25d
mcp-1    rendered-mcp-1-2…c4f    False     True       True       1              0                   0                     0                      10s
mcp-2    rendered-mcp-2-2…c4f    False     True       True       1              0                   0                     0                      10s
worker   rendered-worker-2…c4f   False     True       True       0              0                   0                     2                      25d
----

This is what it will look like after the MCPs have been completely applied:
[source, bash]
----
# oc get mcp
NAME     CONFIG                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b…e83   True      False      False      3              3                   3                     0                      25d
mcp-1    rendered-mcp-1-2…c4f    True      False      False      1              1                   1                     0                      7m33s
mcp-2    rendered-mcp-2-2…c4f    True      False      False      1              1                   1                     0                      51s
worker   rendered-worker-2…c4f   True      False      False      0              0                   0                     0                      25d
----
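
If you would rather not poll `oc get mcp` by hand, an optional convenience is to block until each new pool reports `Updated`, for example:

[source, bash]
----
# Optional: wait for each new pool to finish reconciling (gives up after 10 minutes)
for pool in mcp-1 mcp-2; do
  oc wait mcp/${pool} --for=condition=Updated --timeout=10m
done
----
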
4 changes: 2 additions & 2 deletions site.yml
content:

asciidoc:
  attributes:
    release-version: 4.12-4.14
    page-pagination: true
  extensions:
    - ./lib/tab-block.js
ui:
contents: "static_files: [ .nojekyll ]"

output:
  dir: ./gh-pages