This repository contains Ansible scripts for deploying an OpenShift cluster with NVIDIA vGPU on VMware to Equinix Metal servers. Currently, only Single Node OpenShift (SNO) is supported.
- An Equinix Metal account for provisioning a bare metal server with a GPU model supported by NVIDIA vGPU for vSphere 7.0.
- An existing Equinix Metal project where you can provision servers.
- An NVIDIA account with access to the vGPU packages for vSphere 7.0.
- A VMware account with access to ESXi and vSphere installation packages, e.g. the evaluation versions (make sure you're logged in at VMware Customer Connect before accessing the link).
- A Red Hat account with access to assisted installer OpenShift clusters.
- Prepare the required artifacts in AWS S3-compatible storage.
- Provision a bare metal machine that has a vGPU-compatible NVIDIA GPU.
- Install VMware ESXi 7.0 on the bare metal machine.
- Install and configure the VMware vSphere 7.0 virtual appliance.
- Install the NVIDIA vGPU host driver for ESXi from a vSphere Installation Bundle (VIB) and configure it.
- Install an OpenShift cluster inside vSphere virtual machines.
- Add a vGPU device to the OpenShift cluster's workers.
- Deploy the NVIDIA GPU Operator on the OpenShift cluster. The cluster will now have access to the vGPU.
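Once the playbooks finish, you can sanity-check the vGPU from inside the cluster. A minimal sketch, assuming the GPU Operator's default `nvidia-gpu-operator` namespace and driver daemonset name (both may differ depending on the operator version and configuration):

```shell
# List the GPU Operator pods, then run nvidia-smi inside the driver pod;
# it should report a GRID vGPU device.
oc get pods -n nvidia-gpu-operator
oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi
```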
For manual steps for deploying OpenShift with vGPU on VMware vSphere, refer to the OpenShift Container Platform on VMware vSphere with NVIDIA vGPUs guide.
The Equinix Metal server plan with a vGPU-compatible GPU is g2.large.x86, but ESXi 7.0 isn't available for it out of the box. The workaround is to provision the machine with ESXi 6.5 and then upgrade it to ESXi 7.0. VMware vSphere installation is explained in detail in Multi-node vSphere with vSan support. Information on upgrading ESXi nodes can be found in Deploy VMware ESXi 6.7 on Packet Bare Metal Servers.
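As a rough sketch, the in-place upgrade can be done from the ESXi 6.5 host's shell (the depot URL and image profile name below are assumptions; pick the profile matching the ESXi 7.0 build you need):

```shell
# Allow the host to reach the VMware online depot
esxcli network firewall ruleset set -e true -r httpClient

# Update to an ESXi 7.0 image profile (profile name is an example)
esxcli software profile update \
  -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml \
  -p ESXi-7.0U3d-19482537-standard

reboot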
During installation, the scripts will need access to the following artifacts stored in S3-compatible object storage.
- VMware vCenter Virtual Appliance ISO, e.g. VMware-VCSA-all-7.0.3-19234570.iso.
- NVIDIA vGPU host driver for VMware ESXi 7.0, e.g. NVD-AIE_510.47.03-1OEM.702.0.0.17630552_19298122.zip. Extract the driver from the NVIDIA AI Enterprise Software for VMware vSphere 7.0 package you have downloaded from NVIDIA Licensing Portal.
In case of a multi-node VMware cluster with vSAN, also extract the following files from vsan-sdk-python.zip:
- bindings/vsanmgmtObjects.py
- samplecode/vsanapiutils.py
You can store the files in an AWS S3 bucket, or a locally deployed S3 server (e.g. Minio).
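For example, you could extract the host driver and upload the artifacts with the AWS CLI (the outer NVIDIA AI Enterprise archive name is an assumption; use the file you actually downloaded, and the bucket you will reference in the `s3_bucket` variable below):

```shell
# Extract the ESXi host driver component from the NVIDIA AIE package
unzip NVIDIA-AI-Enterprise-vSphere-7.0-510.47.03.zip

# Upload the artifacts to the bucket
aws s3 cp VMware-VCSA-all-7.0.3-19234570.iso s3://<bucket_with_artifacts>/
aws s3 cp NVD-AIE_510.47.03-1OEM.702.0.0.17630552_19298122.zip s3://<bucket_with_artifacts>/
```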
- Install the required Ansible dependencies:

  ```shell
  ansible-galaxy install -r requirements.yml
  ```

- Obtain an OpenShift offline token from https://console.redhat.com/openshift/token.

- Download a pull secret file from https://console.redhat.com/openshift/install/pull-secret.

- Create a YAML file (e.g. vars.yml) with the following mandatory parameters:

  ```yaml
  # Object storage
  s3_url: https://s3.amazonaws.com # or http://<minio_ip>:9000 for Minio
  s3_bucket: <bucket_with_artifacts>
  s3_access_key: <access_key>
  s3_secret_key: <secret_key>

  # Equinix Metal
  equinix_metal_api_key: <api_key>
  equinix_metal_project_id: <existing_project_id>
  equinix_metal_hostname_prefix: <prefix> # identify servers in a shared project, e.g. your username and/or OpenShift cluster name

  # OpenShift
  ocm_offline_token: <offline_token>
  pull_secret_path: <path/to/pull_secret>
  openshift_base_domain: <base_dns_domain>

  # NVIDIA
  # from https://ngc.nvidia.com/setup/api-key
  ngc_api_key: <ngc_api_key>
  ngc_email: <[email protected]>
  nls_client_token: <nls_client_license_token> # see https://docs.nvidia.com/license-system/latest/pdf/nvidia-license-system-user-guide.pdf
  ```
Other variables that can be changed are declared in the playbooks.
- Specify a `local_temp_dir` that will store the Terraform state, temporary configuration, and credential files.
- Specify an `openshift_cluster_name` for your OpenShift cluster.

```shell
ansible-playbook sno.yml -e "@path/to/vars.yml" -e local_temp_dir="/path/to/temp/dir" -e openshift_cluster_name="<cluster_name>"
```
WARNING: Before running the Ansible playbook, make sure you do not have an old cluster named `<openshift_cluster_name>` in your Assisted Clusters.
Take note of the parameters for connecting to the VMware vSphere cluster, OpenShift node(s), etc. For example:
```json
{
    "msg": [
        "Bastion host: 147.28.143.219",
        "vCenter IP: 147.28.142.42",
        "vCenter credentials in tmp/vcenter.txt",
        "VPN connection details in tmp/vpn.txt"
    ]
}
```
By default, the vCenter appliance and OpenShift cluster are deployed with a public network, which means they are accessible from the Internet. Although this might be convenient, it is probably not the best choice from a security point of view. You can deploy the VMs in private networks by setting `use_private_networks=true`, and access the vCenter API and the OpenShift cluster only via the bastion host. When using private networks, the OpenShift cluster's kubeconfig, credentials, and SSH key pair will be saved to a directory on the bastion host.
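With private networks enabled, one way to reach the cluster is an SSH tunnel through the bastion. A sketch, assuming SSH access as root and that the cluster's API hostname resolves from the bastion (the exact user, directory, and hostname come from the playbook output):

```shell
# Copy the cluster credentials saved on the bastion
# (the directory on the bastion is a placeholder; see the playbook output)
scp -r root@<bastion_ip>:/path/to/cluster/files ./

# Forward the OpenShift API port through the bastion
ssh -N -L 6443:api.<cluster_name>.<base_dns_domain>:6443 root@<bastion_ip>
```

Note that reaching the API via localhost will trip TLS hostname verification; adding an /etc/hosts entry pointing api.<cluster_name>.<base_dns_domain> at 127.0.0.1 avoids that.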
It is possible to deploy multiple vSphere clusters, each with its own OpenShift cluster:
- Set the `local_temp_dir` variable so that the Terraform state and everything else related to the new cluster is stored in its own new directory and does not reuse an existing directory.
- Set an `openshift_cluster_name` that does not conflict with an existing assisted OpenShift cluster.
WARNING: You will need to specify the same directory and cluster name when destroying the cluster, otherwise you may destroy another cluster by mistake.
In most cases it should be enough to:
- Destroy the provisioned Equinix Metal machines:

  ```shell
  terraform destroy --var-file=../tmp/terraform.tfvars
  ```

- Delete the SNO cluster from the Red Hat Cloud Console.
You can also clean up automatically by running:
```shell
ansible-playbook destroy.yml -e "@path/to/vars.yml" -e local_temp_dir="/path/to/temp/dir" -e openshift_cluster_name="<cluster_name>"
```
Terraform destroy logs will be saved in `<temp_directory>/terraform-destroy.stdout[.timestamp]` and `<temp_directory>/terraform-destroy.stderr[.timestamp]`.
IMPORTANT: In order for the GPU Operator to work correctly with a vGPU, a supported vGPU type must be selected in the `vgpu_profile` variable.
For details on vGPU types refer to Listing Supported vGPU Types.
The list of available vGPU type values can be obtained via the vCenter GUI, or by running the following command on an ESXi host after configuring the host graphics:

```shell
vim-cmd hostsvc/hostconfig | grep -A 20 sharedPassthruGpuTypes
```
Choose a vGPU type that supports CUDA, i.e. a C- or Q-series vGPU on Tesla V100, which is the GPU offered by the Equinix Metal machine type we use. The following types are available on Tesla V100 at the time of this writing:
- grid_v100dx-1b
- grid_v100dx-2b
- grid_v100dx-1b4
- grid_v100dx-2b4
- grid_v100dx-1q
- grid_v100dx-2q
- grid_v100dx-4q
- grid_v100dx-8q
- grid_v100dx-16q
- grid_v100dx-32q
- grid_v100dx-1a
- grid_v100dx-2a
- grid_v100dx-4a
- grid_v100dx-8a
- grid_v100dx-16a
- grid_v100dx-32a
- grid_v100dx-4c
- grid_v100dx-8c
- grid_v100dx-16c
- grid_v100dx-32c
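For example, to deploy with a C-series profile that has a 4 GB frame buffer, you could pass `vgpu_profile` as an extra variable alongside the others (a sketch; the variable's default and exact usage are declared in the playbooks):

```shell
ansible-playbook sno.yml -e "@path/to/vars.yml" \
  -e local_temp_dir="/path/to/temp/dir" \
  -e openshift_cluster_name="<cluster_name>" \
  -e vgpu_profile="grid_v100dx-4c"
```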