Deploying an OpenShift Cluster with vGPU on VMware vSphere

This repository contains Ansible scripts that deploy an OpenShift cluster with NVIDIA vGPU on VMware vSphere, running on Equinix Metal servers. Currently, only Single Node OpenShift (SNO) is supported.

Requirements

An Equinix Metal account for provisioning a bare metal server with a GPU model supported by NVIDIA vGPU for vSphere 7.0.
An existing Equinix Metal project where you can provision servers.
An NVIDIA account with access to the vGPU packages for vSphere 7.0.
A VMware account with access to ESXi and vSphere installation packages, e.g. the evaluation versions (make sure you're logged in at VMware Customer Connect before accessing the link).
A Red Hat account with access to assisted installer OpenShift clusters.

Steps Overview

Prepare the required artifacts in S3-compatible object storage.
Provision a bare metal machine that has a vGPU-compatible NVIDIA GPU.
Install VMware ESXi 7.0 on the bare metal machine.
Install and configure the VMware vSphere 7.0 virtual appliance.
Install the vGPU host driver for ESXi from a vSphere Installation Bundle (VIB), and configure it.
Install an OpenShift cluster inside vSphere virtual machines.
Add a vGPU device to the OpenShift cluster's workers.
Deploy the NVIDIA GPU Operator on the OpenShift cluster. The cluster will now have access to the vGPU.

To deploy OpenShift with vGPU on VMware vSphere manually, refer to the OpenShift Container Platform on VMware vSphere with NVIDIA vGPUs guide.

VMware vSphere 7.0 on Equinix Metal

The Equinix Metal server plan with a vGPU-compatible GPU is g2.large.x86, but ESXi 7.0 isn't available for it out of the box. The workaround is to provision the machine with ESXi 6.5 and then upgrade it to ESXi 7.0. The VMware vSphere installation is explained in detail in Multi-node vSphere with vSan support. Information on upgrading ESXi nodes can be found in Deploy VMware ESXi 6.7 on Packet Bare Metal Servers.

Required VMware & NVIDIA Artifacts

During installation, the scripts will need access to the following artifacts stored in S3-compatible object storage.

VMware vCenter Virtual Appliance ISO, e.g. VMware-VCSA-all-7.0.3-19234570.iso.
NVIDIA vGPU host driver for VMware ESXi 7.0, e.g. NVD-AIE_510.47.03-1OEM.702.0.0.17630552_19298122.zip. Extract the driver from the NVIDIA AI Enterprise Software for VMware vSphere 7.0 package you have downloaded from NVIDIA Licensing Portal.

In case of a multi-node VMware cluster with vSAN, also extract the following files from vsan-sdk-python.zip:

bindings/vsanmgmtObjects.py
samplecode/vsanapiutils.py

You can store the files in an AWS S3 bucket or on a locally deployed S3-compatible server (e.g. MinIO).
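
Before running the playbooks, it can help to confirm that the bucket holds every artifact listed above. The following is a minimal sketch, not part of this repository: the function name, the labels, and the file name patterns are assumptions derived from the example file names in this section, and the list of object keys is expected to come from whatever S3 client you use.

```python
import fnmatch

# Patterns are assumptions based on the example file names above;
# adjust them to the versions you actually downloaded.
REQUIRED_PATTERNS = {
    "vCenter appliance ISO": "VMware-VCSA-all-7.0.*.iso",
    "vGPU host driver": "NVD-AIE_*.zip",
}
# Only needed for a multi-node VMware cluster with vSAN.
VSAN_PATTERNS = {
    "vSAN management objects": "vsanmgmtObjects.py",
    "vSAN API utilities": "vsanapiutils.py",
}


def missing_artifacts(object_keys, multi_node=False):
    """Return the labels of required artifacts not found among object_keys."""
    patterns = dict(REQUIRED_PATTERNS)
    if multi_node:
        patterns.update(VSAN_PATTERNS)
    # Compare on the base name so keys stored under a prefix still match.
    names = [key.rsplit("/", 1)[-1] for key in object_keys]
    return [
        label
        for label, pattern in patterns.items()
        if not any(fnmatch.fnmatchcase(name, pattern) for name in names)
    ]


if __name__ == "__main__":
    keys = [
        "VMware-VCSA-all-7.0.3-19234570.iso",
        "NVD-AIE_510.47.03-1OEM.702.0.0.17630552_19298122.zip",
    ]
    print(missing_artifacts(keys, multi_node=True))
```

An empty result means all expected artifacts were found; any returned label names an artifact that still needs to be uploaded.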

Setup

Install the required Ansible dependencies:

ansible-galaxy install -r requirements.yml

Obtain an OpenShift offline token from https://console.redhat.com/openshift/token.
Download a pull secret file from https://console.redhat.com/openshift/install/pull-secret.

Create a YAML file (e.g. vars.yml) with the following mandatory parameters:

# Object storage
s3_url: https://s3.amazonaws.com # or http://<minio_ip>:9000 for Minio
s3_bucket: <bucket_with_artifacts>
s3_access_key: <access_key>
s3_secret_key: <secret_key>

# Equinix Metal
equinix_metal_api_key: <api_key>
equinix_metal_project_id: <existing_project_id>

equinix_metal_hostname_prefix: <prefix> # identify servers in a shared project, e.g. your username and/or OpenShift cluster name

# OpenShift
ocm_offline_token: <offline_token>
pull_secret_path: <path/to/pull_secret>
openshift_base_domain: <base_dns_domain>

# NVIDIA
# from https://ngc.nvidia.com/setup/api-key
ngc_api_key: <ngc_api_key>
ngc_email: <email_address>
nls_client_token: <nls_client_license_token> # see https://docs.nvidia.com/license-system/latest/pdf/nvidia-license-system-user-guide.pdf

Other variables that can be changed are declared in the playbooks.
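
A quick check of the variables file can catch a missing parameter or a placeholder that was never replaced before a long deployment starts. The sketch below is an illustration only and is not shipped with this repository: it assumes the file is a flat list of `key: value` lines as in the example above, and the function names are hypothetical. It deliberately avoids a full YAML parser, so nested values are not supported.

```python
import re

# Mandatory parameters, taken from the example vars.yml above.
MANDATORY_KEYS = [
    "s3_url", "s3_bucket", "s3_access_key", "s3_secret_key",
    "equinix_metal_api_key", "equinix_metal_project_id",
    "equinix_metal_hostname_prefix",
    "ocm_offline_token", "pull_secret_path", "openshift_base_domain",
    "ngc_api_key", "ngc_email", "nls_client_token",
]


def parse_flat_vars(text):
    """Parse top-level `key: value` lines; comment and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*):\s*(.*)$", line)
        if not match:
            continue
        key, value = match.groups()
        # Drop a trailing " # comment" and surrounding quotes.
        value = re.sub(r"\s+#.*$", "", value).strip().strip("'\"")
        values[key] = value
    return values


def check_vars(text):
    """Return a list of problems found in the variables file content."""
    values = parse_flat_vars(text)
    problems = []
    for key in MANDATORY_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key}: missing")
        elif value.startswith("<") and value.endswith(">"):
            problems.append(f"{key}: placeholder not replaced")
    return problems


if __name__ == "__main__":
    with open("vars.yml") as handle:  # path is an assumption
        for problem in check_vars(handle.read()):
            print(problem)
```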

Running

Specify a local_temp_dir that will store the Terraform state, temporary configuration, and credential files.
Specify an openshift_cluster_name for your OpenShift cluster.

ansible-playbook sno.yml -e "@path/to/vars.yml" -e local_temp_dir="/path/to/temp/dir" -e openshift_cluster_name="<cluster_name>"

WARNING: Before running the Ansible playbook, make sure you do not have an old cluster named <openshift_cluster_name> in your Assisted Clusters.

When the playbook finishes, take note of the parameters it prints for connecting to the VMware vSphere cluster, OpenShift node(s), etc. For example:

{
    "msg": [
        "Bastion host: 147.28.143.219",
        "vCenter IP: 147.28.142.42",
        "vCenter credentials in tmp/vcenter.txt",
        "VPN connection details in tmp/vpn.txt"
    ]
}

Private vs Public Networks

By default, the vCenter appliance and the OpenShift cluster are deployed on a public network, which makes them reachable from the Internet. This is convenient, but it exposes both to external access. To deploy the VMs on private networks instead, set use_private_networks=true; the vCenter API and the OpenShift cluster are then accessible only via the bastion host. When using private networks, the OpenShift cluster's kubeconfig, credentials and SSH key pair will be saved to a directory on the bastion host.

Deploying Multiple Setups

It is possible to deploy multiple vSphere clusters, each with its own OpenShift cluster:

Set local_temp_dir to a new directory, so that the Terraform state and all other files for the new cluster are kept separate from those of existing clusters.
Set an openshift_cluster_name that does not conflict with an existing assisted OpenShift cluster.

WARNING: You will need to specify the same directory and cluster name when destroying the cluster, otherwise you may destroy another cluster by mistake.
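
One way to reduce the risk described in the warning is to derive both the deploy and the destroy command from the same inputs. The sketch below is only an illustration under stated assumptions: the helper name is hypothetical, and keeping each setup in a subdirectory named after the cluster is a convention chosen for the example, not something the playbooks require.

```python
import shlex
from pathlib import Path


def playbook_command(playbook, vars_file, temp_root, cluster_name):
    """Build an ansible-playbook command line for one setup.

    The temp directory is <temp_root>/<cluster_name> (an assumed convention),
    so deploy and destroy always agree on directory and cluster name.
    """
    temp_dir = Path(temp_root) / cluster_name
    return [
        "ansible-playbook", playbook,
        "-e", f"@{vars_file}",
        "-e", f"local_temp_dir={temp_dir}",
        "-e", f"openshift_cluster_name={cluster_name}",
    ]


# Example values are placeholders.
deploy = playbook_command("sno.yml", "vars.yml", "/tmp/vgpu-setups", "sno-a")
destroy = playbook_command("destroy.yml", "vars.yml", "/tmp/vgpu-setups", "sno-a")

if __name__ == "__main__":
    print(shlex.join(deploy))
    print(shlex.join(destroy))
```

Because only the playbook name differs between the two commands, destroying a setup cannot accidentally point at another cluster's directory.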

Cleanup

In most cases it should be enough to:

Destroy the provisioned Equinix Metal machines

terraform destroy --var-file=../tmp/terraform.tfvars

Delete the SNO cluster from the Red Hat Cloud Console

You can also clean up automatically by running:

ansible-playbook destroy.yml -e "@path/to/vars.yml" -e local_temp_dir="/path/to/temp/dir" -e openshift_cluster_name="<cluster_name>"

Terraform destroy logs will be saved in <temp_directory>/terraform-destroy.stdout[.timestamp] and <temp_directory>/terraform-destroy.stderr[.timestamp].

Selecting a vGPU Type (Profile)

IMPORTANT: In order for the GPU Operator to work correctly with a vGPU, a supported vGPU type must be selected in the vgpu_profile variable.

For details on vGPU types refer to Listing Supported vGPU Types.

The list of available vGPU type values can be obtained via the vCenter GUI, or by running vim-cmd hostsvc/hostconfig | grep -A 20 sharedPassthruGpuTypes on an ESXi host after configuring the host graphics.

Choose a vGPU type that supports CUDA, that is, a C-series or Q-series vGPU on the Tesla V100, which is the GPU offered by the Equinix Metal machine type we use.

At the time of this writing, the following vGPU types are available on the Tesla V100:

grid_v100dx-1b
grid_v100dx-2b
grid_v100dx-1b4
grid_v100dx-2b4
grid_v100dx-1q
grid_v100dx-2q
grid_v100dx-4q
grid_v100dx-8q
grid_v100dx-16q
grid_v100dx-32q
grid_v100dx-1a
grid_v100dx-2a
grid_v100dx-4a
grid_v100dx-8a
grid_v100dx-16a
grid_v100dx-32a
grid_v100dx-4c
grid_v100dx-8c
grid_v100dx-16c
grid_v100dx-32c
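
The C-series and Q-series entries can be picked out of the list above by the series letter that follows the frame buffer size in each name. The following is a small sketch, assuming the naming pattern seen in this list (for example, grid_v100dx-8q is an 8 GB Q-series profile); the function name is hypothetical.

```python
import re

# vGPU types listed above for Tesla V100.
V100_PROFILES = [
    "grid_v100dx-1b", "grid_v100dx-2b", "grid_v100dx-1b4", "grid_v100dx-2b4",
    "grid_v100dx-1q", "grid_v100dx-2q", "grid_v100dx-4q", "grid_v100dx-8q",
    "grid_v100dx-16q", "grid_v100dx-32q",
    "grid_v100dx-1a", "grid_v100dx-2a", "grid_v100dx-4a", "grid_v100dx-8a",
    "grid_v100dx-16a", "grid_v100dx-32a",
    "grid_v100dx-4c", "grid_v100dx-8c", "grid_v100dx-16c", "grid_v100dx-32c",
]

# "-<size><series letter>[optional digits]" at the end of the name.
PROFILE_SUFFIX = re.compile(r"-(\d+)([a-z])\d*$")


def cuda_capable(profiles):
    """Keep only C-series and Q-series vGPU types, preserving order."""
    selected = []
    for profile in profiles:
        match = PROFILE_SUFFIX.search(profile)
        if match and match.group(2) in ("c", "q"):
            selected.append(profile)
    return selected


if __name__ == "__main__":
    for profile in cuda_capable(V100_PROFILES):
        print(profile)
```

Any of the printed names is a candidate value for the vgpu_profile variable.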

