Slurm on Kubernetes with Soperator - Installation Guide

Welcome! This guide will help you set up a Slurm cluster running on Kubernetes using Nebius Cloud. The entire setup process is automated with Terraform, allowing you to deploy your cluster with a single command.

Why Run Slurm on Kubernetes?

Our solution offers several key benefits:

  • Effortless Scaling: Add or remove nodes instantly without manual bootstrapping
  • Built-in High Availability: Automatic pod restarts and self-healing capabilities
  • Unified Storage: Shared root filesystem across all nodes - no more version sync headaches
  • Enhanced Security: Isolated environments prevent accidental system breakage
  • Automated GPU Health Checks: Regular NCCL tests ensure optimal GPU performance

Prerequisites

Before starting, ensure you have the required tools installed. At a minimum, the steps below invoke Terraform, the Nebius CLI, kubectl, and an SSH client.
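As a quick sanity check, you can confirm these CLIs are available on your PATH (the version subcommands below are the standard ones for each tool; the Nebius CLI binary name is assumed to be nebius):

command -v terraform kubectl nebius ssh   # all four should resolve to a path
terraform version
kubectl version --client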

Installation Steps

  1. Create Your Installation Directory
export INSTALLATION_NAME=<your-name> # e.g. customer name
mkdir -p installations/$INSTALLATION_NAME
cd installations/$INSTALLATION_NAME
cp -r ../example/ ./
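After copying the example, the directory should contain the files edited in the following steps; a quick check (assuming the example ships .envrc and terraform.tfvars, both referenced below):

ls -la   # expect to see .envrc and terraform.tfvars among the copied files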
  2. Set Up Your Environment

Set your NEBIUS_TENANT_ID and NEBIUS_PROJECT_ID in the .envrc file, then run:

source .envrc

This command loads environment variables and performs several important setup tasks:

  • Authenticates with the Nebius CLI and exports an IAM token
  • Creates or retrieves a service account for Terraform
  • Configures Object Storage access
  • Exports necessary resource IDs
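You can confirm the variables landed in your shell afterwards. Only NEBIUS_TENANT_ID and NEBIUS_PROJECT_ID are named in this guide; the exact set of other exports depends on your .envrc:

echo "$NEBIUS_TENANT_ID"
echo "$NEBIUS_PROJECT_ID"
env | grep '^NEBIUS_'   # other exports may use different prefixes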
  3. Create Storage Infrastructure

Create a "jail" filesystem in the Nebius Console.


Note

  • For storage > 2 TiB: Contact Nebius Support (in the web console) to enable multitablet functionality
  • Note down the filesystem ID for your Terraform configuration
  4. Configure Your Cluster

Edit terraform.tfvars with your requirements:

# Use your manually created jail filesystem
filestore_jail = {
  existing = {
    id = "computefilesystem-<YOUR-FILESYSTEM-ID>"
  }
}

# Configure GPU cluster
k8s_cluster_node_group_gpu = {
  resource = {
    platform = "gpu-h100-sxm"
    preset   = "8gpu-128vcpu-1600gb"
  }
  gpu_cluster = {
    infiniband_fabric = "fabric-3"
  }
}

# Add your SSH public key here to connect to the Slurm cluster 
slurm_login_ssh_root_public_keys = [
  "ssh-rsa AAAAB3N... your-key"
]

The k8s_cluster_node_ssh_access_users variable is for SSH access to the Kubernetes cluster nodes themselves. You probably don't need it unless you want to manage the Kubernetes cluster manually.

Note

  • For large clusters: Use larger presets for CPU-only nodes
  • Adjust storage sizes based on your needs
  • Contact support to increase quotas if needed
  • Ensure SSH keys are added to the correct location
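If you don't have a key pair yet, you can generate one and paste the public half into slurm_login_ssh_root_public_keys (the ~/.ssh/id_rsa path below matches the one used by login.sh later in this guide, but any key pair works):

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa   # skip if you already have a key pair
cat ~/.ssh/id_rsa.pub                        # paste this single line into slurm_login_ssh_root_public_keys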
  5. Deploy Your Cluster
terraform init
terraform apply # This will take ~40 mins
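If you want to review the changes before (or between) applies, terraform plan (standard Terraform, nothing specific to this setup) prints what would be created without touching anything:

terraform plan   # dry run: lists the resources Terraform is about to create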
  6. (Optional) Verify Kubernetes Setup
  • List kubectl contexts to verify that the new cluster was added
kubectl config get-contexts
  • Set the new context
kubectl config use-context <your-context-name>
  • Verify that you can list the pods in the cluster and that none of them are in an error state (see the one-liner after this list)
kubectl get pods --all-namespaces
  • Verify all resources show green status in the console
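As a shortcut for the pod check above, you can ask kubectl directly for anything that is not running (plain kubectl, not specific to this setup):

kubectl get pods --all-namespaces --field-selector=status.phase!=Running   # ideally empty, aside from Succeeded pods of completed jobs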
  7. Get Cluster Connection Details

Get the Slurm cluster IP address and connect over SSH:

export SLURM_IP=$(terraform state show module.login_script.terraform_data.connection_ip | grep 'input' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
ssh root@$SLURM_IP -i ~/.ssh/<private_id_rsa_key>

or connect using the login script:

./login.sh -k ~/.ssh/id_rsa
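Once connected, standard Slurm commands give a quick health overview (nothing here is specific to Soperator):

sinfo        # partitions and node states; nodes should be idle or alloc, not down or drained
sinfo -N -l  # per-node view with more detail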

(Optional) Test Your Installation

Copy the test files to the Slurm cluster:

cd soperator/test
./prepare_for_quickcheck.sh -u root -k ~/.ssh/<private_id_rsa_key> -a $SLURM_IP

Connect to the Slurm cluster and run the tests:

ssh root@$SLURM_IP
cd /quickcheck
# Basic Slurm test
sbatch hello.sh
tail -f outputs/hello.out    
# GPU interconnect test
sbatch nccl.sh
tail -f outputs/nccl.out
# Container test
sbatch enroot.sh
tail -f outputs/enroot.out
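While any of these jobs is queued or running, you can watch it with standard Slurm commands (replace <jobid> with the ID printed by sbatch):

squeue                     # all jobs currently queued or running
scontrol show job <jobid>  # full details for a single job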