Slurm on Kubernetes with Soperator - Installation Guide

Welcome! This guide will help you set up a Slurm cluster running on Kubernetes using Nebius Cloud. The entire setup process is automated with Terraform, allowing you to deploy your cluster with a single command.

Why Run Slurm on Kubernetes?

Our solution offers several key benefits:

  • Effortless Scaling: Add or remove nodes instantly without manual bootstrapping
  • Built-in High Availability: Automatic pod restarts and self-healing capabilities
  • Unified Storage: Shared root filesystem across all nodes - no more version sync headaches
  • Enhanced Security: Isolated environments prevent accidental system breakage
  • Automated GPU Health Checks: Regular NCCL tests ensure optimal GPU performance

Prerequisites

Before starting, ensure you have the required tools installed. At a minimum, the steps below invoke Terraform, the Nebius CLI, kubectl, and an SSH client.
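As a quick sanity check, you can confirm these CLIs are available on your PATH (the version subcommands below are the standard ones for each tool; the Nebius CLI binary name is assumed to be nebius):

command -v terraform kubectl nebius ssh   # all four should resolve to a path
terraform version
kubectl version --client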

Installation Steps

  1. Create Your Installation Directory
export INSTALLATION_NAME=<your-name> # e.g. customer name
mkdir -p installations/$INSTALLATION_NAME
cd installations/$INSTALLATION_NAME
cp -r ../example/ ./
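After copying the example, the directory should contain the files edited in the following steps; a quick check (assuming the example ships .envrc and terraform.tfvars, both referenced below):

ls -la   # expect to see .envrc and terraform.tfvars among the copied files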
  2. Set Up Your Environment

Set your NEBIUS_TENANT_ID and NEBIUS_PROJECT_ID in the .envrc file, then run:

source .envrc

This command loads environment variables and performs several important setup tasks:

  • Authenticates with the Nebius CLI and exports an IAM token
  • Creates or retrieves a service account for Terraform
  • Configures Object Storage access
  • Exports necessary resource IDs
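You can confirm the variables landed in your shell afterwards. Only NEBIUS_TENANT_ID and NEBIUS_PROJECT_ID are named in this guide; the exact set of other exports depends on your .envrc:

echo "$NEBIUS_TENANT_ID"
echo "$NEBIUS_PROJECT_ID"
env | grep '^NEBIUS_'   # other exports may use different prefixes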
  3. Create Storage Infrastructure

Create a "jail" filesystem in the Nebius Console.


Note

  • For storage > 2 TiB: Contact Nebius Support (in the web console) to enable multitablet functionality
  • Note down the filesystem ID for your Terraform configuration
  4. Configure Your Cluster

Edit terraform.tfvars with your requirements:

# Use your manually created jail filesystem
filestore_jail = {
  existing = {
    id = "computefilesystem-<YOUR-FILESYSTEM-ID>"
  }
}

# Configure GPU cluster
k8s_cluster_node_group_gpu = {
  resource = {
    platform = "gpu-h100-sxm"
    preset   = "8gpu-128vcpu-1600gb"
  }
  gpu_cluster = {
    infiniband_fabric = "fabric-3"
  }
}

# Add your SSH public key here to connect to the Slurm cluster 
slurm_login_ssh_root_public_keys = [
  "ssh-rsa AAAAB3N... your-key"
]

The k8s_cluster_node_ssh_access_users variable is for SSH access to the Kubernetes cluster nodes themselves. You probably don't need it unless you want to manage the Kubernetes cluster manually.

Note

  • For large clusters: Use larger presets for CPU-only nodes
  • Adjust storage sizes based on your needs
  • Contact support to increase quotas if needed
  • Ensure SSH keys are added to the correct location
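If you don't have a key pair yet, you can generate one and paste the public half into slurm_login_ssh_root_public_keys (the ~/.ssh/id_rsa path below matches the one used by login.sh later in this guide, but any key pair works):

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa   # skip if you already have a key pair
cat ~/.ssh/id_rsa.pub                        # paste this single line into slurm_login_ssh_root_public_keys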
  5. Deploy Your Cluster
terraform init
terraform apply # This will take ~40 mins
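If you want to review the changes before (or between) applies, terraform plan (standard Terraform, nothing specific to this setup) prints what would be created without touching anything:

terraform plan   # dry run: lists the resources Terraform is about to create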
  6. (Optional) Verify Kubernetes Setup
  • List kubectl contexts to verify that the new cluster was added
kubectl config get-contexts
  • Set the new context
kubectl config use-context <your-context-name>
  • Verify that you can list the pods in the cluster and that none of them are in an error state (see the one-liner after this list)
kubectl get pods --all-namespaces
  • Verify all resources show green status in the console
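As a shortcut for the pod check above, you can ask kubectl directly for anything that is not running (plain kubectl, not specific to this setup):

kubectl get pods --all-namespaces --field-selector=status.phase!=Running   # ideally empty, aside from Succeeded pods of completed jobs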
  7. Get Cluster Connection Details

Get the Slurm cluster IP address and connect over SSH:

export SLURM_IP=$(terraform state show module.login_script.terraform_data.connection_ip | grep 'input' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
ssh root@$SLURM_IP -i ~/.ssh/<private_id_rsa_key>

or connect using the login script:

./login.sh -k ~/.ssh/id_rsa
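Once connected, standard Slurm commands give a quick health overview (nothing here is specific to Soperator):

sinfo        # partitions and node states; nodes should be idle or alloc, not down or drained
sinfo -N -l  # per-node view with more detail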

(Optional) Test Your Installation

Copy the test files to the Slurm cluster:

cd soperator/test
./prepare_for_quickcheck.sh -u root -k ~/.ssh/<private_id_rsa_key> -a $SLURM_IP

Connect to the Slurm cluster and run the tests:

ssh root@$SLURM_IP
cd /quickcheck
# Basic Slurm test
sbatch hello.sh
tail -f outputs/hello.out    
# GPU interconnect test
sbatch nccl.sh
tail -f outputs/nccl.out
# Container test
sbatch enroot.sh
tail -f outputs/enroot.out
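While any of these jobs is queued or running, you can watch it with standard Slurm commands (replace <jobid> with the ID printed by sbatch):

squeue                     # all jobs currently queued or running
scontrol show job <jobid>  # full details for a single job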