Welcome! This guide will help you set up a Slurm cluster running on Kubernetes using Nebius Cloud. The entire setup process is automated with Terraform, allowing you to deploy your cluster with a single command.
Our solution offers several key benefits:
- Effortless Scaling: Add or remove nodes instantly without manual bootstrapping
- Built-in High Availability: Automatic pod restarts and self-healing capabilities
- Unified Storage: Shared root filesystem across all nodes - no more version sync headaches
- Enhanced Security: Isolated environments prevent accidental system breakage
- Automated GPU Health Checks: Regular NCCL tests ensure optimal GPU performance
Before starting, ensure you have these tools installed:
- Terraform
- Nebius CLI
- kubectl
- jq
- coreutils:
  - macOS:
    ```shell
    brew install coreutils
    ```
  - Ubuntu:
    ```shell
    sudo apt-get install coreutils
    ```
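Before proceeding, a quick shell loop can confirm every tool is on your PATH (plain POSIX shell, nothing cluster-specific):

```shell
# Report any required tool that is missing
for tool in terraform nebius kubectl jq; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```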
- Create Your Installation Directory
```shell
export INSTALLATION_NAME=<your-name> # e.g. customer name
mkdir -p installations/$INSTALLATION_NAME
cd installations/$INSTALLATION_NAME
cp -r ../example/ ./
```
- Set Up Your Environment
Set your `NEBIUS_TENANT_ID` and `NEBIUS_PROJECT_ID` in the `.envrc` file, then run:
```shell
source .envrc
```
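For reference, the two lines you add to `.envrc` look roughly like this (the ID prefixes here are an assumption following Nebius naming conventions; use the real IDs from your console):

```shell
# In .envrc -- placeholders, not real IDs
export NEBIUS_TENANT_ID=tenant-<YOUR-TENANT-ID>
export NEBIUS_PROJECT_ID=project-<YOUR-PROJECT-ID>
```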
Sourcing `.envrc` loads your environment variables and performs several important setup tasks:
- Authenticates with the Nebius CLI and exports an IAM token
- Creates or retrieves the service account used by Terraform
- Configures Object Storage access
- Exports the necessary resource IDs
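After sourcing, you can confirm the environment was populated; this simply lists every `NEBIUS_`-prefixed variable, whatever the script exported:

```shell
# Show all Nebius-related variables now set in the shell
env | grep '^NEBIUS_'
```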
- Create Storage Infrastructure
Create a "jail" filesystem in the Nebius Console.
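If you prefer the CLI to the console, it may also be able to list existing filesystems so you can copy the ID for the next step; the exact subcommand is an assumption, so check `nebius --help` for the real path:

```shell
# Assumed subcommand -- verify with `nebius --help`
nebius compute filesystem list
```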
- Configure Your Cluster
Edit `terraform.tfvars` with your requirements:
```hcl
# Use your manually created jail filesystem
filestore_jail = {
  existing = {
    id = "computefilesystem-<YOUR-FILESYSTEM-ID>"
  }
}

# Configure GPU cluster
k8s_cluster_node_group_gpu = {
  resource = {
    platform = "gpu-h100-sxm"
    preset   = "8gpu-128vcpu-1600gb"
  }
  gpu_cluster = {
    infiniband_fabric = "fabric-3"
  }
}

# Add your SSH public key here to connect to the Slurm cluster
slurm_login_ssh_root_public_keys = [
  "ssh-rsa AAAAB3N... your-key"
]
```
`k8s_cluster_node_ssh_access_users` is for connecting to the K8s cluster itself. You probably don't need this unless you want to manage the K8s cluster manually.
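If you do need it, the value is a list of user entries; the field names below are an assumption, so confirm the exact schema against the example's variables.tf:

```hcl
# Optional: SSH access to the K8s nodes themselves.
# Field names are assumptions -- verify against variables.tf.
k8s_cluster_node_ssh_access_users = [
  {
    name        = "admin"
    public_keys = ["ssh-rsa AAAAB3N... your-key"]
  }
]
```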
Note:
- For large clusters: Use larger presets for CPU-only nodes
- Adjust storage sizes based on your needs
- Contact support to increase quotas if needed
- Ensure SSH keys are added to the correct location
- Deploy Your Cluster
```shell
terraform init
terraform apply # This will take ~40 mins
```
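If you'd rather review the changes before applying them, the standard two-step Terraform flow works here as well:

```shell
terraform plan -out=tfplan  # review the proposed resources first
terraform apply tfplan      # apply exactly the reviewed plan
```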
- (Optional) Verify Kubernetes Setup
- List kubectl contexts to verify that the new cluster was added:
```shell
kubectl config get-contexts
```
- Set the new context:
```shell
kubectl config use-context <your-context-name>
```
- Verify that you can list the pods in the cluster and that none are in an error state:
```shell
kubectl get pods --all-namespaces
```
- Verify all resources show green status in the console
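Rather than re-running `kubectl get pods` by hand, `kubectl wait` can block until pods report Ready (replace `<namespace>` with one of the namespaces the previous command listed):

```shell
# Block for up to 10 minutes until all pods in the namespace are Ready
kubectl wait --for=condition=Ready pods --all -n <namespace> --timeout=10m
```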
- Get Cluster Connection Details
Get the Slurm cluster IP address and connect over SSH:
```shell
export SLURM_IP=$(terraform state show module.login_script.terraform_data.connection_ip | grep 'input' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
ssh root@$SLURM_IP -i ~/.ssh/<private_id_rsa_key>
```
or connect using the login script:
```shell
./login.sh -k ~/.ssh/id_rsa
```
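Once connected, a quick way to confirm the cluster is healthy is to query Slurm itself (standard Slurm commands, run on the login node):

```shell
sinfo                         # partitions and node states; nodes should not be down
scontrol show nodes | head    # per-node details, including configured GPUs (gres)
```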
Copy the test files to the Slurm cluster:
```shell
cd soperator/test
./prepare_for_quickcheck.sh -u root -k ~/.ssh/<private_id_rsa_key> -a $SLURM_IP
```
Connect to the Slurm cluster and run the tests:
```shell
ssh root@$SLURM_IP
cd /quickcheck

# Basic Slurm test
sbatch hello.sh
tail -f outputs/hello.out

# GPU interconnect test
sbatch nccl.sh
tail -f outputs/nccl.out

# Container test
sbatch enroot.sh
tail -f outputs/enroot.out
```
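While the jobs run, standard Slurm tooling shows their state without tailing output files (`sacct` assumes job accounting is enabled in the cluster):

```shell
squeue                                          # jobs queued or running
sacct -X --format=JobID,JobName,State,Elapsed   # completed jobs and their outcome
```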