Add disk name suffix to be able to create several nfs disks in the same project #193

Merged
49 commits
841f7d0
fix publix ip assignment for gpu nodes
Dec 18, 2024
06e7be0
fix storage
Dec 18, 2024
7d399c3
Change terraform provider source
Dec 19, 2024
b604b27
refactor into module
Dec 27, 2024
4051141
refactor into module
Dec 27, 2024
d4d7b1e
dsvm initial commit
Dec 27, 2024
c240c3b
Output variables fix
Dec 27, 2024
f3f2173
add shared filesystems
Jan 2, 2025
a3543ad
add multiuser support
Jan 3, 2025
33e8d3f
add multiuser support
Jan 3, 2025
00dc061
Updated README in observabilty section for mk8s to contain links to a…
shoguevara Jan 7, 2025
1c2c555
Update README.md in observability section for k8s-training
shoguevara Jan 7, 2025
b705104
Merge pull request #145 from nebius/fix-readme-loki-sa-accesskey
elijah-k-nebius Jan 7, 2025
d04a6d7
Merge pull request #151 from nebius/release/soperator
Uburro Jan 9, 2025
5cddf73
clean up a bit
Jan 14, 2025
f06db5b
TECHDOCS-963: update the Wireguard solution description
Jan 14, 2025
1869527
clean up a bit
Jan 14, 2025
75a4d61
Merge pull request #122 from nebius/fix/change-tfprovider-source
malibora Jan 20, 2025
0e11caf
First text version
Jan 20, 2025
7b4da4c
Improvements
Jan 20, 2025
e182352
fix: minimal fixes for eu-west1 (#172)
PhilipMantrov Jan 21, 2025
eb5435d
Update terraform.yml
jadnov Jan 23, 2025
9919683
Edits after an internal review
Jan 23, 2025
63bfea3
add dynamic tenant selection to envrc
Jan 24, 2025
737dbcf
move files into sub folder
Jan 24, 2025
f78d8bb
move files into sub folder
Jan 24, 2025
8f9d9a7
Feature/bastion module (#174)
jadnov Jan 29, 2025
217f291
test added
Jan 29, 2025
46c6fcb
NFS Server disk mount fixed; (#175)
elijah-k-nebius Jan 29, 2025
5994bb2
update image rbac proxy
Uburro Jan 20, 2025
bc405c0
Merge pull request #179 from nebius/soperator-release-1.17.0-2
Uburro Jan 30, 2025
8eeb945
added s3 config inside of the vm
Jan 30, 2025
9db28f7
Update provider.tf
jadnov Jan 31, 2025
c47fe9f
refactoring
Jan 31, 2025
09e9ecd
changed storage url for cli, wireguard
Jan 31, 2025
69cc3a7
Merge pull request #180 from nebius/feature/new-storage-url
malibora Jan 31, 2025
a6015f3
Merge branch 'main' into feature/TECHDOCS-963-Wireguard
alena-linki Feb 3, 2025
3b10c6c
Fixed wireguard executable url to point to the raw executable to down…
panukosk Feb 3, 2025
b0c6de0
Merge pull request #184 from nebius/fix/wireguard-executable-url
malibora Feb 4, 2025
b421ba4
Merge pull request #159 from nebius/feature/simple-solutions
malibora Feb 4, 2025
4d9754a
Merge pull request #158 from nebius/feature/TECHDOCS-963-Wireguard
malibora Feb 4, 2025
05b5d4c
Merge branch 'main' into feature/dsvm
malibora Feb 4, 2025
8c08aa9
Merge pull request #138 from nebius/feature/dsvm
malibora Feb 4, 2025
b04e5e8
change owner rights in soperator path
Uburro Feb 4, 2025
ecefd72
Merge pull request #188 from nebius/change-owner-rights-soperator
Uburro Feb 4, 2025
f694909
deleted slurm-compute module
Feb 5, 2025
6fa561f
Merge pull request #192 from nebius/feature/delete-slurm-compute-b
malibora Feb 5, 2025
c400c60
Add disk name suffix to be able to create several nfs disks in the sa…
itechdima Feb 6, 2025
e16000e
Merge branch 'release/soperator' into 189-support-several-nfs-servers…
rdjjke Feb 10, 2025
8 changes: 5 additions & 3 deletions .github/workflows/terraform.yml
@@ -16,6 +16,7 @@ concurrency:

env:
TF_VAR_parent_id: project-e00pjzzrtk1fs3yavy
TF_VAR_tenant_id: tenant-e00f3wdfzwfjgbcyfv

jobs:
terraform:
@@ -29,8 +30,9 @@ jobs:
solution:
- name: k8s-inference
- name: k8s-training
- name: slurm
- name: wireguard
- name: dsvm
- name: bastion

defaults:
run:
@@ -72,7 +74,7 @@ jobs:

- name: Install Nebius CLI
run: |
- curl -sSL https://storage.ai.nebius.cloud/nebius/install.sh | bash
+ curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
echo "${HOME}/.nebius/bin" >> $GITHUB_PATH

- name: Nebius CLI init
@@ -170,7 +172,7 @@ jobs:

- name: Install Nebius CLI
run: |
- curl -sSL https://storage.ai.nebius.cloud/nebius/install.sh | bash
+ curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
echo "${HOME}/.nebius/bin" >> $GITHUB_PATH

- name: Nebius CLI init
2 changes: 1 addition & 1 deletion CODEOWNERS
@@ -2,4 +2,4 @@
* @malibora @jadnov @elijah-k-nebius @rdjjke @asteny

.github/workflows @malibora @d3vil-st @elijah-k-nebius
- soperator @dstaroff @asteny @rdjjke @Uburro
+ soperator/ @dstaroff @asteny @rdjjke @Uburro
2 changes: 1 addition & 1 deletion README.md
@@ -20,7 +20,7 @@ This repository is a curated collection of Terraform and Helm solutions designed

For those who prefer containerized environments, our Kubernetes solution includes GPU-Operator and Network-Operator. This setup ensures that your training workloads use dedicated GPU resources and optimized network configurations, both of which are critical components for AI models that require a lot of computational power. GPU-Operator simplifies the management of NVIDIA GPUs, automating the deployment of necessary drivers and plugins. Similarly, the Network-Operator improves network performance, ensuring seamless communication throughout your cluster. The cluster uses InfiniBand technology, which provides the fastest host connections for data-intensive tasks.

- [SLURM](./slurm/README.md)
+ [SLURM](./soperator/README.md)

Our SLURM solutions offer a streamlined approach for users who prefer traditional HPC environments. These solutions include ready-to-use images pre-configured with NVIDIA drivers and are ideal for those looking to take advantage of SLURM’s robust job scheduling capabilities. Similar to our Kubernetes offerings, the SLURM solutions are optimized for InfiniBand connectivity, ensuring peak performance and efficiency in data transfer and communication between nodes.

140 changes: 140 additions & 0 deletions bastion/README.md
@@ -0,0 +1,140 @@
# Bastion instance

This Terraform solution deploys a Bastion instance that serves as a secure jump host for your infrastructure.
It improves security by minimizing the use of public IPs and limiting access to the rest of the environment.

It also creates a service account with a generated authorization key pair to authenticate the Nebius CLI on the host.

The following is also installed on the host:
- Wireguard VPN solution with UI
- Nebius CLI, configured with a profile authenticated by the service account
- kubectl, configured to connect to the first mk8s cluster available in the project using the `--internal` flag
(found with: `nebius mk8s v1 cluster list`)

## How to connect over bastion

### Edit your local SSH config

`~/.ssh/config`

```
Host bastion
HostName <public_ip_of_bastion_host>
User bastion
IdentityFile ~/.ssh/private.key

Host target
HostName <private_ip_of_host_after_bastion>
User ubuntu
IdentityFile ~/.ssh/private.key
ProxyJump bastion
```

### Log in to the remote VM behind the bastion
```
ssh target
```
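
If you would rather not edit `~/.ssh/config`, the same hop can probably be expressed as a one-off command with the OpenSSH `-J` (ProxyJump) option. The sketch below only assembles and prints the command. The IP addresses are placeholder values, and it assumes the private key is already loaded into `ssh-agent`, because command-line options such as `-i` apply to the destination host rather than to the jump host.

```bash
#!/usr/bin/env bash
# Placeholder addresses (assumptions for illustration only).
BASTION_IP="203.0.113.10"   # public IP of the bastion host
TARGET_IP="10.0.0.5"        # private IP of the host behind the bastion

# Build the one-off ProxyJump command.
# The user names match the ssh config shown above.
build_jump_command() {
  local bastion_ip="$1" target_ip="$2"
  printf 'ssh -J bastion@%s ubuntu@%s\n' "${bastion_ip}" "${target_ip}"
}

JUMP_CMD="$(build_jump_command "${BASTION_IP}" "${TARGET_IP}")"
echo "${JUMP_CMD}"
```

The bastion's public IP is exposed as the `bastion_host_public_ip` Terraform output, so after `terraform apply` it should be retrievable with `terraform output -raw bastion_host_public_ip`.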

## Prerequisites

1. Install [Nebius CLI](https://docs.nebius.dev/en/cli/#installation):
```bash
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
```

2. Reload your shell session:

```bash
exec -l $SHELL
```

or

```bash
source ~/.bashrc
```

3. [Configure](https://docs.nebius.ai/cli/configure/) Nebius CLI (we recommend using [service account](https://docs.nebius.ai/iam/service-accounts/manage/)):
```bash
nebius init
```

4. Install jq (for Debian-based distributions):
```bash
sudo apt install jq -y
```
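
Before moving on, it may help to confirm that the tools used in this README are on your `PATH`. The following is a minimal sketch: the helper name `missing_tools` is hypothetical, and `terraform` is assumed to be installed separately because the steps above do not cover it.

```bash
#!/usr/bin/env bash
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "${tool}" >/dev/null 2>&1 || echo "${tool}"
  done
}

# Tools referenced in this README.
MISSING="$(missing_tools nebius jq terraform)"
if [ -n "${MISSING}" ]; then
  echo "Missing tools:" ${MISSING}
else
  echo "All prerequisites found."
fi
```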

## Installation

To deploy the solution, follow these steps:

1. Load environment variables:
```bash
source ./environment.sh
```
2. Initialize Terraform:
```bash
terraform init
```
3. Replace the placeholder content in `terraform.tfvars` with the configuration values that you need. See the details [below](#configuration-variables).
4. Preview the deployment plan:
```bash
terraform plan
```
5. Apply the configuration:
```bash
terraform apply
```
Wait for the operation to complete.

## Configuration variables

Update the following variables in the `terraform.tfvars` file with your own values:

- `tenant_id`
- `parent_id`
- `subnet_id`
- `ssh_user_name`
- `ssh_public_key`
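
Terraform also reads variable values from environment variables named `TF_VAR_<name>`, which is how the CI workflow in this repository passes `parent_id` and `tenant_id`. As a sketch, the string variables could be exported instead of being written to `terraform.tfvars`. The IDs below are made-up placeholders.

```bash
#!/usr/bin/env bash
# Placeholder IDs (assumptions); replace them with your own values.
export TF_VAR_tenant_id="tenant-example"
export TF_VAR_parent_id="project-example"
export TF_VAR_subnet_id="vpcsubnet-example"
export TF_VAR_ssh_user_name="bastion"

# List the Terraform variables that are now set in the environment.
env | grep '^TF_VAR_' | sort
```

`ssh_public_key` is an object with `key` and `path` fields, so it is usually simpler to keep it in `terraform.tfvars`.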

## Creating and using a public IP allocation

This step allows you to retain the IP address even if the VM is deleted. If you don’t need to keep the IP address, skip this section.

1. Create a public IP allocation:
```bash
nebius vpc v1 allocation create --ipv-4-public \
--parent-id <project-id> --name wireguard_allocation_pub \
--format json | jq -r '.metadata.id'
```
2. Assign the value from the previous step to the `public_ip_allocation_id` variable in [variables.tf](./variables.tf):

```bash
public_ip_allocation_id = <public_ip_allocation_id>
```
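
One way to carry the ID from step 1 into Terraform without editing files by hand is to write it to a `*.auto.tfvars` file, which Terraform loads automatically. In this sketch the allocation ID is a placeholder standing in for the output of the `nebius vpc v1 allocation create` command above. The file name `allocation.auto.tfvars` is an arbitrary choice, and the file is written to a temporary directory so that the sketch is safe to run. For real use it would sit next to `main.tf`.

```bash
#!/usr/bin/env bash
# Placeholder (assumption): in practice, capture the output of step 1 here.
ALLOCATION_ID="vpcallocation-example"

# Hypothetical file name; Terraform automatically loads any *.auto.tfvars
# file found in the directory it runs in.
TFVARS_FILE="${TMPDIR:-/tmp}/allocation.auto.tfvars"

# String values in tfvars files must be quoted.
printf 'public_ip_allocation_id = "%s"\n' "${ALLOCATION_ID}" > "${TFVARS_FILE}"
cat "${TFVARS_FILE}"
```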

### Logging into Wireguard UI

1. SSH into the Wireguard instance:
```bash
ssh -i <path_to_private_ssh_key> <ssh_user_name>@<instance_public_ip>
```

2. Retrieve the Wireguard UI password:
```bash
sudo cat /var/lib/wireguard-ui/initial_password
```

3. Open the Wireguard UI in your browser:
```
http://<instance_public_ip>:5000
```

4. Log in with the following credentials:
- **Username:** `admin`
- **Password:** [password retrieved in step 2]

### Notes

- **Apply Config:** After creating, deleting or changing Wireguard users, select "Apply Config".
- **Allowed IPs:** When adding new users, specify the CIDRs of your existing infrastructure in the "Allowed IPs" field.
8 changes: 8 additions & 0 deletions bastion/disks.tf
@@ -0,0 +1,8 @@
resource "nebius_compute_v1_disk" "bastion-boot-disk" {
parent_id = var.parent_id
name = "bastion-boot-disk"
block_size_bytes = 4096
size_bytes = 64424509440
type = "NETWORK_SSD"
source_image_family = { image_family = "ubuntu22.04-driverless" }
}
1 change: 1 addition & 0 deletions slurm/environment.sh → bastion/environment.sh
@@ -1,3 +1,4 @@
#!/bin/sh
unset NEBIUS_IAM_TOKEN
export NEBIUS_IAM_TOKEN=$(nebius iam get-access-token)
export TF_VAR_iam_token=$NEBIUS_IAM_TOKEN
4 changes: 4 additions & 0 deletions bastion/locals.tf
@@ -0,0 +1,4 @@
locals {
ssh_public_key = var.ssh_public_key.key != null ? var.ssh_public_key.key : (
fileexists(var.ssh_public_key.path) ? file(var.ssh_public_key.path) : null)
}
32 changes: 32 additions & 0 deletions bastion/main.tf
@@ -0,0 +1,32 @@
resource "nebius_compute_v1_instance" "bastion_instance" {
parent_id = var.parent_id
name = "bastion-instance"

boot_disk = {
attach_mode = "READ_WRITE"
existing_disk = nebius_compute_v1_disk.bastion-boot-disk
}

network_interfaces = [
{
name = "eth0"
subnet_id = var.subnet_id
ip_address = {}
public_ip_address = {}
}
]

resources = {
platform = "cpu-e2"
preset = "4vcpu-16gb"
}

cloud_init_user_data = templatefile("../modules/cloud-init/bastion-cloud-init.tftpl", {
ssh_user_name = var.ssh_user_name
ssh_public_key = local.ssh_public_key
sa_private_key = local.sa_private_key
parent_id = var.parent_id
sa_public_key_id = local.sa_public_key_id
service_account_id = local.service_account_id
})
}
6 changes: 6 additions & 0 deletions bastion/output.tf
@@ -0,0 +1,6 @@
output "bastion_host_public_ip" {
value = trimsuffix(nebius_compute_v1_instance.bastion_instance.status.network_interfaces[0].public_ip_address.address, "/32")
}
output "bastion_service_account" {
value = nebius_iam_v1_service_account.bastion-sa.id
}
11 changes: 11 additions & 0 deletions bastion/provider.tf
@@ -0,0 +1,11 @@
terraform {
required_providers {
nebius = {
source = "terraform-provider.storage.eu-north1.nebius.cloud/nebius/nebius"
}
}
}

provider "nebius" {
domain = "api.eu.nebius.cloud:443"
}
37 changes: 37 additions & 0 deletions bastion/sa.tf
@@ -0,0 +1,37 @@
resource "tls_private_key" "bastion_sa_key" {
algorithm = "RSA"
rsa_bits = 4096
}

resource "nebius_iam_v1_service_account" "bastion-sa" {
parent_id = var.parent_id
name = "bastion-sa"
}

data "nebius_iam_v1_group" "admins-group" {
name = "editors"
parent_id = var.tenant_id
}

resource "nebius_iam_v1_group_membership" "bastion-sa-admin" {
parent_id = data.nebius_iam_v1_group.admins-group.id
member_id = nebius_iam_v1_service_account.bastion-sa.id
}

resource "nebius_iam_v1_auth_public_key" "bastion-sa-public-key" {
parent_id = var.parent_id
expires_at = timeadd(timestamp(), "8760h") # 1 Year expiration time
account = {
service_account = {
id = nebius_iam_v1_service_account.bastion-sa.id
}
}
data = tls_private_key.bastion_sa_key.public_key_pem
}

locals {
sa_public_key = tls_private_key.bastion_sa_key.public_key_pem
sa_private_key = tls_private_key.bastion_sa_key.private_key_pem
sa_public_key_id = nebius_iam_v1_auth_public_key.bastion-sa-public-key.id
service_account_id = nebius_iam_v1_service_account.bastion-sa.id
}
8 changes: 8 additions & 0 deletions bastion/terraform.tfvars
@@ -0,0 +1,8 @@
# tenant_id = ""
# parent_id = ""
# subnet_id = ""
# ssh_user_name = "bastion"
# ssh_public_key = {
# key = "put your public ssh key here"
# path = "put path to ssh key here"
# }
22 changes: 22 additions & 0 deletions bastion/test-resource.tf
@@ -0,0 +1,22 @@
locals {
test_bst_host = trimsuffix(nebius_compute_v1_instance.bastion_instance.status.network_interfaces[0].public_ip_address.address, "/32")
}

resource "null_resource" "check_bastion_instance" {
count = var.test_mode ? 1 : 0

connection {
user = var.ssh_user_name
host = local.test_bst_host
}

provisioner "remote-exec" {
inline = [
"set -eu",
"cloud-init status --wait",
"ip link show wg0",
"systemctl -q status [email protected] > /dev/null",
".nebius/bin/nebius iam whoami > /dev/null"
]
}
}
7 changes: 7 additions & 0 deletions bastion/tests/main.tftest.hcl
@@ -0,0 +1,7 @@
run "test_mode_bastion_apply" {
command = apply

variables {
test_mode = true
}
}
47 changes: 47 additions & 0 deletions bastion/variables.tf
@@ -0,0 +1,47 @@
variable "tenant_id" {
description = "Tenant ID."
type = string
}

variable "parent_id" {
description = "Project ID."
type = string
}

variable "subnet_id" {
description = "Subnet ID."
type = string
}

# SSH access
variable "ssh_user_name" {
description = "SSH username."
type = string
default = "bastion"
}

variable "ssh_public_key" {
description = "SSH Public Key to access the cluster nodes."
type = object({
key = optional(string),
path = optional(string, "~/.ssh/id_rsa.pub")
})
default = {}
validation {
condition = var.ssh_public_key.key != null || fileexists(var.ssh_public_key.path)
error_message = "SSH Public Key must be set by `key` or file `path` ${var.ssh_public_key.path}"
}
}

# Access By IP
variable "public_ip_allocation_id" {
description = "Id of a manually created public_ip_allocation."
type = string
default = null
}

variable "test_mode" {
description = "Switch between real usage and testing."
type = bool
default = false
}
Binary file added bastion/wireguard-ui
Binary file not shown.