Soperator Terraform recipe v1.18.0-1

github-actions released this 13 Feb 20:52
· 21 commits to main since this release
fc795b9
Soperator release v1.18.0-1 (#198)

* Add support for multiple SSH keys in Soperator NFS module

* change mount path to home

* Update soperator/installations/example/main.tf

Co-authored-by: Dmitry Starov <[email protected]>

* MSP-3514: fix collect logs from workers

* add maintenance and use_default_apparmor_profile

* bump-prom-0.79.2

* Bump cert-manager v1.16.2

* [ADD] Editorconfig

* [ENH] Get rid of legacy ENV usage

* [ENH] Allow skipping data downloading

* [ENH] Put logs into result dir

* [ADD] Support for MLFlow logging

* [ADD] Create results directory

* [ENH] Get rid of explicit log dir

* [ENH] Justify MLFlow tags

* [ENH] Handle `cfg` not present in `params`

* [ADD] Implement custom block metrics for benchmark

* [ADD] Example MLFlow env var setting script

* [ENH] Use constant seed

* [ENH] Don't wait for metrics to be sent

* [ADD] Introduce `samples_per_training_step` param

* [ADD] Introduce `timeToRun` metric

* [ENH] Justify params, metrics and keys

* [ADD] Support for exporting metrics from MetricsLogger

* [ADD] Implement extra metrics

Params:
- training/samples
- validation/samples
- validation/block/size
- validation/block/samples
- validation/step/samples

Metrics:
- duration/training/block
- samplesPerSecond/training/step
- samplesPerSecond/validation/block
- samplesPerSecond/validation/step

* [ENH] Remove redundant step metric collection from logger + fix calculation

* [ENH] Remove incorrect H200 config [MSP-3612]

* [FIX] Metric names not working during MLFlow's local file store exporting

* [DOC] Support skipping data download jobs on init

* [DOC] Provide docs for running GPT3 benchmark with MLFlow

* [FIX] Get subnet's CIDR from status rather than from spec [MSP-3950]

* MSP-3518: add slurmdbd_config and slurm_config configurable

* [ENH] Remove redundant repo downloading for SD and unify it with GPT3 [MSP-3920]

* added variable for region

* update image rbac proxy

* [ENH] Precise calculation of allocatable CPU & RAM

* [ENH] Take less ephemeral storage for reserve

* [ADD] Introduce `region` variable [MSP-3541]

* [ADD] Introduce regional support for platforms [MSP-4025]

* [ADD] Check for regional support for platforms [MSP-4025]

* [ADD] Check for minimal boot disk size [MSP-3632]

* [REM] Get rid of NodePort support [MSP-4043]

* [ENH] Add tflint configs

* [ADD] Provide static IP for login service [MSP-3801]

* [FIX] Set real worker memory size as default real memory size for nodes in config

* MSP-4096: fix bugs with dependency mariadb-operator from monitoring stack

* change of registry

* bump subversion 3

* Add disk name suffix to be able to create several nfs disks in the same project

* Add itechdima to Soperator codeowners

* NOTIC: Improve default tfvars

* NOTIC: Make slurm_operator_stable configurable via terraform.tfvars

* NOTIC: Create nfs-server before k8s cluster

* NOTIC: Add instructions for opening logs explorer to nebius o11y motd

* NOTIC: Scrape more samples by default, and collect fewer metrics for kube-state-metrics

* 190: Deploy k8up operator

* NOTIC: Allocate more resources for MariaDB

* 424: Use protected secret for MariaDB by default

* NOTIC: Drop personal installations from .gitignore

* NOTIC: Round up node memory limits to integer number of GiBs

* NOTIC: Bump Soperator version 1.17.0-3 -> 1.18.0-1

---------

Co-authored-by: Alex Kim <[email protected]>
Co-authored-by: Dmitry Starov <[email protected]>
Co-authored-by: Uburro <[email protected]>
Co-authored-by: Grigorii Rochev <[email protected]>
Co-authored-by: Boris Popov <[email protected]>
Co-authored-by: Dzmitry Amialiusik <[email protected]>