Soperator release v1.18.0-1 (#198)

* Add support for multiple ssh keys in Soperator NFS module
* change mount path to home
* Update soperator/installations/example/main.tf
  Co-authored-by: Dmitry Starov <[email protected]>
* MSP-3514: fix collect logs from workers
* add maintenance and use_default_apparmor_profile
* bump-prom-0.79.2
* Bump cert-manager v1.16.2
* [ADD] Editorconfig
* [ENH] Get rid of legacy ENV usage
* [ENH] Allow skipping data downloading
* [ENH] Put logs into result dir
* [ADD] Support for MLFlow logging
* [ADD] Create results directory
* [ENH] Get rid of explicit log dir
* [ENH] Justify MLFlow tags
* [ENH] Handle `cfg` not present in `params`
* [ADD] Implement custom block metrics for benchmark
* [ADD] Example MLFlow env var setting script
* [ENH] Use constant seed
* [ENH] Don't wait for metrics to be sent
* [ADD] Introduce `samples_per_training_step` param
* [ADD] Introduce `timeToRun` metric
* [ENH] Justify params, metrics and keys
* [ADD] Support for exporting metrics from MetricsLogger
* [ADD] Implement extra metrics
  Params:
  - training/samples
  - validation/samples
  - validation/block/size
  - validation/block/samples
  - validation/step/samples
  Metrics:
  - duration/training/block
  - samplesPerSecond/training/step
  - samplesPerSecond/validation/block
  - samplesPerSecond/validation/step
* [ENH] Remove redundant step metric collection from logger + fix calculation
* [ENH] Remove incorrect H200 config [MSP-3612]
* [FIX] Metric names not working during MLFlow's local file store exporting
* [DOC] Support skipping data download jobs on init
* [DOC] Provide docs for running GPT3 benchmark with MLFlow
* [FIX] Get subnet's CIDR from status rather than from spec [MSP-3950]
* MSP-3518: add slurmdbd_config and slurm_config configurable
* [ENH] Remove redundant repo downloading for SD and unify it with GPT3 [MSP-3920]
* added variable for region
* update image rbac proxy
* [ENH] Precise calculation of allocatable CPU & RAM
* [ENH] Take less ephemeral storage for reserve
* [ADD] Introduce `region` variable [MSP-3541]
* [ADD] Introduce regional support for platforms [MSP-4025]
* [ADD] Check for regional support for platforms [MSP-4025]
* [ADD] Check for minimal boot disk size [MSP-3632]
* [REM] Get rid of NodePort support [MSP-4043]
* [ENH] Add tflint configs
* [ADD] Provide static IP for login service [MSP-3801]
* [FIX] Set real worker memory size as default real memory size for nodes in config
* MSP-4096: fix bugs with dependency mariadb-operator from monitoring stack
* change of registry
* bump subversion 3
* Add disk name suffix to be able to create several nfs disks in the same project
* Add itechdima to Soperator codeowners
* NOTIC: Improve default tfvars
* NOTIC: Make slurm_operator_stable configurable via terraform.tfvars
* NOTIC: Create nfs-server before k8s cluster
* NOTIC: Add instructions for opening logs explorer to nebius o11y motd
* NOTIC: Scrape more samples by default, and collect less metrics for kube-state-metrics
* 190: Deploy k8up operator
* NOTIC: Allocate more resources for MariaDB
* 424: Use protected secret for MariaDB by default
* NOTIC: Drop personal installations from .gitignore
* NOTIC: Round up node memory limits to integer number of GiBs
* NOTIC: Bump Soperator version 1.17.0-3 -> 1.18.0-1

---------

Co-authored-by: Alex Kim <[email protected]>
Co-authored-by: Dmitry Starov <[email protected]>
Co-authored-by: Uburro <[email protected]>
Co-authored-by: Grigorii Rochev <[email protected]>
Co-authored-by: Boris Popov <[email protected]>
Co-authored-by: Dzmitry Amialiusik <[email protected]>