Releases: GoogleCloudPlatform/cluster-toolkit
v1.45.1 Fix XPK issues running gcluster in container environments
Release v1.45.0
Highlights:
- A3 Ultra GKE blueprints updated to use Kueue 0.10.0 and Jobset 0.7.2, both of which are now supported.
- Module improvements to support GKE cluster deletion protection, default node pools with shielded instances, the latest GKE version in the Rapid channel for A3 Ultra clusters, configurable upgrade settings for node pools, and managed Hyperdisk support.
- Example for running NVIDIA NeMo on a3-ultragpu-8g Slurm clusters
What's Changed
Key New Features 🎉
- Integrating kueue v0.10.0 to enable TAS with rank ordering support by @ighosh98 in #3417
- Add max_distance variable by @alyssa-sm in #3413
- Enable hierarchical namespace support in cloud-storage-bucket module by @SwarnaBharathiMantena in #3513
- Add NeMo framework example to a3-Ultra by @akiki-liang0 in #3477
- Remove Slurm-GCP v5 modules from Cluster Toolkit (refer to modules/README.md) in #3497
Module Improvements 🔨
- Expose cluster deletion protection by @annuay-google in #3392
- Parallelstore striping config by @dgouju in #3333
- Make upgrade settings configurable for gke-cluster by @ighosh98 in #3462
- Add kubectl provider in root module for blueprint with GKE cluster module setup by @mohitchaurasia91 in #3406
- add GKE support for managed hyperdisk by @chengcongdu in #3476
- Add shielded instance config to default pool by @annuay-google in #3507
- Add support for Red Hat 7, 8 and 9 to startup-script Ansible by @wiktorn in #3487
Improvements 🛠
- Enable reservations support for kueue integration tests by @ighosh98 in #3424
- Update Kueue TAS Test Definition and add Kueue v0.10.0 toleration by @ighosh98 in #3425
- Update README with GKE parallelstore related example blueprint details by @mohitchaurasia91 in #3409
- Upgrade a3-ultra to use kueue v0.10.0 by @ighosh98 in #3438
- A3 Ultra Integration tests by @ighosh98 in #3453
- Update A3U blueprint to remove commit refs and remove hardcoded network names by @ighosh98 in #3456
- Add compact placement validations by @parulbajaj01 in #3439
- Allow customization of Parallelstore mounts by @wiktorn in #3144
- Add lifecycle rule to ignore local SSDs by @chajath in #3450
- update a3mega nccl plugin to 1.0.7 and rxdm to 1.0.13_1 by @chengcongdu in #3466
- Enable optional creation of cloud router/nat for vpcs by @abbas1902 in #3499
- Use version prefix in conjunction with release channels by @annuay-google in #3520
- Bump jobset version to 0.7.2 and remove v0.7.1 as valid version by @ankitkinra in #3517
- Include MemSpecLimit when calculating defmem by @wiktorn in #3300
Deprecations 💤
- Remove slurm-gcp v5 tests by @harshthakkar01 in #3493
- Remove slurm-gcp v5 examples and update documentation by @harshthakkar01 in #3494
- Remove Slurm-gcp v5 modules and update documentation by @harshthakkar01 in #3497
Bug fixes 🐞
- Fix GKE parallelstore blueprint name going beyond network char limit by @mohitchaurasia91 in #3432
- Updated ansible playbook test file name by @mohitchaurasia91 in #3433
- TAS Plugin Bug fix by @ighosh98 in #3449
- Update mount-daos.sh by @samskillman in #3457
- Placement policy null condition checks added by @ighosh98 in #3459
New Contributors
- @SwarnaBharathiMantena made their first contribution in #3377
- @chajath made their first contribution in #3451
- @parulbajaj01 made their first contribution in #3439
Full Changelog: v1.44.2...v1.45.0
v1.44.2: Fix for Slurm autoscaler support for future reservations
Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters
Release notes v1.44.1
This release announces Toolkit support for the new A3 Ultra machine type from Google Cloud. This machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking and RDMA support via RoCE.
The release includes 4 blueprints that maximize performance for the machine type:
- A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
- A GKE blueprint that provisions an A3 Ultra compute node pool
- An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing.
- A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking
Example solutions using NCCL are provided for blueprints running under a scheduler.
v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support
What's Changed
Key New Features 🎉
- update terraform provider to 6.12.0 by @ighosh98 in #3356
- Add future reservation support by @abbas1902 in #3227
- Update terraform provider to 6.13.0 by @alyssa-sm in #3367
- GKE clusters can now be duplicated by changing only the deployment name by @annuay-google in #3322
- GPU-VPC module by @cdunbar13 in #3391
Module Improvements 🔨
- Add dynamic setup of gpu_limit in gke-job-template module by @mohitchaurasia91 in #3319
- Add reservations to vm-instance by @cdunbar13 in #3327
- Revert "integrate tas plugin bug fixes" by @ighosh98 in #3344
- Add change to conditionally perform pip install based on gke_node_pool machine_type by @mohitchaurasia91 in #3341
- Update gke-cluster module addon-config for enabling parallelstore csi driver by @mohitchaurasia91 in #3357
- make upgrade settings configurable by @ighosh98 in #3359
- gke v1.31 added to acceptable list for a3-mega by @sharabiani in #3388
- An option added to disable/enable workload script execution by @sharabiani in #3389
- Adding network_profile to the VPC modules by @cdunbar13 in #3387
- TopologyAwareScheduling enabled by default for Kueue v0.9.1 by @sharabiani in #3396
Improvements 🛠
- Implemented kueue tests by @ighosh98 in #3315
- add support for kueue v0.9.1 by @ighosh98 in #3321
- Bump jobset version to 0.7.1 by @annuay-google in #3318
- Update custom TAS scripts to support A3U by @ighosh98 in #3295
- integrate tas plugin bug fixes by @ighosh98 in #3339
- Add multi-mount parallelstore support by @harshthakkar01 in #3256
- Update image-builder.yaml link in README by @ighosh98 in #3373
- Update OpenFOAM tutorial by @wkharold in #3342
- [Cherry Pick] add reservations for kueue integration tests by @ighosh98 in #3431
- [Cherry Pick] Update README and related test setup for GKE managed parallelstore blueprint by @mohitchaurasia91 in #3437
Version Updates ⏫
- Promote the new nic-types in vm-instance by @cdunbar13 in #3288
Bug fixes 🐞
- Fix default ssd config to ephemeral storage by @ankitkinra in #3317
- Set 'enable_private_endpoint' to false by @pawloch00 in #3364
Full Changelog: v1.43.1...v1.44.0
v1.43.1: Patch version bump in OFE
What's Changed
Version Updates ⏫
- Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358
Full Changelog: v1.43.0...v1.43.1
v1.43.0: GKE and networking enhancements
What's Changed
Key New Features 🎉
- add support for kueue v0.9.0 to enable Topology Aware Scheduling by @ighosh98 in #3277
- RDMA networking and multi-zone cluster support for GKE A3-Ultra by @annuay-google in #3299
- Improved reservations support for GKE A3-Ultra by @annuay-google in #3298
Module Improvements 🔨
- Add cloud rdma drivers into startup script module by @abbas1902 in #3289
- add support for enabling DCGM monitoring in GKE by @chengcongdu in #3279
- add GKE support for node local dns by @chengcongdu in #3280
- Update topology-scheduler-scripts.yaml by @thisSIDEofRANDOM in #3286
Improvements 🛠
- add firewall to allow tcp traffic for parallelstore by @chengcongdu in #3262
- Allow specifying GKE's system node pool disk properties by @ankitkinra in #3268
- Adds Cluster Toolkit Dockerfile for backend integration with XPK by @RachaelSTamakloe in #3237
- Add cluster and hostname as cloud ops labels by @abbas1902 in #3163
- Print job detail for gke-storage-parallelstore integration test by @mohitchaurasia91 in #3264
- Add max_pods_per_node for GKE cluster and nodepool by @pawloch00 in #3197
- Fix image-building.md link by @jemish-google in #3287
- Updating Cluster Toolkit Dockerfile README by @RachaelSTamakloe in #3290
- Updating Go version in Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3301
- Add Integration Test to Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3302
- XPK blueprint updates to make it more compatible with the tool by @ankitkinra in #3192
- expanding subnetwork_cidr_suffix by @ighosh98 in #3304
- Updating Slurm-GCP to 6.8.6 by @cdunbar13 in #3336
Bug fixes 🐞
- Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
- remove GKE reservation validation for local ssd NVMe/SCSI interface by @chengcongdu in #3281
New Contributors
- @pawloch00 made their first contribution in #3197
- @thisSIDEofRANDOM made their first contribution in #3286
- @jemish-google made their first contribution in #3287
Full Changelog: v1.42.0...v1.43.0
v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration
What's Changed
Key New Features 🎉
- Add support for custom Docker daemon configuration by @tpdownes in #3201
- Adopt local SSD storage for A3 docker images by @tpdownes in #3206
- Adopt google Terraform plugin v6.10.0 and drop support for 5.x by @tpdownes in #3189
- Add support to perform GCP maintenance as slurm job by @harshthakkar01 in #3152
- Add support for Filestore deletion protection by @tpdownes in #3183
Module Improvements 🔨
- Updating notebook module to use workbench_instance by @jrossthomson in #3139
- Initial commit for new logging output by @cdunbar13 in #3150
- SlurmGCP. "All or nothing" bulk insert on requests with placements by @mr0re1 in #3157
- Remove redundant provisioner for printing image name by @cdunbar13 in #3151
- Add direct Terraform support for Slurm SchedulerParameters and PrivateData by @tpdownes in #3164
- Add `use_job_duration` option by @abbas1902 in #3142
- Improvements for CloudSQL by @wiktorn in #3147
- Improve Error Message with Reservation Validation by @arajmane-g in #3174
Improvements 🛠
- Use local paths to embedded modules throughout Toolkit by @tpdownes in #3102
- Update default value for subnetwork_project to null by @alyssa-sm in #3193
- GKE: update default taints for user node pools by @ankitkinra in #3200
- Update MTU for a3 mega for GKE based on best practices by @ankitkinra in #3175
- add training example for gke parallelstore blueprint by @chengcongdu in #3181
- Update maintenance.py to support additional format by @alyssa-sm in #3208
- Allow latest Terraform google plugin by @tpdownes in #3213
- update a3 machines local ssd to use nvme instead of scsi for better performance by @chengcongdu in #3232
- Improve fetching and caching job details by @harshthakkar01 in #3194
- SlurmGCP. Add `set -e` to prolog mux by @mr0re1 in #3215
- add gpu health check in prolog and epilog by @NinaCai in #3134
Deprecations 💤
- Delete the new-project module to support adoption of TPG v6 by @RachaelSTamakloe in #3171
- Delete Daos Example Blueprints to support adoption of TPG v6 by @RachaelSTamakloe in #3172
Version Updates ⏫
- Bump integration test to support Go 1.23 by @mohitchaurasia91 in #3154
- Bump go version 1.21 -> 1.22 by @mohitchaurasia91 in #3156
- Update bucket module within Slurm controller module by @tpdownes in #3161
- update vm-instance module to support TPG v6 by @RachaelSTamakloe in #3166
- Update IP address module within VPC module by @tpdownes in #3160
- update Batch module to be compatible with TPG v6 by @RachaelSTamakloe in #3187
- update HTCondor modules to be compatible with TPG v6 by @RachaelSTamakloe in #3186
- Update Slurm-GCP v5 to 5.12.1 by @tpdownes in #3185
- Update workload-identity submodule from v29 to v34 by @RachaelSTamakloe in #3196
- Update ml-slurm examples to use recent copies of pytorch and tensorflow by @tpdownes in #3226
- Make gke-node-pool compatible with TPG 6.x by @tpdownes in #3230
Bug fixes 🐞
- Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
- Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
- Fix the gpu_installation_config default for the case where there is no customer input by @ankitkinra in #3259
- SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266
New Contributors
- @linsword13 made their first contribution in #3211
- @NinaCai made their first contribution in #3134
Full Changelog: v1.41.0...v1.42.0
v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support
What's Changed
Key New Features 🎉
New Modules 🧱
- resource-policy module implemented by @sharabiani in #3066
- gke-topology-scheduler module implemented by @sharabiani in #3080
- add GKE support for parallelstore through gke-storage module by @chengcongdu in #3120
Module Improvements 🔨
- Added compatibility check for GPUDirect and GKE version by @sharabiani in #3079
- Support template file for kueue configuration in kubectl-apply module by @sharabiani in #3111
- Implement xpk-gke-a3-megagpu blueprint by @sharabiani in #3108
- Use sackd for the login nodes by @mr0re1 in #3126
- gke-node-pool default name conflict fixed by @sharabiani in #3127
- improve dws_flex ux by @abbas1902 in #3122
- Include deployment name in Spack and Ramble bucket names (like startup-script) by @rohitramu in #3136
Improvements 🛠
- Create and use non-default service accounts in GKE by @annuay-google in #3123
- Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
- Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129
Deprecations 💤
- Freeze slurm-gcp v5 hybrid blueprints with the latest cluster toolkit version support by @harshthakkar01 in #3117
- Update Slurm-gcp v5 deprecation details by @harshthakkar01 in #3118
- Update badge for slurm-gcp v5 and slurm-gcp v6 by @harshthakkar01 in #3116
Version Updates ⏫
- Update A3-High NeMo to 24.07 and NCCL solution to latest recommended values by @akiki-liang0 in #3130
- Update Slurm-GCP to 6.8.2 by @tpdownes in #3132
Bug fixes 🐞
- Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
- Provide explicit project information by @wiktorn in #3060
- Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
- Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125
New Contributors
- @akiki-liang0 made their first contribution in #3130
- @ighosh98 made their first contribution in #3124
Full Changelog: v1.40.1...v1.41.0
v1.40.1 Fix issue that affected GKE blueprints due to dynamic provisioning
What's Changed
Other changes
- Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115
Full Changelog: v1.40.0...v1.40.1