v1.43.0: GKE and networking enhancements
What's Changed
Key New Features 🎉
- Add support for Kueue v0.9.0 to enable Topology Aware Scheduling by @ighosh98 in #3277
- RDMA networking and multi-zone cluster support for GKE A3-Ultra by @annuay-google in #3299
- Improved reservations support for GKE A3-Ultra by @annuay-google in #3298
Module Improvements 🔨
- Add Cloud RDMA drivers to the startup script module by @abbas1902 in #3289
- Add support for enabling DCGM monitoring in GKE by @chengcongdu in #3279
- Add GKE support for node-local DNS by @chengcongdu in #3280
- Update topology-scheduler-scripts.yaml by @thisSIDEofRANDOM in #3286
Improvements 🛠
- Add firewall rule to allow TCP traffic for Parallelstore by @chengcongdu in #3262
- Allow specifying GKE's system node pool disk properties by @ankitkinra in #3268
- Adds Cluster Toolkit Dockerfile for backend integration with XPK by @RachaelSTamakloe in #3237
- Add cluster and hostname as cloud ops labels by @abbas1902 in #3163
- Print job detail for gke-storage-parallelstore integration test by @mohitchaurasia91 in #3264
- Add max_pods_per_node for GKE cluster and nodepool by @pawloch00 in #3197
- Fix image-building.md link by @jemish-google in #3287
- Updating Cluster Toolkit Dockerfile README by @RachaelSTamakloe in #3290
- Updating Go version in Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3301
- Add Integration Test to Cluster Toolkit Dockerfile by @RachaelSTamakloe in #3302
- XPK blueprint updates to make it more compatible with the tool by @ankitkinra in #3192
- Expanding subnetwork_cidr_suffix by @ighosh98 in #3304
- Updating Slurm-GCP to 6.8.6 by @cdunbar13 in #3336
Bug fixes 🐞
- Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
- Remove GKE reservation validation for local SSD NVMe/SCSI interface by @chengcongdu in #3281
New Contributors
- @pawloch00 made their first contribution in #3197
- @thisSIDEofRANDOM made their first contribution in #3286
- @jemish-google made their first contribution in #3287
Full Changelog: v1.42.0...v1.43.0