Releases: GoogleCloudPlatform/cluster-toolkit

v1.45.1 Fix XPK issues running gcluster in container environments

22 Jan 18:13
a9dd634

What's Changed

Bug fixes 🐞

Full Changelog: v1.45.0...v1.45.1

Release v1.45.0

15 Jan 23:54
79299a1

Highlights:

  • A3 Ultra GKE blueprints updated to use Kueue 0.10.0 and Jobset 0.7.2, which are now supported.
  • Module improvements to support GKE cluster deletion protection, default node pools with shielded instances, the latest GKE version in the Rapid channel for A3 Ultra clusters, configurable upgrade settings for node pools, and managed hyperdisk support.
  • Example for running NVIDIA NeMo on a3-ultragpu-8g Slurm clusters.

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Bug fixes 🐞

New Contributors

Full Changelog: v1.44.2...v1.45.0

v1.44.2: Fix for Slurm autoscaler support for future reservations

09 Jan 00:21
484da6e

What's Changed

Bug fixes 🐞

  • Hotfix: Slurm autoscaler support for future reservations by @tpdownes in #3508

Full Changelog: v1.44.1...v1.44.2

Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters

30 Dec 23:36
346d015

Release notes v1.44.1

This release announces Toolkit support for the new A3 Ultra machine type from Google Cloud. This machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking that supports RDMA via RoCE.

The release includes 4 blueprints that maximize performance for the machine type:

  1. A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
  2. A GKE blueprint that provisions an A3 Ultra compute node pool
  3. An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing
  4. A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking

Example solutions using NCCL are provided for blueprints running under a scheduler.

v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support

19 Dec 22:55
6a19416

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Version Updates ⏫

Bug fixes 🐞

Full Changelog: v1.43.1...v1.44.0

v1.43.1: Patch version bump in OFE

12 Dec 20:02
0a8385b

What's Changed

Version Updates ⏫

  • Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358

Full Changelog: v1.43.0...v1.43.1

v1.43.0: GKE and networking enhancements

05 Dec 06:57
7ca11fc

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
  • remove GKE reservation validation for local ssd NVMe/CSCI interface by @chengcongdu in #3281

New Contributors

Full Changelog: v1.42.0...v1.43.0

v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration

20 Nov 19:27
1a1e22a

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
  • Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
  • Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
  • SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266

New Contributors

Full Changelog: v1.41.0...v1.42.0

v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support

25 Oct 16:58
26fafe0

What's Changed

Key New Features 🎉

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

  • Create and use non-default service accounts in GKE by @annuay-google in #3123
  • Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
  • Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
  • Provide explicit project information by @wiktorn in #3060
  • Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
  • Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125

New Contributors

Full Changelog: v1.40.1...v1.41.0

v1.40.1 Fix issue that affected GKE blueprints due to dynamic provisioning

10 Oct 01:20
eb00254

What's Changed

Other changes

  • Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115

Full Changelog: v1.40.0...v1.40.1