Releases: GoogleCloudPlatform/cluster-toolkit

v1.14.0: HTCondor highly available, HCLS blueprint

07 Mar 21:40
e7c0c24

Key New Features

  • HCLS blueprint now supports running GROMACS on GPUs and includes several new tutorials.
  • Support for highly available HTCondor pools
    • Job queue (SchedD) high availability remains experimental; see README

Module Improvements

  • vpc: new option to enable a firewall rule that allows tunneling of Windows Remote Desktop connections
  • schedmd-slurm-gcp-v5-partition: all deprecated variables have been removed; these have migrated to schedmd-slurm-gcp-v5-node-group
  • htcondor-configure:
    • job history will now include VM instance ID, zone and machine type
    • VMs are now provisioned with minimally-permissioned IDTOKENs for their respective daemons (e.g. ADVERTISE_STARTD)
  • startup-script: installation script for Cloud Ops Agent on Debian platforms will retry when other processes are blocking apt operations
  • htcondor-execute-point: add a simple health check of port 9618 on any machine within the execute point
  • vm-instance: instances can be named using both a prefix and the deployment name
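
For readers who want to probe an execute point by hand, a simple TCP check against port 9618 can be sketched as below. This is only an illustration of the idea, not the health check that the htcondor-execute-point module configures; the function name `port_is_open` and the timeout value are assumptions made for this example.

```python
import socket


def port_is_open(host: str, port: int = 9618, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for manual checks; 9618 is the port named in the
    release note above.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A call such as `port_is_open("10.0.0.5")` (the address is a placeholder) returns False when nothing is listening on the port or the host is unreachable.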

Improvements

  • Improved error message when YAML blueprint has syntax errors preventing it from being loaded
  • Regular updates to Go and Python dependencies to address potential security vulnerabilities
  • Fixed Open Front End (OFE) issue with static content (icons) not displaying properly

What's Changed

  • Add Windows Remote Desktop IAP firewall rule by @tpdownes in #885
  • Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #886
  • Fix: having the same share name and local mount caused slurm failure by @nick-stroud in #887
  • Implicitly add outputs to modules when they are being used across deployment groups by @tpdownes in #878
  • Reorder validator list to test blueprint correctness first by @heyealex in #889
  • Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #890
  • Adding example sbatch and config for Factor Xa protein by @nick-stroud in #888
  • Update HCLS blueprint examples to run Gromacs w/ GPUs by @nick-stroud in #891
  • Add support for highly available HTCondor Central Managers by @tpdownes in #892
  • Remove deprecations from slurm-gcp v5 partition by @heyealex in #893
  • Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @tpdownes in #896
  • Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #894
  • Ensure Open Front End dependabot updates target develop by @tpdownes in #897
  • Fix: Always generate Batch instance template to avoid known at apply time error by @nick-stroud in #898
  • Update sbatch to copy results to output bucket & minor tweaks by @nick-stroud in #895
  • Add troubleshooting documentation for filestore share name exportfs bug by @nick-stroud in #899
  • Bump github.com/hashicorp/hcl/v2 from 2.16.0 to 2.16.1 by @dependabot in #900
  • Fix addlicense check in weekly image building by @tpdownes in #901
  • Update HTCondor modules by @tpdownes in #902
  • Bump django from 3.2.16 to 3.2.17 in /community/front-end/ofe by @dependabot in #905
  • Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #906
  • Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #907
  • Bump TFLint version in image to latest by @tpdownes in #903
  • Add instructions to ssh to VM created by vm-instance by @nick-stroud in #880
  • Address terraform_unused_required_providers errors by @tpdownes in #908
  • Manage HTCondor yum repo configuration directly by @tpdownes in #904
  • Bump github.com/aws/aws-sdk-go from 1.33.0 to 1.34.0 by @dependabot in #911
  • Fail integration tests on validation warnings. by @mr0re1 in #910
  • Set minimal scopes for HTCondor IDTOKENs by @tpdownes in #919
  • Fix cmd/root_test.go test runs from linked Git worktrees. by @mr0re1 in #918
  • OFE update 14/02/2023. by @ek-nag in #913
  • Improve error message for yaml parsing failures by @heyealex in #923
  • HTCondor job track machine information by @tpdownes in #924
  • Update develop with release v1.13.0 by @nick-stroud in #928
  • Add explicit output dependencies to HTCondor by @tpdownes in #925
  • Pin terraform google provider to v4.53.1 by @nick-stroud in #929
  • Fix root_test failure on MacOS by @mr0re1 in #932
  • Update htcondor-configure README example snippet by @tpdownes in #935
  • Bump github.com/hashicorp/go-getter from 1.6.2 to 1.7.0 by @dependabot in #930
  • Add retries to cloud ops install by @heyealex in #933
  • Bump django from 3.2.17 to 3.2.18 in /community/front-end/ofe by @dependabot in #922
  • Remove unused error message by @tpdownes in #939
  • Bump google.golang.org/api from 0.109.0 to 0.110.0 by @dependabot in #937
  • Fix typo in hcls instructions command by @nick-stroud in #940
  • Update hcls example to use lysozyme protein instead of factor xa by @nick-stroud in #942
  • HTCondor Job Queue High Availability by @tpdownes in #934
  • Refactor useModule by @tpdownes in #941
  • Bugfix to HTCondor autoscaler script by @tpdownes in #945
  • Update hcls spack builder to use c2 machine by @nick-stroud in #948
  • Enable OS Login by default in HTCondor execute points by @tpdownes in #944
  • Remove dependency on unused module by @mr0re1 in #947
  • Add health check for HTCondor VMs by @tpdownes in #946
  • HCLS tutorial update by @nick-stroud in #950
  • Bump github.com/spf13/afero from 1.9.3 to 1.9.4 by @dependabot in #955
  • Bump github.com/zclconf/go-cty from 1.12.1 to 1.13.0 by @dependabot in #954
  • Allow VM instance name to include prefix and deployment name by @nick-stroud in #949
  • Use HTCondor Python bindings in autoscaler by @tpdownes in #951
  • Update hcls Lysozyme example to include visualization instructions by @nick-stroud in #958
  • Unify validator...

v1.13.0: HCLS Example Blueprint, New Chrome Remote Desktop Module, & Image Building Improvements

16 Feb 20:42
dda341f

Key New Features

Other Improvements

Version updates

  • install-htcondor: Update HTCondor release to current 10.x series.
  • Slurm on GCP updated to version 5.4.1 (PR).
  • Google Terraform Provider pinned to version 4.51.0 and will be bumped with each future release.

What's Changed

New Contributors

Full Changelog: v1.12.0...v1.13.0

v1.12.0: Google Cloud Storage module and Fluent Tutorial

31 Jan 21:29
73fb63f

Key New Features

New Resources

Improvements

  • Improved documentation and module automation for GPU support.
  • Various improvements in the ghpc engine code.

Bug Fixes

  • Fixed error when ghpc was run outside of the HPC-Toolkit folder (PR).
  • Fixed category-field bug preventing some users from deploying HPC monitoring dashboards.

Version updates

  • DAOS examples updated to use google-cloud-daos v0.3.0.
  • Slurm on GCP updated to version 5.4.0 (PR).
  • Updating cloud.google.com/go/compute from 1.15.0 to 1.15.1.
  • Updated google.golang.org/api from 0.106.0 to 0.108.0.
  • Google Terraform Provider pinned to version 4.49.0 and will be bumped with each future release.

What's Changed

New Contributors

Full Changelog: v1.11.0...v1.12.0

v1.11.0: Usability Improvements for GPUs, Validation of `use` Field, & Miscellaneous Slurm Improvements

18 Jan 21:55
d706498

Key New Features

  • GPU type and count are automatically populated when using A2 series machines for vm-instance and Slurm v5 node-group, controller, & login-node.
  • ghpc now validates that modules linked using the use field have common outputs and settings.
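
The idea behind the use-field validation can be sketched as a set intersection: a used module is only useful if at least one of its outputs matches an input of the module that uses it. The sketch below is a simplified illustration in Python, not the ghpc implementation; the dictionary layout (`outputs`, `inputs`, `use`) and the function name `unused_uses` are assumptions made for this example.

```python
def unused_uses(module: dict, modules_by_id: dict) -> list:
    """Return IDs listed in module['use'] that contribute nothing.

    A used module contributes when at least one of its output names is also
    an input name of the using module. The data layout is hypothetical.
    """
    unused = []
    accepted = set(module["inputs"])
    for used_id in module.get("use", []):
        provided = set(modules_by_id[used_id]["outputs"])
        if not (provided & accepted):
            unused.append(used_id)
    return unused


# Illustrative module definitions (names chosen for the example only).
modules = {
    "network1": {"outputs": ["network_self_link", "subnetwork_self_link"],
                 "inputs": [], "use": []},
    "homefs": {"outputs": ["network_storage"],
               "inputs": ["network_self_link"], "use": ["network1"]},
    "vm": {"outputs": [],
           "inputs": ["subnetwork_self_link", "machine_type"],
           "use": ["network1", "homefs"]},
}

print(unused_uses(modules["vm"], modules))  # prints ['homefs']
```

Here `vm` accepts nothing that `homefs` outputs, so that entry in its use list would be flagged.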

Resource Improvements

Version updates

  • Google Terraform Provider pinned to version 4.48.0 and will be bumped with each future release.

What's Changed

  • Strip newlines chars before searching for startup failure by @nick-stroud in #783
  • Allow local disk labels, merged with var.labels by @heyealex in #764
  • Add PBS Pro integration test by @tpdownes in #782
  • Add a validator for unused modules in the "use" list by @heyealex in #760
  • Bump google.golang.org/api from 0.104.0 to 0.105.0 by @dependabot in #786
  • Fix Quantum AI example by pinning to g++ 10 instead of 11 by @tpdownes in #788
  • Make URL for deployment tarball cut-and-paste-able. by @tpdownes in #789
  • Add no_comma_params option in "cloud_parameters" by @heyealex in #765
  • Add partition level startup script variables by @heyealex in #785
  • Add slurm gcp v5 integration test with startup scripts by @heyealex in #790
  • Increase HTCondor installation timeout by @tpdownes in #792
  • Run dependabot at repeatable weekly interval by @tpdownes in #793
  • Add disable_public_ips option to node group module by @heyealex in #791
  • Sourcereader wrapfs workaround by @thiagosgobe in #748
  • Updating develop post-release of 1.10.1 by @nick-stroud in #801
  • Add option to set instance template for login and controller slurm-gcp v5 modules by @heyealex in #787
  • Remove singularity install to avoid failed package install by @heyealex in #804
  • Bump github.com/go-git/go-billy/v5 from 5.3.1 to 5.4.0 by @dependabot in #802
  • Automate gpu guest accelerator in vm-instance if not set by @heyealex in #805
  • Bump google.golang.org/api from 0.105.0 to 0.106.0 by @dependabot in #806
  • Apply gpu_definition to slurm modules by @heyealex in #807
  • Bump cloud.google.com/go/serviceusage from 1.4.0 to 1.5.0 by @dependabot in #809
  • Bump cloud.google.com/go/compute from 1.14.0 to 1.15.0 by @dependabot in #808
  • Modification to handling of django key to keep it local only to webse… by @mattstreet-nag in #755
  • Pass family through instead of looking up image to allow compute nodes to pick up new version within family by @nick-stroud in #810
  • Fix broken link to application tutorial diagram by @nick-stroud in #813
  • Enable Slurm v4 image to be specified by name by @nick-stroud in #814
  • Add slash to network storage output for pre-existing file systems by @heyealex in #812
  • Bump google provider max version to 4.47.0 by @cboneti in #818
  • Enable cleanup of active compute nodes on destroy for high io test by @nick-stroud in #819
  • Bump github.com/aws/aws-sdk-go from 1.15.78 to 1.33.0 by @dependabot in #821
  • Move directory check to after embedded checks by @heyealex in #822
  • Adding support for existing GCS bucket in startup script module. by @soumyapani in #820
  • Update TF google provider version to 4.48 by @heyealex in #823
  • Rolling version to 1.11.0 by @nick-stroud in #841
  • Release v1.11.0 by @nick-stroud in #837
  • Allow Dependabot YAML parser to read time value by @tpdownes in #842

New Contributors

Full Changelog: v1.10.1...v1.11.0

v1.10.1: Update to Slurm v5.3, Bug Fixes, Documentation Updates

22 Dec 00:02
3c03c9a

Key New Features

  • All Slurm v5 modules have been updated from v5.2.0 -> v5.3.0. For more information, see the changelog for Slurm on GCP.

Improvements

What's Changed

  • Cleanup examples README by @heyealex in #752
  • Update login and controller to use standard image setting format by @heyealex in #754
  • Remove duplicated module and example lists in community READMEs by @heyealex in #750
  • Bump cloud.google.com/go/compute from 1.12.1 to 1.14.0 by @dependabot in #759
  • Update guidance to use incremental placement to avoid deadlock by @nick-stroud in #766
  • Remove outdated warning in node_groups variable by @heyealex in #763
  • Always include a startup script with a pre-determined name even if script is empty string by @nick-stroud in #777
  • Always include a startup script with a pre-determined name even if script is empty string by @nick-stroud in #778
  • Merge main into develop after release 1.10.0 by @cboneti in #780
  • Configure dependency review by @cboneti in #781
  • Bump oauthlib from 3.2.0 to 3.2.1 in /community/front-end/ofe by @dependabot in #769
  • Bump pyjwt from 2.3.0 to 2.4.0 in /community/front-end/ofe by @dependabot in #770
  • Bump django from 3.2.12 to 3.2.16 in /community/front-end/ofe by @dependabot in #771
  • Bump protobuf from 3.19.4 to 3.19.5 in /community/front-end/ofe by @dependabot in #772
  • Bump google.golang.org/api from 0.103.0 to 0.104.0 by @dependabot in #774
  • Bump certifi from 2021.10.8 to 2022.12.7 in /community/front-end/ofe by @dependabot in #779
  • Update hybrid docs to conform to 5.3.0 by @heyealex in #794
  • Update slurm v5.3.0 by @heyealex in #795
  • Rolling google terraform provider version to 4.46.0 by @nick-stroud in #797
  • Fix Quantum AI example by pinning to g++ 10 instead of 11 (known failure) by @nick-stroud in #799
  • Rolling the Toolkit version to 1.10.1 by @nick-stroud in #798
  • Version 1.10.1 by @nick-stroud in #796

Full Changelog: v1.10.0...v1.10.1

v1.10.0: Open Front End and new Batch MPI example

07 Dec 20:49
5693e89

Key New Features

  • Open Front-End Web UI added in community/front-end/ofe
  • New Batch MPI Example running WRF

Version updates

  • spack-install: default spack version updated from v0.18.0 to v0.19.0.

Improvements

  • Fixed a bug where ghpc would exit with an error but with rc=0 instead of rc=1 when failing to overwrite a deployment folder.
  • New integration tests.
  • Improved documentation and documentation links.
  • The Google Cloud Terraform provider is now pinned to the latest stable version

Bug Fixes

  • nfs-server: Fixed bug when deploying with multiple mount points that share the same destination filenames
  • wait-for-startup: Timeouts now properly reported (vs previous unknown errors)

What's Changed

New Contributors

Full Changelog: v1.9.0...v1.10.0

v1.9.0: Altair PBS Pro, Core Support for Batch, Simplified Network Storage

11 Nov 23:39
0f0f70e

Key New Features

New Resources

  • schedmd-slurm-gcp-v5-node-group: Support module that defines one or more node groups for use in a schedmd-slurm-gcp-v5-partition.
  • PBS Pro Modules:
    • pbspro-execution: Provisions one or more PBS execution hosts to run jobs in a PBS Professional cluster.
    • pbspro-client: Provisions one or more PBS Client hosts to submit jobs to a PBS Professional cluster.
    • pbspro-server: Provisions a PBS Server Host to operate and administer a PBS Professional cluster.
    • pbspro-install: Creates Toolkit runners that download PBS Pro RPM packages and install them with configuration settings as documented in the PBS Pro "Big Book".
    • pbspro-preinstall: Uploads PBS Pro RPM packages and, optionally, a license file to Google Cloud Storage.
    • pbspro-qmgr: Creates a Toolkit runner that performs administrative PBS configuration on a PBS server.

Resource Improvements

Version updates

Deprecations

v1.8.0: Improved startup-script automation, multiple network interfaces in vm-instance, escapes for variable characters

02 Nov 17:44
78bb2bd

Key New Features

  • Ansible install script is automatically installed if it's detected as a dependency of other runners.
  • Multiple network interfaces can be added in vm-instance.
  • Ability to escape variable characters in module settings.
  • Remote filesystems now supply client installation and mounting scripts
  • Remote filesystem mounting scripts no longer depend upon Ansible, significantly reducing time before filesystems are available

Resource Improvements

  • vm-instance: Support for multiple network interfaces.
  • startup-script: Ansible installation script automatically included when other runners depend on it.

Improvements

  • Escape variable characters: "\$(...)" evaluates to "$(...)"
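
The escape rule can be illustrated with a small substitution routine: an unescaped $(name) is replaced by a value, while \$(name) loses its backslash and is kept literally. This is a minimal Python sketch of the behavior described above, not the Toolkit's parser; the function name `expand` and the flat `variables` dictionary are assumptions made for this example.

```python
import re

# An optional leading backslash, then $( ... ) with no nested parentheses.
_VAR = re.compile(r"(\\?)\$\(([^)]*)\)")


def expand(text: str, variables: dict) -> str:
    """Substitute $(name) from variables; turn \\$(name) into a literal $(name)."""
    def repl(match):
        escaped, name = match.group(1), match.group(2)
        if escaped:
            # Escaped reference: drop the backslash and keep the text as-is.
            return "$(" + name + ")"
        # Unescaped reference: look the name up (raises KeyError if unknown).
        return str(variables[name])
    return _VAR.sub(repl, text)


print(expand(r"echo \$(hostname) on $(vars.zone)", {"vars.zone": "us-central1-a"}))
# prints: echo $(hostname) on us-central1-a
```

This is useful when a setting contains shell command substitution that should reach the VM unchanged.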

What's Changed

New Contributors

Full Changelog: v1.7.0...v1.8.0

v1.7.0: Improved blueprint validation, official support for Ubuntu, and bug fixes for Slurm v5

19 Oct 19:40
f4ed7c1

Key New Features

Improvements

  • Batch modules now support Shared VPC networks
  • VPC module enables jumbo frames by default
  • AMD-optimized blueprint includes the OpenFOAM application compiled for the Zen 3 microarchitecture
  • A new example blueprint demonstrates using local SSD disks with VM instances
  • A new example blueprint demonstrates installation of StarCCM+ CFD simulation application

Bug Fixes

  • Resolve Slurm v5 startup-script timeout errors by demonstrating the use of a build VM to install Spack and optimized applications
  • Fix incompatibility between DDN Exascaler (Lustre) and Slurm v5 modules

What's Changed

New Contributors

Full Changelog: v1.6.0...v1.7.0

v1.6.0: DDN-EXAScaler update and improved functionality, Source modules from generic git repos

04 Oct 23:32
54270c1

Key New Features

  • DDN-EXAScaler module version update and further support added for DDN-EXAScaler with other modules.
  • Import modules from gitlab and other generic git repositories.

Resource Improvements

Version updates

Improvements

  • git commit and branch information included when running ghpc --version.
  • Shell runners are run as an executable rather than sourced.
  • Documentation for a Slurm on GCP hybrid demo using a cloud-based Slurm controller.
  • hpc-cluster-amd-slurmv5.yaml example uses a builder VM for spack installation.
  • Ability to import modules from generic git repositories with the git:: prefix in source.
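
One way to picture the git:: prefix is as a marker that tells the tool to fetch the module with git regardless of the host. The Python sketch below classifies a source string on that basis. It is an illustration only: the category names, the other prefixes checked, and the function name `source_kind` are assumptions for this example and do not describe how ghpc resolves sources internally.

```python
def source_kind(source: str) -> str:
    """Classify a module source string (hypothetical categories).

    "git"      : explicit git:: prefix, any git host
    "github"   : GitHub shorthand without the prefix
    "local"    : relative or absolute filesystem path
    "embedded" : anything else, assumed to ship with the tool
    """
    if source.startswith("git::"):
        return "git"
    if source.startswith(("github.com/", "git@github.com:")):
        return "github"
    if source.startswith(("./", "../", "/")):
        return "local"
    return "embedded"


# The repository URLs below are placeholders, not real module locations.
for s in ("git::https://gitlab.com/example/repo.git//modules/foo?ref=v1",
          "github.com/example/repo//modules/foo",
          "./modules/foo",
          "modules/network/vpc"):
    print(s, "->", source_kind(s))
```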

Deprecations

  • In the variable image of DDN-EXAScaler, name is no longer supported in favor of project and family.

Bug Fixes

  • Fixed startup script failure in HTCondor autoscaler configuration

What's Changed

New Contributors

Full Changelog: v1.5.0...v1.6.0