Releases: GoogleCloudPlatform/cluster-toolkit
v1.14.0: HTCondor highly available, HCLS blueprint
Key New Features
- HCLS blueprint supports running GROMACS on GPUs and has added several tutorials.
- Support for highly available HTCondor pools
  - Job queue (SchedD) high availability remains experimental; see the README
Module Improvements
- `vpc`: new option to enable firewall rule that allows tunneling of Windows Remote Desktop connections
- `schedmd-slurm-gcp-v5-partition`: all deprecated variables have been removed; these have migrated to `schedmd-slurm-gcp-v5-node-group`
- `htcondor-configure`:
  - job history will now include VM instance ID, zone and machine type
  - VMs are now provisioned with minimally-permissioned IDTOKENs for their respective daemons (e.g. ADVERTISE_STARTD)
- `startup-script`: installation script for Cloud Ops Agent on Debian platforms will retry when other processes are blocking apt operations
- `htcondor-execute-point`: add a simple health check of port 9618 on any machine within the execute point
- `vm-instance`: vm-instance can be named using both a prefix and the deployment name
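The htcondor-execute-point health check on port 9618 can be pictured as a plain TCP probe. The release note does not say how the module implements the check, so the following is only a minimal sketch under that assumption; the function name `port_is_open` and the timeout value are hypothetical.

```python
import socket

# Port named in the release note for the execute point health check.
HTCONDOR_PORT = 9618


def port_is_open(host: str, port: int = HTCONDOR_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A probe like this only shows that something is listening on the port; it does not confirm that the HTCondor daemons behind it are healthy.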
Improvements
- Improved error message when YAML blueprint has syntax errors preventing it from being loaded
- Regular updates to Go and Python dependencies to address potential security vulnerabilities
- Fixed Open Front End (OFE) issue with static content (icons) not displaying properly
What's Changed
- Add Windows Remote Desktop IAP firewall rule by @tpdownes in #885
- Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #886
- Fix: having the same share name and local mount caused slurm failure by @nick-stroud in #887
- Implicitly add outputs to modules when they are being used across deployment groups by @tpdownes in #878
- Reorder validator list to test blueprint correctness first by @heyealex in #889
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #890
- Adding example sbatch and config for Factor Xa protein by @nick-stroud in #888
- Update HCLS blueprint examples to run Gromacs w/ GPUs by @nick-stroud in #891
- Add support for highly available HTCondor Central Managers by @tpdownes in #892
- Remove deprecations from slurm-gcp v5 partition by @heyealex in #893
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @tpdownes in #896
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #894
- Ensure Open Front End dependabot updates target develop by @tpdownes in #897
- Fix: Always generate Batch instance template to avoid known at apply time error by @nick-stroud in #898
- Update sbatch to copy results to output bucket & minor tweaks by @nick-stroud in #895
- Add troubleshooting documentation for filestore share name exportfs bug by @nick-stroud in #899
- Bump github.com/hashicorp/hcl/v2 from 2.16.0 to 2.16.1 by @dependabot in #900
- Fix addlicense check in weekly image building by @tpdownes in #901
- Update HTCondor modules by @tpdownes in #902
- Bump django from 3.2.16 to 3.2.17 in /community/front-end/ofe by @dependabot in #905
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @dependabot in #906
- Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @dependabot in #907
- Bump TFLint version in image to latest by @tpdownes in #903
- Add instructions to ssh to VM created by vm-instance by @nick-stroud in #880
- Address terraform_unused_required_providers errors by @tpdownes in #908
- Manage HTCondor yum repo configuration directly by @tpdownes in #904
- Bump github.com/aws/aws-sdk-go from 1.33.0 to 1.34.0 by @dependabot in #911
- Fail integration tests on validation warnings. by @mr0re1 in #910
- Set minimal scopes for HTCondor IDTOKENs by @tpdownes in #919
- Fix `cmd/root_test.go` test runs from linked Git worktrees. by @mr0re1 in #918
- OFE update 14/02/2023. by @ek-nag in #913
- Improve error message for yaml parsing failures by @heyealex in #923
- HTCondor job track machine information by @tpdownes in #924
- Update develop with release v1.13.0 by @nick-stroud in #928
- Add explicit output dependencies to HTCondor by @tpdownes in #925
- Pin terraform google provider to v4.53.1 by @nick-stroud in #929
- Fix root_test failure on MacOS by @mr0re1 in #932
- Update htcondor-configure README example snippet by @tpdownes in #935
- Bump github.com/hashicorp/go-getter from 1.6.2 to 1.7.0 by @dependabot in #930
- Add retries to cloud ops install by @heyealex in #933
- Bump django from 3.2.17 to 3.2.18 in /community/front-end/ofe by @dependabot in #922
- Remove unused error message by @tpdownes in #939
- Bump google.golang.org/api from 0.109.0 to 0.110.0 by @dependabot in #937
- Fix typo in hcls instructions command by @nick-stroud in #940
- Update hcls example to use lysozyme protein instead of factor xa by @nick-stroud in #942
- HTCondor Job Queue High Availability by @tpdownes in #934
- Refactor useModule by @tpdownes in #941
- Bugfix to HTCondor autoscaler script by @tpdownes in #945
- Update hcls spack builder to use c2 machine by @nick-stroud in #948
- Enable OS Login by default in HTCondor execute points by @tpdownes in #944
- Remove dependency on unused module by @mr0re1 in #947
- Add health check for HTCondor VMs by @tpdownes in #946
- HCLS tutorial update by @nick-stroud in #950
- Bump github.com/spf13/afero from 1.9.3 to 1.9.4 by @dependabot in #955
- Bump github.com/zclconf/go-cty from 1.12.1 to 1.13.0 by @dependabot in #954
- Allow VM instance name to include prefix and deployment name by @nick-stroud in #949
- Use HTCondor Python bindings in autoscaler by @tpdownes in #951
- Update hcls Lysozyme example to include visualization instructions by @nick-stroud in #958
- Unify validator...
v1.13.0: HCLS Example Blueprint, New Chrome Remote Desktop Module, & Image Building Improvements
Key New Features
- New Healthcare and Life Sciences (HCLS) example running GROMACS on GPUs.
- New experimental `chrome-remote-desktop` module creates a GPU accelerated virtual machine that can be accessed using Chrome Remote Desktop.
- Improvements to `custom-image`, so that startup script failures now result in a failed build, along with several additional features.
Other Improvements
- `vm-instance`: Scope changed to `cloud-platform`.
Version updates
- `install-htcondor`: Update HTCondor release to current 10.x series.
- Slurm on GCP updated to version 5.4.1 (PR).
- Google Terraform Provider pinned to version 4.51.0 and will be bumped with each future release.
What's Changed
- Improve expandVariables error message by @thiagosgobe in #838
- Use keyword must be explicit across groups by @tpdownes in #852
- Improve unit test coverage of config package by @tpdownes in #857
- Add security vulnerability reporting policy by @heyealex in #853
- Update HTCondor release to current 10.x series by @tpdownes in #856
- Update varReference implementation and usage by @tpdownes in #858
- Remote desktop module - CRD baseline by @saltysoup in #811
- Bump SchedMD community v5 modules to 5.4.1 from 5.4.0 by @SkylerMalinowski in #862
- Add name_prefix to chrome-remote-desktop by @nick-stroud in #863
- Change vm-instance scope to cloud-platform to follow scopes best practice by @nick-stroud in #864
- Update pre-commit hooks by @tpdownes in #866
- Adopt Packer googlecompute plugin 1.1.0 by @tpdownes in #867
- Bump cloud.google.com/go/compute from 1.15.1 to 1.18.0 by @dependabot in #865
- main to develop for release of 1.12.0 by @cboneti in #872
- Improve Packer module by @tpdownes in #870
- Pinning Google terraform provider to 4.51.0 by @cboneti in #875
- Eliminate warnings and align variable naming by @tpdownes in #873
- Adds HCLS Blueprint and deploy instructions by @nick-stroud in #871
- Rename comment sections to align with architecture naming by @nick-stroud in #879
- Improve readability of useModule by @tpdownes in #877
- Add links to hybrid docs from other modules by @heyealex in #876
- Bump google.golang.org/api from 0.108.0 to 0.109.0 by @dependabot in #884
- Bump github.com/hashicorp/hcl/v2 from 2.15.0 to 2.16.0 by @dependabot in #883
- Bump django from 3.2.16 to 3.2.17 in /community/front-end/ofe by @dependabot in #881
- Bump cryptography from 36.0.1 to 39.0.1 in /community/front-end/ofe by @nick-stroud in #914
- Bump oauthlib from 3.2.1 to 3.2.2 in /community/front-end/ofe by @nick-stroud in #915
- Rolling version to 1.13.0 by @nick-stroud in #921
- Release v1.13.0 by @nick-stroud in #916
New Contributors
- @saltysoup made their first contribution in #811
Full Changelog: v1.12.0...v1.13.0
v1.12.0: Google Cloud Storage module and Fluent Tutorial
Key New Features
- Added `google-storage-bucket`, which allows creating Google Cloud Storage buckets.
- New Fluent tutorial blueprint.
New Resources
- `google-storage-bucket`: Creates a Google Cloud Storage bucket.
Improvements
- Improved documentation and module automation for GPU support.
- Various improvements in the `ghpc` engine code.
Bug Fixes
- Fixed error when `ghpc` was run outside of the HPC-Toolkit folder (PR).
- Fixed category-field bug preventing some users from deploying HPC monitoring dashboards.
Version updates
- DAOS examples updated to use google-cloud-daos v0.3.0.
- Slurm on GCP updated to version 5.4.0 (PR).
- Updated cloud.google.com/go/compute from 1.15.0 to 1.15.1.
- Updated google.golang.org/api from 0.106.0 to 0.108.0.
- Google Terraform Provider pinned to version 4.49.0 and will be bumped with each future release.
What's Changed
- Bump cloud.google.com/go/compute from 1.15.0 to 1.15.1 by @dependabot in #824
- Refactor reference ID function by @tpdownes in #828
- Eliminate unnecessary guard around map access warning (S1036) by @tpdownes in #829
- Initial support for intergroup variable references by @tpdownes in #831
- Inject build id into Omnia test, drop to last build by @heyealex in #832
- Run config validation test in a different directory than the repo, fix bugs found when making the change by @heyealex in #827
- Bump google.golang.org/api from 0.106.0 to 0.107.0 by @dependabot in #835
- Remove category field from HPC dashboard json by @heyealex in #834
- Changed DAOS community examples to use v0.3.0 terraform modules by @mark-olson in #833
- main to develop for release of v1.11.0 by @nick-stroud in #844
- Pinning google terraform provider to 4.49.0 by @nick-stroud in #847
- GCS bucket module by @nick-stroud in #836
- Updates for slurm-gcp 5.4.0 by @SkylerMalinowski in #843
- Increasing default login node disk size on the ofe by @ek-nag in #848
- Add validation logic for references to outputs from prior modules by @tpdownes in #840
- Add documentation for accelerator support in the HPC Toolkit by @heyealex in #839
- Bump google.golang.org/api from 0.107.0 to 0.108.0 by @dependabot in #850
- Add Fluent Tutorial to Develop by @jrossthomson in #855
- Add security vulnerability reporting policy by @heyealex in #859
- Rolling version to 1.12.0 by @cboneti in #869
- Release v1.12.0 by @cboneti in #868
New Contributors
- @mark-olson made their first contribution in #833
- @ek-nag made their first contribution in #848
Full Changelog: v1.11.0...v1.12.0
v1.11.0: Usability Improvements for GPUs, Validation of `use` Field, & Miscellaneous Slurm Improvements
Key New Features
- GPU `type` and `count` are automatically populated when using A2 series machines for `vm-instance` and Slurm v5 node-group, controller, & login-node.
- `ghpc` now validates that modules linked using the `use` field have common outputs and settings.
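To illustrate how GPU `type` and `count` could be derived from an A2 machine type, here is a minimal sketch. It is not the Toolkit's implementation: the function name `guest_accelerator` and the returned dictionary shape are hypothetical. The accelerator names reflect the author's understanding of Compute Engine naming for A2 machines, and the count is read from the trailing `-Ng` part of the machine type.

```python
import re
from typing import Optional

# Accelerator type per A2 family (illustrative mapping, see lead-in).
_A2_ACCELERATORS = {
    "highgpu": "nvidia-tesla-a100",
    "megagpu": "nvidia-tesla-a100",
    "ultragpu": "nvidia-a100-80gb",
}

# Matches names such as "a2-highgpu-8g"; the digits give the GPU count.
_A2_PATTERN = re.compile(r"^a2-(highgpu|megagpu|ultragpu)-(\d+)g$")


def guest_accelerator(machine_type: str) -> Optional[dict]:
    """Return {"type", "count"} for an A2 machine type, or None for other types."""
    match = _A2_PATTERN.match(machine_type)
    if match is None:
        return None
    family, count = match.groups()
    return {"type": _A2_ACCELERATORS[family], "count": int(count)}
```

For example, `guest_accelerator("a2-highgpu-8g")` yields a count of 8, while a non-A2 type such as `n2-standard-4` yields `None`, leaving any user-provided setting untouched.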
Resource Improvements
- `startup-script`: Added option to use an existing Google Cloud Storage bucket.
- Slurm v5 partition: Now supports partition level startup scripts.
- Slurm v5 controller & login-node: Now has the option to use instance templates.
- Slurm v5 node-group, controller, & login-node: Now supports labels on disks.
- Slurm v5 node-group: Added option to enable public IPs.
- Slurm v4 partition: Compute nodes pick up new version within image family.
- Slurm v4 partition, controller, & login-node: Image can now be specified by name, in addition to family.
Version updates
- Google Terraform Provider pinned to version 4.48.0 and will be bumped with each future release.
What's Changed
- Strip newlines chars before searching for startup failure by @nick-stroud in #783
- Allow local disk labels, merged with var.labels by @heyealex in #764
- Add PBS Pro integration test by @tpdownes in #782
- Add a validator for unused modules in the "use" list by @heyealex in #760
- Bump google.golang.org/api from 0.104.0 to 0.105.0 by @dependabot in #786
- Fix Quantum AI example by pinning to g++ 10 instead of 11 by @tpdownes in #788
- Make URL for deployment tarball cut-and-paste-able. by @tpdownes in #789
- Add no_comma_params option in "cloud_parameters" by @heyealex in #765
- Add partition level startup script variables by @heyealex in #785
- Add slurm gcp v5 integration test with startup scripts by @heyealex in #790
- Increase HTCondor installation timeout by @tpdownes in #792
- Run dependabot at repeatable weekly interval by @tpdownes in #793
- Add disable_public_ips option to node group module by @heyealex in #791
- Sourcereader wrapfs workaround by @thiagosgobe in #748
- Updating develop post-release of 1.10.1 by @nick-stroud in #801
- Add option to set instance template for login and controller slurm-gcp v5 modules by @heyealex in #787
- Remove singularity install to avoid failed package install by @heyealex in #804
- Bump github.com/go-git/go-billy/v5 from 5.3.1 to 5.4.0 by @dependabot in #802
- Automate gpu guest accelerator in vm-instance if not set by @heyealex in #805
- Bump google.golang.org/api from 0.105.0 to 0.106.0 by @dependabot in #806
- Apply gpu_definition to slurm modules by @heyealex in #807
- Bump cloud.google.com/go/serviceusage from 1.4.0 to 1.5.0 by @dependabot in #809
- Bump cloud.google.com/go/compute from 1.14.0 to 1.15.0 by @dependabot in #808
- Modification to handling of django key to keep it local only to webse… by @mattstreet-nag in #755
- Pass family through instead of looking up image to allow compute nodes to pick up new version within family by @nick-stroud in #810
- Fix broken link to application tutorial diagram by @nick-stroud in #813
- Enable Slurm v4 image to be specified by name by @nick-stroud in #814
- Add slash to network storage output for pre-existing file systems by @heyealex in #812
- Bump google provider max version to 4.47.0 by @cboneti in #818
- Enable cleanup of active compute nodes on destroy for high io test by @nick-stroud in #819
- Bump github.com/aws/aws-sdk-go from 1.15.78 to 1.33.0 by @dependabot in #821
- Move directory check to after embedded checks by @heyealex in #822
- Adding support for existing GCS bucket in startup script module. by @soumyapani in #820
- Update TF google provider version to 4.48 by @heyealex in #823
- Rolling version to 1.11.0 by @nick-stroud in #841
- Release v1.11.0 by @nick-stroud in #837
- Allow Dependabot YAML parser to read time value by @tpdownes in #842
New Contributors
- @soumyapani made their first contribution in #820
Full Changelog: v1.10.1...v1.11.0
v1.10.1: Update to Slurm v5.3, Bug Fixes, Documentation Updates
Key New Features
- All Slurm v5 modules have been updated from v5.2.0 -> v5.3.0. For more information, see the changelog for Slurm on GCP.
Improvements
- Standardize Slurm V5 `instance_image` variable format. Previous method of image definition is deprecated but will continue to work for the time being.
- Recommended compact placement for vm-instance updated for improved usability.
What's Changed
- Cleanup examples README by @heyealex in #752
- Update login and controller to use standard image setting format by @heyealex in #754
- Remove duplicated module and example lists in community READMEs by @heyealex in #750
- Bump cloud.google.com/go/compute from 1.12.1 to 1.14.0 by @dependabot in #759
- Update guidance to use incremental placement to avoid deadlock by @nick-stroud in #766
- Remove outdated warning in node_groups variable by @heyealex in #763
- Always include a startup script with a pre-determined name even if script is empty string by @nick-stroud in #777
- Always include a startup script with a pre-determined name even if script is empty string by @nick-stroud in #778
- Merge main into develop after release 1.10.0 by @cboneti in #780
- Configure dependency review by @cboneti in #781
- Bump oauthlib from 3.2.0 to 3.2.1 in /community/front-end/ofe by @dependabot in #769
- Bump pyjwt from 2.3.0 to 2.4.0 in /community/front-end/ofe by @dependabot in #770
- Bump django from 3.2.12 to 3.2.16 in /community/front-end/ofe by @dependabot in #771
- Bump protobuf from 3.19.4 to 3.19.5 in /community/front-end/ofe by @dependabot in #772
- Bump google.golang.org/api from 0.103.0 to 0.104.0 by @dependabot in #774
- Bump certifi from 2021.10.8 to 2022.12.7 in /community/front-end/ofe by @dependabot in #779
- Update hybrid docs to conform to 5.3.0 by @heyealex in #794
- Update slurm v5.3.0 by @heyealex in #795
- Rolling google terraform provider version to 4.46.0 by @nick-stroud in #797
- Fix Quantum AI example by pinning to g++ 10 instead of 11 (known failure) by @nick-stroud in #799
- Rolling the Toolkit version to 1.10.1 by @nick-stroud in #798
- Version 1.10.1 by @nick-stroud in #796
Full Changelog: v1.10.0...v1.10.1
v1.10.0: Open Front End and new Batch MPI example
Key New Features
- Open Front-End Web UI added in `community/front-end/ofe`
- New Batch MPI Example running WRF
Version updates
- `spack-install`: default spack version updated from v0.18.0 to v0.19.0.
Improvements
- Fixed a bug where `ghpc` would exit with an error but with rc=0 instead of rc=1 when failing to overwrite a deployment folder.
- New integration tests.
- Improved documentation and documentation links.
- The Google Cloud Terraform provider is now pinned to the latest stable version
Bug Fixes
- `nfs-server`: Fixed bug when deploying with multiple mount points that share the same destination filenames
- `wait-for-startup`: Timeouts now properly reported (vs previous unknown errors)
What's Changed
- Update pbspro-preinstall by @tpdownes in #722
- Add Batch MPI example running WRF by @nick-stroud in #718
- Merge main into develop after release 1.9.0 by @heyealex in #728
- Increase omnia timeout by @cboneti in #742
- Handling timeouts in wait for startup by @cboneti in #743
- Cap TF Google provider version at latest stable (4.43) by @heyealex in #744
- make overwrite error exit with rc=1 by @kkr16 in #701
- Add Slurm on GCP v5 High IO example by @heyealex in #730
- Add the HPC High IO with Slurm on GCP v5 example to the integration tests by @heyealex in #731
- Add reference to htcondor tutorial from example documentation by @nick-stroud in #747
- Default enable reconfig high io by @heyealex in #734
- Fix for issue #746 - removed invalid options for filesystem (STANDARD… by @mattstreet-nag in #749
- Update max google provider to 4.44.1 by @heyealex in #751
- Rolling version to 1.10.0 by @cboneti in #762
- Version 1.10.0 by @cboneti in #767
New Contributors
- @mattstreet-nag made their first contribution in #749
Full Changelog: v1.9.0...v1.10.0
v1.9.0: Altair PBS Pro, Core Support for Batch, Simplified Network Storage
Key New Features
- Initial support for Altair PBS Professional
- Batch modules promoted from experimental to core support
- Mounting of network storage via `use` field. See full documentation.
- GCS Fuse Support with `pre-existing-network-storage`.
New Resources
- `schedmd-slurm-gcp-v5-node-group`: Support module for defining one or more node groups used in defining a `schedmd-slurm-gcp-v5-partition`.
- PBS Pro Modules:
  - `pbspro-execution`: Provisions one or more PBS execution hosts to run jobs in a PBS Professional cluster.
  - `pbspro-client`: Provisions one or more PBS Client hosts to submit jobs to a PBS Professional cluster.
  - `pbspro-server`: Provisions a PBS Server Host to operate and administer a PBS Professional cluster.
  - `pbspro-install`: Creates Toolkit runners that download PBS Pro RPM packages and install them with configuration settings as documented in the PBS Pro "Big Book".
  - `pbspro-preinstall`: Uploads PBS Pro RPM packages and, optionally, a license file to Google Cloud Storage.
  - `pbspro-qmgr`: Creates a Toolkit runner that performs administrative PBS configuration on a PBS server.
Resource Improvements
- Core support for Batch modules:
  - `batch-job-template`: Use Native Batch mounting for nfs.
  - `batch-login-node`: Support for multiple batch jobs used by a single Batch login node.
- `schedmd-slurm-gcp-v5-partition`:
  - Explicit zone preference option.
  - Support for multiple node groups per partition.
- `pre-existing-network-storage`: GCS Fuse support
Version updates
Deprecations
- Variables in `schedmd-slurm-gcp-v5-partition` that are now in `schedmd-slurm-gcp-v5-node-group` are now deprecated and cannot be used for defining a default node group.
- Batch modules moved location. Blueprints referencing the old `source` location will have to be updated.
v1.8.0: Improved startup-script automation, multiple network interfaces in vm-instance, escapes for variable characters
Key New Features
- Ansible install script is automatically installed if it's detected as a dependency of other runners.
- Multiple network interfaces can be added in `vm-instance`.
- Ability to escape variable characters in module settings.
- Remote filesystems now supply client installation and mounting scripts
- Remote filesystem mounting scripts no longer depend upon Ansible, significantly reducing time before filesystems are available
Resource Improvements
- `vm-instance`: Support for multiple network interfaces.
- `startup-script`: Ansible installation script automatically included when other runners depend on it.
Improvements
- Escape variable characters: `"\$(...)"` evaluates to `"$(...)"`
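The escape rule can be illustrated with a small sketch. This is not the `ghpc` implementation (which is written in Go); it only assumes that an unescaped `$(name)` is replaced by a variable value while a backslash-prefixed `\$(name)` is passed through as a literal `$(name)`. The function name `expand` and the flat `variables` dictionary are hypothetical.

```python
import re

# Optional leading backslash, then $( ... ) with no nested parentheses.
_REFERENCE = re.compile(r"(\\?)\$\(([^)]*)\)")


def expand(text: str, variables: dict) -> str:
    """Expand $(name) references; turn escaped \\$(name) into a literal $(name)."""

    def replace(match: re.Match) -> str:
        escaped, name = match.group(1), match.group(2)
        if escaped:
            # Drop the backslash and keep the reference text untouched.
            return "$(" + name + ")"
        # Unknown names raise KeyError, which surfaces typos early.
        return str(variables[name])

    return _REFERENCE.sub(replace, text)
```

With `variables = {"vars.zone": "us-central1-a"}`, the input `echo \$(hostname) on $(vars.zone)` becomes `echo $(hostname) on us-central1-a`, so the shell command substitution survives for later evaluation on the VM.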
What's Changed
- Remove Batch startup script workaround by @nick-stroud in #613
- develop - Handle case that cloud-batch-job does not have startup script in metadata by @nick-stroud in #615
- Improve Batch output instructions and add README to login node by @nick-stroud in #616
- Fix link to the hpc-cluster-localssd example by @heyealex in #620
- Fix EXAScaler issue with Slurm-on-GCP V5 by @cboneti in #622
- Add additional network interfaces to vm-instance by @heyealex in #614
- Added Simcenter StarCCM+ tutorial blueprint. by @jrossthomson in #626
- Refactor mount shell script to take args by @nick-stroud in #625
- Fix: logs were not being captured for invalid return status by @nick-stroud in #630
- Add end-of-file-fixer pre-commit hook to enforce trailing newlines by @tpdownes in #632
- Bump github.com/googleapis/gax-go/v2 from 2.5.1 to 2.6.0 by @dependabot in #635
- Bump google.golang.org/api from 0.98.0 to 0.99.0 by @dependabot in #634
- Add info on connecting VPC networks to vm-instance by @heyealex in #627
- Update pre-commit hooks by @tpdownes in #633
- Install Ansible automatically when other runners depend upon it by @tpdownes in #629
- Bump github.com/spf13/cobra from 1.5.0 to 1.6.0 by @dependabot in #619
- Escape variables capability by @thiagosgobe in #604
- Mount script also makes entry in /etc/fstab by @nick-stroud in #628
- Improve Ansible linting by @tpdownes in #636
- Merge release 1.6.0 back to develop branch by @tpdownes in #638
- Add mounting runners to network_storage output by @nick-stroud in #631
- Bump github.com/zclconf/go-cty from 1.11.0 to 1.11.1 by @dependabot in #641
- Improve and document mount.sh by @nick-stroud in #643
- Bump google.golang.org/api from 0.99.0 to 0.100.0 by @dependabot in #651
- Default to the compute engine service account if null is provided by @heyealex in #650
- Merge v1.7.0 release into develop by @tpdownes in #655
- Improve mounting for filestore and pre-existing-network-storage by @nick-stroud in #644
- Add pre-commit check to ensure duplicate file stay in sync by @nick-stroud in #648
- Remove remaining explicit installations of Ansible by @tpdownes in #658
- Add documentation for Slurm reconfigure option by @heyealex in #653
- Update version to 1.8.0 by @heyealex in #678
- Release v1.8.0 by @heyealex in #677
New Contributors
- @jrossthomson made their first contribution in #626
Full Changelog: v1.7.0...v1.8.0
v1.7.0: Improved blueprint validation, official support for Ubuntu, and bug fixes for Slurm v5
Key New Features
- A new test_apis_enabled validator will check whether Google Cloud services required by the blueprint are enabled
- The Cloud HPC Toolkit has added official support for Ubuntu 20.04 images
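Conceptually, a check like test_apis_enabled compares the services a blueprint needs against the services enabled in the project. The sketch below shows only that comparison, under the assumption that both lists are already available as strings; how the Toolkit gathers them is not covered by this release note, and the function name `missing_apis` is hypothetical.

```python
from typing import Iterable, List


def missing_apis(required: Iterable[str], enabled: Iterable[str]) -> List[str]:
    """Return the required services that are not enabled, sorted for stable output."""
    return sorted(set(required) - set(enabled))


if __name__ == "__main__":
    # Example service names; the lists themselves are made up for illustration.
    required = ["compute.googleapis.com", "file.googleapis.com", "batch.googleapis.com"]
    enabled = ["compute.googleapis.com", "storage.googleapis.com"]
    for api in missing_apis(required, enabled):
        print(f"service not enabled: {api}")
```

An empty result means every required service is enabled, which also covers a blueprint that requires no services at all.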
Improvements
- Batch modules now support Shared VPC networks
- VPC module enables jumbo frames by default
- AMD-optimized blueprint includes the OpenFOAM application compiled for the Zen 3 microarchitecture
- A new example blueprint demonstrates using local SSD disks with VM instances
- A new example blueprint demonstrates installation of StarCCM+ CFD simulation application
Bug Fixes
- Resolve Slurm v5 startup-script timeout errors by demonstrating the use of a build VM to install Spack and optimized applications
- Fix incompatibility between DDN Exascaler (Lustre) and Slurm v5 modules
What's Changed
- Add shared task for recover from terraform failure by @nick-stroud in #578
- Automatically pull serial port 1 logs on startup script failure by @nick-stroud in #580
- Bump github.com/hashicorp/hcl/v2 from 2.14.0 to 2.14.1 by @dependabot in #572
- Enable shared VPC to be used with Batch modules by @nick-stroud in #573
- Update default MTU for new VPCs by @tpdownes in #577
- Replace HTCondor integration test by @tpdownes in #583
- Remove blueprint_dir update in integration tests by @heyealex in #582
- Add documentation that cloud-batch-job supports shared vpc by @nick-stroud in #584
- Add Spack setup runner by @nick-stroud in #586
- Implement API validation by @tpdownes in #576
- Add documentation for API validation by @tpdownes in #585
- Document troubleshooting for Slurm startup script timeouts by @tpdownes in #590
- Bump github.com/zclconf/go-cty from 1.10.0 to 1.11.0 by @dependabot in #491
- Update wait for startup for qsim example to 2400s by @heyealex in #591
- Bump google.golang.org/api from 0.96.0 to 0.98.0 by @dependabot in #589
- Bump github.com/googleapis/gax-go/v2 from 2.4.0 to 2.5.1 by @dependabot in #588
- Fix test_apis_enabled to handle empty API list by @tpdownes in #592
- Add default Spack log file by @nick-stroud in #595
- Fix: passing non-existent cache was causing failures in some projects by @nick-stroud in #593
- Batch modules should require Batch API by @tpdownes in #597
- Support local module required APIs by @tpdownes in #596
- hybrid-slurm documentation for on-premise by @heyealex in #579
- Add ability to pass non-text variables through the cli `--vars` flag by @kkr16 in #574
- Additional Batch parameters by @nick-stroud in #594
- Add slurm v5 HPC Centos example to integration tests by @heyealex in #587
- Clarify and compact validator failure message by @tpdownes in #598
- Add slurm-gcp-v5-ubuntu2004 example to integration tests by @heyealex in #601
- Remove deployment_name var from vpc example as it is passed implicitly by @nick-stroud in #605
- Add variable to turn off public ips on Batch jobs by @nick-stroud in #609
- Update AMD blueprint with OpenFOAM application and test suite by @tpdownes in #606
- Filestore cleanup fails on existing instances by @heyealex in #610
- Develop by @omartin2010 in #608
- Add image support documentation by @heyealex in #611
- Push test deployment directory to GCS by @heyealex in #602
- release-candidate - Handle case that cloud-batch-job does not have startup script in metadata by @nick-stroud in #612
- Add a boilerplate blueprint to README by @nick-stroud in #617
- Back-porting EXAScaler fix for V5 on release candidate by @cboneti in #624
- Release Candidate: Simcenter StarCCM+ Tutorial Blueprint by @heyealex in #637
- Resolve conflicts on release-candidate by @nick-stroud in #646
- Release v1.7.0 by @tpdownes in #649
New Contributors
- @omartin2010 made their first contribution in #608
Full Changelog: v1.6.0...v1.7.0
v1.6.0: DDN-EXAScaler update and improved functionality, Source modules from generic git repos
Key New Features
- DDN-EXAScaler module version update and further support added for DDN-EXAScaler with other modules.
- Import modules from gitlab and other generic git repositories.
Resource Improvements
- `DDN-EXAScaler`: DDN EXAScaler Lustre will now work with vm-instance, cloud-batch-job, htcondor. More information can be found in the README.
- `schedmd-slurm-gcp-v5-controller`: Expose `enable_reconfigure` from Slurm on GCP underlying module.
- `SchedMD-slurm-on-gcp-controller`: Allow static nodes in partitions.
- `vm-instance`: Add `auto_delete_disk` variable.
- `spack-install`: Improved performance and bug fixes.
Version updates
- `DDN-EXAScaler`: Version 6.1.0
Improvements
- git commit and branch information included when running `ghpc --version`.
- Shell runners are run as an executable rather than sourced.
- Documentation for a Slurm on GCP hybrid demo using a cloud-based Slurm controller.
- `hpc-cluster-amd-slurmv5.yaml` example uses a builder VM for spack installation.
- Ability to import modules from generic git repositories with the `git::` prefix in source.
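One way to picture the `git::` prefix is as a marker that routes a module `source` string to a generic git fetcher. The sketch below is an assumption-laden illustration, not the `ghpc` logic: the category names, the function `classify_source`, and the rules for the non-`git::` cases are all hypothetical.

```python
def classify_source(source: str) -> str:
    """Classify a module `source` string into a hypothetical fetch category."""
    if source.startswith("git::"):
        # Explicit prefix from the release note: any git repository.
        return "generic-git"
    if source.startswith(("github.com/", "git@github.com:")):
        return "github"
    if source.startswith(("./", "../", "/")):
        return "local"
    # Anything else is treated here as a module shipped with the Toolkit.
    return "embedded"
```

For instance, a made-up source such as `git::https://gitlab.com/example/modules.git` would be classified as `generic-git`, while `./my-module` would be treated as a local path.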
Deprecations
- In the variable `image` of `DDN-EXAScaler`, `name` is no longer supported in favor of `project` and `family`.
Bugfixes
- Fixed startup script failure in HTCondor autoscaler configuration
What's Changed
- Add customized version output for debugging improvement by @thiagosgobe in #532
- Address permadiff in vm-instance module by @tpdownes in #542
- Exposing enable_reconfigure in Slurm-onGCP V5 by @cboneti in #537
- Adding Slurm on GCP V4 static nodes functionality by @cboneti in #544
- Set enable_smt default to false for slurm v5 modules by @heyealex in #545
- Run shell runners as executable by @nick-stroud in #543
- Add auto-delete boot disk as an option on vm-instance by @nick-stroud in #548
- Default slurm_cluster_name to deploy name in hybrid by @heyealex in #550
- Default scope now allows reading AND writing. by @sandwichmaker in #546
- Upgrade DDN-EXAScaler to v6.1.0 by @nick-stroud in #551
- Add troubleshooting for Slurm: network is unreachable by @nick-stroud in #549
- Add Epilog/Prolog scripts to install path in hybrid by @heyealex in #552
- Integrate DDN Lustre install script with startup-script by @nick-stroud in #553
- Warn users about deprecated 'name' argument for EXAScaler image by @nick-stroud in #555
- remove "kind:" from examples and docs where optional by @kkr16 in #547
- Bump cloud.google.com/go/compute from 1.9.0 to 1.10.0 by @dependabot in #554
- Address an idempotency in Spack install script by @tpdownes in #557
- Replace Spack installation in AMD example with a builder VM by @tpdownes in #558
- Eliminate 1 git checkout during Spack install by @tpdownes in #559
- Address dependency checker timeout failure by @tpdownes in #561
- Merge v1.5.0 release into develop by @tpdownes in #565
- Enable DDN Lustre client install with pre-existing-network-storage by @nick-stroud in #560
- Allow generic git sources by @heyealex in #564
- Add hybrid slurm demo example using 2 GCP projects. by @heyealex in #556
- Use fully-qualified Ansible module names where not used by @tpdownes in #567
- Perform regular cleanup of Filestore VPC network peerings by @tpdownes in #568
- Avoid spurious errors in Spack log by @tpdownes in #571
- Fix remove default urls by @tpdownes in #570
- Fix filestore peering network cleanup script by @tpdownes in #569
- Fix Ansible module for upgrading setuptools in HTCondor autoscaler by @tpdownes in #581
- Update version to 1.6.0 by @heyealex in #599
- Release v1.6.0 by @heyealex in #600
New Contributors
- @sandwichmaker made their first contribution in #546
Full Changelog: v1.5.0...v1.6.0