Merge pull request #600 from GoogleCloudPlatform/release-candidate

Release v1.6.0

heyealex authored Oct 4, 2022
2 parents 6e8e1f7 + 8878f3f commit 54270c1
Showing 141 changed files with 1,433 additions and 563 deletions.
12 changes: 11 additions & 1 deletion Makefile
@@ -14,11 +14,21 @@ ENG = ./cmd/... ./pkg/...
TERRAFORM_FOLDERS=$(shell find ./modules ./community/modules ./tools -type f -name "*.tf" -not -path '*/\.*' -exec dirname "{}" \; | sort -u)
PACKER_FOLDERS=$(shell find ./modules ./community/modules ./tools -type f -name "*.pkr.hcl" -not -path '*/\.*' -exec dirname "{}" \; | sort -u)

ifneq (, $(shell which git))
## GIT IS PRESENT
ifneq (,$(wildcard .git))
## GIT DIRECTORY EXISTS
GIT_TAG_VERSION=$(shell git tag --points-at HEAD)
GIT_BRANCH=$(shell git branch --show-current)
GIT_COMMIT_INFO=$(shell git describe --tags --dirty --long)
endif
endif

# RULES MEANT TO BE USED DIRECTLY

ghpc: warn-go-version warn-terraform-version warn-packer-version $(shell find ./cmd ./pkg ghpc.go -type f)
$(info **************** building ghpc ************************)
go build ghpc.go
@go build -ldflags="-X 'main.gitTagVersion=$(GIT_TAG_VERSION)' -X 'main.gitBranch=$(GIT_BRANCH)' -X 'main.gitCommitInfo=$(GIT_COMMIT_INFO)'" ghpc.go

install-user:
$(info ******** installing ghpc in ~/bin *********************)
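
For reference, on a clean checkout of the v1.6.0 release tag the three git queries above would return values along the following lines. This is a hypothetical illustration; the actual tag, branch, and commit values depend entirely on your checkout:

```shell
# Hypothetical output for a clean checkout of the v1.6.0 tag on main.
git tag --points-at HEAD             # v1.6.0
git branch --show-current            # main (empty when HEAD is detached)
git describe --tags --dirty --long   # v1.6.0-0-g54270c1
```
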
53 changes: 53 additions & 0 deletions README.md
@@ -204,6 +204,59 @@ In the right side, expand the Filters view and then filter by label, specifying

## Troubleshooting

### Network is unreachable (Slurm V5)

Slurm requires access to Google APIs to function. This can be achieved through one of the following methods:

1. Create a [Cloud NAT](https://cloud.google.com/nat) (preferred); a minimal
   `gcloud` sketch is shown after this list.
2. Set `disable_controller_public_ips: false` and
   `disable_login_public_ips: false` on the controller and login nodes
   respectively.
3. Enable
   [private access to Google APIs](https://cloud.google.com/vpc/docs/private-access-options).
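
For option 1, a minimal `gcloud` sketch for adding a Cloud NAT to an existing
network is shown below. The router name, NAT name, network, and region are
hypothetical placeholders; substitute the values for your deployment:

```shell
# Hypothetical names and region; replace with your own VPC network and region.
gcloud compute routers create hpc-nat-router \
    --network=your-network-name --region=us-central1
gcloud compute routers nats create hpc-nat-config \
    --router=hpc-nat-router --region=us-central1 \
    --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
```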

By default, the Toolkit VPC module creates an associated Cloud NAT, so this
issue is typically seen when working with the pre-existing-vpc module. If no
access to Google APIs exists, you will see the following errors.

When you SSH into the login node or controller, you will see this message:

```text
*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
```

> **_NOTE:_** Many different potential issues could be indicated by the above
> message, so be sure to verify the issue in the logs.

To confirm the issue, SSH into the controller and run
`sudo cat /slurm/scripts/setup.log`. Look for the following log entries:

```text
google_metadata_script_runner: startup-script: ERROR: [Errno 101] Network is unreachable
google_metadata_script_runner: startup-script: OSError: [Errno 101] Network is unreachable
google_metadata_script_runner: startup-script: ERROR: Aborting setup...
google_metadata_script_runner: startup-script exit status 0
google_metadata_script_runner: Finished running startup scripts.
```

You may also notice mount failure logs on the login node:

```text
INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
INFO: Waiting for '/home' to be mounted...
INFO: Waiting for '/opt/apps' to be mounted...
INFO: Waiting for '/etc/munge' to be mounted...
ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
```

> **_NOTE:_** The above logs only indicate that something went wrong during
> startup of the controller. Check the logs on the controller to confirm that
> it is a network issue.
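
One quick way to confirm basic reachability of Google APIs from the controller
is the check below (a hedged example; it assumes `curl` is available on the
image). Any HTTP status code in the output means the endpoint is reachable,
while a "Network is unreachable" error matches the failures above:

```shell
# Hypothetical check; any HTTP status code (even 404) means Google APIs are
# reachable, while "Network is unreachable" confirms the problem.
curl -sS -o /dev/null -w '%{http_code}\n' https://compute.googleapis.com/
```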

### Failure to Create Auto Scale Nodes (Slurm)

If your deployment succeeds but your jobs fail with the following error:
28 changes: 26 additions & 2 deletions cmd/root.go
@@ -23,8 +23,16 @@ import (
"github.com/spf13/cobra"
)

// Git references, populated at build time by the Makefile
var (
rootCmd = &cobra.Command{
GitTagVersion string
GitBranch string
GitCommitInfo string
)

var (
annotation = make(map[string]string)
rootCmd = &cobra.Command{
Use: "ghpc",
Short: "A blueprint and deployment engine for HPC clusters in GCP.",
Long: `gHPC provides a flexible and simple to use interface to accelerate
@@ -34,12 +42,28 @@ HPC deployments on the Google Cloud Platform.`,
log.Fatalf("cmd.Help function failed: %s", err)
}
},
Version: "v1.5.0",
Version: "v1.6.0",
Annotations: annotation,
}
)

// Execute the root command
func Execute() error {
if len(GitCommitInfo) > 0 {
if len(GitTagVersion) == 0 {
GitTagVersion = "- not built from official release"
}
if len(GitBranch) == 0 {
GitBranch = "detached HEAD"
}
annotation["version"] = GitTagVersion
annotation["branch"] = GitBranch
annotation["commitInfo"] = GitCommitInfo
rootCmd.SetVersionTemplate(`ghpc version {{index .Annotations "version"}}
Built from '{{index .Annotations "branch"}}' branch.
Commit info: {{index .Annotations "commitInfo"}}
`)
}
return rootCmd.Execute()
}

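With the Makefile and `cmd/root.go` changes above, `ghpc --version` reports the
git metadata injected at build time. The session below is a hypothetical sketch
assuming a build from the v1.6.0 tag on the `main` branch; the exact tag,
branch, and commit string depend on your checkout:

```shell
make ghpc
./ghpc --version
# ghpc version v1.6.0
# Built from 'main' branch.
# Commit info: v1.6.0-0-g54270c1
```
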
9 changes: 2 additions & 7 deletions community/examples/AMD/README.md
@@ -75,20 +75,15 @@ remounted and that you should logout and login. Follow its instructions.
Once configuration is complete, install AOCC by running:
```shell
sudo -i bash /var/tmp/install_aocc.sh
sudo bash /var/tmp/install_aocc.sh
```
Spack will prompt you to accept the AOCC End User License Agreement by opening a
text file containing information about the license. Leave the file unmodified
and close it by typing `:q` as two characters in sequence
([VI help][vihelp]).
Installation of AOCC and OpenMPI will take approximately 15 minutes. Once they
are installed, you can install additional packages such as `amdblis`:
```shell
sudo -i spack -d install -v amdblis %[email protected]
```
Installation of AOCC and OpenMPI will take approximately 15 minutes.
Configure SSH user keys for access between cluster nodes:
28 changes: 27 additions & 1 deletion community/examples/AMD/hpc-cluster-amd-slurmv5.yaml
@@ -65,20 +65,46 @@ deployment_groups:
- type: shell
source: modules/startup-script/examples/install_ansible.sh
destination: install_ansible.sh
- $(swfs.install_nfs_client_runner)
- $(swfs.mount_runner)
- $(spack.install_spack_deps_runner)
- $(spack.install_spack_runner)
- type: shell
content: "shutdown -h +1"
destination: shutdown.sh

- id: slurm_startup
source: modules/scripts/startup-script
settings:
runners:
- type: data
destination: /etc/profile.d/spack.sh
content: |
#!/bin/sh
if [ -f /sw/spack/share/spack/setup-env.sh ]; then
. /sw/spack/share/spack/setup-env.sh
fi
# the following installation of AOCC may be automated in the future
# with a clear direction to the user to read the EULA at
# https://developer.amd.com/aocc-compiler-eula/
- type: data
destination: /var/tmp/install_aocc.sh
content: |
#!/bin/bash
source /sw/spack/share/spack/setup-env.sh
spack install [email protected] +license-agreed
spack load [email protected]
spack compiler find --scope site
spack -d install -v [email protected] %[email protected] +legacylaunchers +pmi schedulers=slurm
# must restart vm to re-initiate subsequent installs
- id: spack_builder
source: modules/compute/vm-instance
use: [network1, swfs, spack-startup]
settings:
name_prefix: spack-builder
machine_type: c2d-standard-16

- id: low_cost_partition
source: community/modules/compute/schedmd-slurm-gcp-v5-partition
use:
@@ -118,6 +144,6 @@ deployment_groups:
use:
- network1
- slurm_controller
- spack-startup
- slurm_startup
settings:
machine_type: c2d-standard-4
5 changes: 0 additions & 5 deletions community/examples/cloud-batch.yaml
@@ -29,17 +29,14 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform

- id: appfs
source: modules/file-system/filestore
kind: terraform
use: [network1]
settings: {local_mount: /sw}

- id: hello-startup-script
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -55,7 +52,6 @@
- id: batch-job
source: community/modules/scheduler/cloud-batch-job
kind: terraform
use: [network1, appfs, hello-startup-script]
settings:
runnable: "cat /sw/hello.txt"
@@ -66,6 +62,5 @@

- id: batch-login
source: community/modules/scheduler/cloud-batch-login-node
kind: terraform
use: [batch-job]
outputs: [instructions]
6 changes: 0 additions & 6 deletions community/examples/hpc-cluster-small-sharedvpc.yaml
@@ -43,15 +43,13 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform
settings:
project_id: $(vars.host_project_id)
network_name: your-shared-network
subnetwork_name: your-shared-subnetwork

- id: homefs
source: modules/file-system/filestore
kind: terraform
use: [network1]
settings:
local_mount: /home
@@ -61,7 +59,6 @@
# This debug_partition will work out of the box without requesting additional GCP quota.
- id: debug_partition
source: community/modules/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
use:
- network1
- homefs
@@ -75,7 +72,6 @@
# This compute_partition is far more performant than debug_partition but may require requesting GCP quotas first.
- id: compute_partition
source: community/modules/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
use:
- network1
- homefs
@@ -85,7 +81,6 @@

- id: slurm_controller
source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
kind: terraform
use:
- network1
- homefs
@@ -97,7 +92,6 @@

- id: slurm_login
source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node
kind: terraform
use:
- network1
- homefs
10 changes: 0 additions & 10 deletions community/examples/htcondor-pool.yaml
@@ -29,7 +29,6 @@ deployment_groups:
modules:
- id: network1
source: modules/network/vpc
kind: terraform
settings:
network_name: htcondor-pool
subnetwork_name: htcondor-pool-usc1
@@ -38,21 +37,17 @@

- id: htcondor_install
source: community/modules/scripts/htcondor-install
kind: terraform

- id: htcondor_services
source: community/modules/project/service-enablement
kind: terraform
use:
- htcondor_install

- id: htcondor_configure
source: community/modules/scheduler/htcondor-configure
kind: terraform

- id: htcondor_configure_central_manager
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -63,7 +58,6 @@

- id: htcondor_cm
source: modules/compute/vm-instance
kind: terraform
use:
- network1
- htcondor_configure_central_manager
@@ -80,7 +74,6 @@

- id: htcondor_configure_execute_point
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -91,7 +84,6 @@

- id: htcondor_execute_point
source: community/modules/compute/htcondor-execute-point
kind: terraform
use:
- network1
- htcondor_configure_execute_point
@@ -106,7 +98,6 @@

- id: htcondor_configure_access_point
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -130,7 +121,6 @@
queue
- id: htcondor_access
source: modules/compute/vm-instance
kind: terraform
use:
- network1
- htcondor_configure_access_point
3 changes: 0 additions & 3 deletions community/examples/intel/daos-cluster.yaml
@@ -30,14 +30,12 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform

# This module creates a DAOS server. Server images MUST be created before running this.
# https://github.com/daos-stack/google-cloud-daos/tree/main/images
# more info: https://github.com/daos-stack/google-cloud-daos/tree/main/terraform/modules/daos_server
- id: daos-server
source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_server?ref=v0.2.1
kind: terraform
use: [network1]
settings:
number_of_instances: 2
@@ -48,7 +46,6 @@
# more info: https://github.com/daos-stack/google-cloud-daos/tree/main/terraform/modules/daos_client
- id: daos-client
source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_client?ref=v0.2.1
kind: terraform
use: [network1, daos-server]
settings:
number_of_instances: 2