Frequently Asked Questions (FAQ)

Slurm FAQ

For Management

Why should I use Slurm or other Free Open Source Software (FOSS)?

https://slurm.schedmd.com/faq.html#foss

Free Open Source Software (FOSS) does not mean that it is without cost. It does mean that you have access to the code, so you are free to use it, study it, and/or enhance it. These reasons contribute to Slurm (and FOSS in general) being subject to active research and development worldwide, displacing proprietary software in many environments. If the software is large and complex, like Slurm or the Linux kernel, then while there is no license fee, its use is not without cost.

Why should I use slurm-gcp?

This is the official and supported solution from SchedMD in partnership with Google for Slurm on Google Cloud Platform.

slurm-gcp provides Terraform modules, which make standing up a cluster easy and integrate with your existing infrastructure.

How do I get support for slurm-gcp and Slurm?

Please visit SchedMD Support and reach out. Tickets can be submitted via SchedMD's Bugzilla.

For Users

Where can I find the Slurm logs?

  • Check the GCP Console Logs Viewer.
  • On Slurm cloud nodes, check /var/log/slurm/*.log.
  • Otherwise check /var/log/messages (RHEL/CentOS) or /var/log/syslog (Debian/Ubuntu).
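A quick way to sweep these locations on a node is a loop like the following best-effort sketch; files absent on a given host are simply skipped:

```shell
# Check the log locations listed above; files absent on this host are skipped.
for f in /var/log/slurm/*.log /var/log/messages /var/log/syslog; do
  if [ -e "$f" ]; then
    echo "found: $f"
  fi
done
echo "for cloud-side events, also check the GCP Console Logs Viewer"
```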

How do I enable additional logging for Slurm-GCP?

  1. Enable debug logging using the terraform enable_debug_logging variable.
  2. If you need more detail, such as verbose GCP API request information, enable the appropriate logging flag using the terraform extra_logging_flags variable (see the logging_flags variable in scripts/util.py for the list of supported log flags).
    • For verbose API request information, use the trace_api logging flag.
  3. These options increase logging for Slurm-GCP script logs only, such as resume.log and suspend.log.
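In tfvars form, the two steps above might look like the following sketch; the exact shape of extra_logging_flags should be checked against the module's variables:

```hcl
# Hypothetical tfvars sketch -- confirm variable shapes in the module's variables.tf.
enable_debug_logging = true
extra_logging_flags  = ["trace_api"]
```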

How do I move data for a job?

Data can be migrated to and from external sources using a workflow of dependent jobs. A workflow submission script and helper jobs are provided. See README for more information.
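As an illustration of the dependent-job pattern (not the provided workflow script itself), a hypothetical stage-in/compute/stage-out chain might look like:

```shell
# Hypothetical: chain stage-in -> compute -> stage-out with job dependencies.
# stage_in.sh, compute.sh, and stage_out.sh are placeholder batch scripts.
if command -v sbatch >/dev/null 2>&1; then
  jid_in=$(sbatch --parsable stage_in.sh)                        # copy data in
  jid_run=$(sbatch --parsable --dependency=afterok:"$jid_in" compute.sh)
  sbatch --dependency=afterok:"$jid_run" stage_out.sh            # copy results out
else
  note="sbatch not found; run this on a cluster login node"
  echo "$note"
fi
```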

How do I connect to Slurm instances?

  • If the compute nodes have external IPs, you can connect directly to them. From the VM Instances page, the SSH drop-down next to each compute instance gives several options for connecting.

  • With IAP enabled, you can SSH to the nodes whether or not they have external IPs.

  • Use Slurm to get an allocation on the nodes.

    For example:

    $ srun --pty $SHELL
    [g1-debug-test-0 ~]$
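With IAP, the connection can also be tunneled through gcloud. The sketch below prints an example command rather than running it, since it needs project credentials; the instance name and zone are placeholders:

```shell
# Example IAP-tunneled SSH command (instance name and zone are placeholders).
cmd='gcloud compute ssh g1-debug-test-0 --zone=us-central1-a --tunnel-through-iap'
echo "$cmd"
```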
    

For Administrators

How do I contribute to slurm-gcp or slurm?

Enhancement requests can be submitted to SchedMD's Bugzilla.

How do I use Terraform?

Please see Terraform documentation.

For the Slurm terraform modules, please refer to their module API as documented in their README's. Additionally, please see the Slurm terraform examples for sample usage.

How do I modify Slurm config files?

Presuming slurm_cluster terraform module was used to deploy the cluster, see input parameters:

  • slurm_conf_tpl
  • cgroup_conf_tpl
  • slurmdbd_conf_tpl
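In tfvars form, pointing the cluster at your own templates might look like this sketch; the paths are placeholders:

```hcl
# Hypothetical: point the cluster at your own config templates (paths are examples).
slurm_conf_tpl    = "./etc/slurm.conf.tpl"
cgroup_conf_tpl   = "./etc/cgroup.conf.tpl"
slurmdbd_conf_tpl = "./etc/slurmdbd.conf.tpl"
```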

What are GCP preemptible VMs?

Preemptible instances are cheaper than on-demand instances; however, they can be reclaimed at any time under their Service Level Agreement (SLA). Google Cloud offers two types of preemptible VMs: preemptible (v1) and Spot (beta). Spot VMs offer more features and better control over the reclaim process and when instances can be reclaimed.

As far as Slurm is concerned, all preemptible instance types are treated the same. When an instance is reclaimed (terminated or stopped), it is marked "down" and its running jobs are requeued if possible, otherwise canceled. slurmsync detects this activity and clears the "down" state from the node so it may be allocated jobs again.

How do I reduce compute costs?

  • In partition_conf, set a lower SuspendTime for a given slurm_partition.

    For example:

    partition_conf = {
      SuspendTime = 120
    }
  • For compute nodes within a given slurm_partition, use preemptible VM instances.

    For example:

    partition_nodes = [
      {
        ...
        preemptible = true
        ...
      }
    ]
  • For compute nodes within a given slurm_partition, use SPOT VM instances.

    For example:

    partition_nodes = [
      {
        ...
        enable_spot_vm = true
        ...
      }
    ]

How do I limit user access to only using login nodes?

By default, all instances are configured with OS Login. This keeps UID and GID of users consistent across all instances and allows easy user control with IAM Roles.

  1. Create a group for all users in admin.google.com.
  2. At the project level in IAM, grant the Compute Viewer and Service Account User roles to the group.
  3. At the instance level for each login node, grant the Compute OS Login role to the group:
    • Make sure the Info Panel is shown on the right.
    • On the compute instances page, select the boxes to the left of the login nodes.
    • Click Add Members and add the Compute OS Login role to the group.
  4. At the organization level, grant the Compute OS Login External User role to the group if the users are not part of the organization.
  5. To allow SSH to login nodes without external IPs, configure IAP for the group:
    • Go to the Identity-Aware Proxy page.
    • Select the project.
    • Click the SSH AND TCP RESOURCES tab.
    • Select the boxes for the login nodes.
    • Add the group as a member with the IAP-secured Tunnel User role.

Please see Enabling IAP for Compute Engine for more information.
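The project-level role grants above can also be expressed in Terraform rather than clicked through the console. A hypothetical sketch, where the group address and project variable are placeholders:

```hcl
# Hypothetical Terraform sketch of the project-level role grants.
# The group address and project variable are placeholders.
resource "google_project_iam_member" "hpc_users_compute_viewer" {
  project = var.project_id
  role    = "roles/compute.viewer"
  member  = "group:hpc-users@example.com"
}

resource "google_project_iam_member" "hpc_users_sa_user" {
  project = var.project_id
  role    = "roles/iam.serviceAccountUser"
  member  = "group:hpc-users@example.com"
}
```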

What Slurm image do I use for production?

By default, the slurm_cluster terraform module uses the latest Slurm image family (e.g. slurm-gcp-6-8-hpc-rocky-linux-8). As new Slurm image families are released, coinciding with periodic Slurm releases, the terraform module will be updated to track the newest image family by setting it as the new default. This update can be considered a breaking change.

In a production setting, it is recommended to explicitly set an image family. Doing so prevents slurm-gcp changes to the default image family from negatively impacting your cluster. Otherwise, the controller and all other instances may be force-replaced (destroyed, then redeployed) when terraform apply detects that the image family of the Slurm instances has changed.

Optionally, you may generate and use your own Slurm images. See custom image creation for more information.
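Pinning the family in a tfvars file might look like the following sketch; the exact variable name and image project depend on the module version, so check the module's README:

```hcl
# Hypothetical pin of the Slurm instance image (names are examples).
instance_image = {
  family  = "slurm-gcp-6-8-hpc-rocky-linux-8"
  project = "schedmd-slurm-public"
}
```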

What operating systems can I use slurm-gcp with?

You may use any OS supported by the image build process.

See image docs for more information.

Should I disable Simultaneous Multithreading (SMT)?

https://cloud.google.com/architecture/best-practices-for-using-mpi-on-compute-engine#disable_simultaneous_multithreading

Some HPC applications get better performance by disabling Simultaneous Multithreading (SMT) in the guest OS. Simultaneous Multithreading, commonly known as Intel Hyper-threading, allocates two virtual cores (vCPU) per physical core on the node. For many general computing tasks or tasks that require lots of I/O, SMT can increase application throughput significantly. For compute-bound jobs in which both virtual cores are compute-bound, SMT can hinder overall application performance and can add unpredictable variance to jobs. Turning off SMT allows more predictable performance and can decrease job times.

Important: Disabling SMT changes the way cores are counted, and may increase the cost per core of the cluster depending on how you count cores. Although cost per core is a common metric for on-premises hardware, a more appropriate metric for the cloud is cost per workload or cost per job. For compute-bound jobs, you pay for what you use. Turning off Hyper-Threading can reduce the overall runtime, which can reduce the overall cost of the job. We recommend that you benchmark your application and use this feature where it is beneficial.

You can disable Simultaneous Multithreading at VM creation on all VM types with the following exceptions:

  • VMs that run on machine types that have fewer than 2 vCPUs (such as n1-standard-1) or shared-core machines (such as e2-small).
  • VMs that run on the Tau T2D machine type.

When using slurm-gcp terraform modules, use option disable_smt to toggle Simultaneous Multithreading (SMT) on/off.
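In tfvars form, this might look like the sketch below; the placement of the option varies by module version, so check the module variables:

```hcl
# Hypothetical: disable SMT for a partition's compute nodes.
partition_nodes = [
  {
    # ...other node settings...
    disable_smt = true
  }
]
```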

How do I automate custom cluster configurations?

The Slurm cluster module provides multiple variables (controller_startup_scripts, compute_startup_scripts, partition_startup_scripts) that allow you to input a list of scripts to run on different sets of hosts at set-up time. The scripts are run synchronously, and a non-zero exit will fail the setup step of the instance. Generally, controller_startup_scripts run only on the controller node; compute_startup_scripts run on the login and all compute nodes; and partition_startup_scripts run on all compute nodes within that partition. See Slurm cluster module variables for details.

If you want to install software, it is recommended to bake it into the image. Doing so will speed up the deployment of bursted compute nodes. See customize image for more information.
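A minimal sketch of the startup-script variables, assuming a list-of-objects shape with filename and content fields (verify against the module variables):

```hcl
# Hypothetical: one script run on the controller at set-up time.
controller_startup_scripts = [
  {
    filename = "01-custom.sh"
    content  = <<-EOT
      #!/bin/bash
      set -e
      echo "custom controller setup ran" >> /var/log/custom-setup.log
    EOT
  }
]
```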

How do I replace the controller?

Replacing the controller instance is a hazardous action.

It is recommended to:

  1. Drain the cluster of all jobs.
    • Optionally, state=power_down all nodes.
  2. Save and export all local data off the controller.
    • By default, the database (mariadb) and /home (NFS mounted) are local.
  3. Replace the controller instance by either:
    • Update tfvars configuration then terraform apply.
    • Or, manually terminate the controller instance then terraform apply.
  4. Reboot all instances with NFS mounts to the controller.
    • By default, this includes all login and compute nodes.
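Steps 1 and 2 might look like the following on the controller; the database name and backup paths are assumptions, so adapt them to your deployment:

```shell
# Hypothetical sketch of draining the cluster and saving local data.
# slurm_acct_db and the backup paths are assumptions -- adjust to your site.
if command -v scontrol >/dev/null 2>&1; then
  scontrol update nodename=ALL state=power_down reason=maintenance
  sudo sh -c 'mysqldump slurm_acct_db > /root/slurm_acct_db.sql'  # slurmdbd database
  sudo tar czf /root/home-backup.tar.gz /home                     # NFS-exported /home
else
  note="scontrol not found; run this on the controller instance"
  echo "$note"
fi
```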