Skip to content

Commit

Permalink
Merge pull request #459 from carpentries-incubator/nist-blackbird-slurm
Browse files Browse the repository at this point in the history
Skeleton for a new cluster 😁
  • Loading branch information
reid-a authored Jan 13, 2025
2 parents 39981f1 + 6acb2d6 commit 6cd2636
Show file tree
Hide file tree
Showing 35 changed files with 401 additions and 2 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# NIST Blackbird cluster snippets
---
snippets: "/snippets_library/NIST_Blackbird_slurm"

local:
prompt: "user@laptop:~$"
bash_shebang: "#!/usr/bin/bash"

remote:
name: "Blackbird"
login: "blackbird.nist.gov"
host: "blackbird"
node: "bb001"
location: "National Institute of Standards and Technology"
homedir: "/home"
user: "abc"
prompt: "abc@blackbird:~$"
bash_shebang: "#!/bin/bash"

sched:
name: "Slurm"
submit:
name: "sbatch"
options: "--partition=batch"
queue:
debug: "batch"
testing: "batch"
status: "squeue"
flag:
user: "-u abc"
interactive: ""
histdetail: "--format=JobName,Submit,Start,State,ReqCPUS,Reserved,Elapsed,MaxRSS -j"
name: "-J"
time: "-t"
queue: "-p"
partition: "-p batch"
del: "scancel"
interactive: "srun"
info: "sinfo"
comment: "#SBATCH"
hist: "sacct -u abc"
hist_filter: ""

episode_order:
- 10-hpc-intro
- 11-connecting
- 12-cluster
- 13-scheduler
- 14-environment-variables
- 16-transferring-files
- 17-parallel
- 18-resources
- 19-responsibility
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch up 7-00:00:00 17 drng bb[003-018,027,029,035]
batch up 7-00:00:00 3 alloc bb[020-022]
batch up 7-00:00:00 3 drain bb[047,059,080]
batch up 7-00:00:00 47 down* bb[019,023-026,028,030,036-046,048-052,054,056-058,060-079]
batch up 7-00:00:00 8 idle bb[001-002,031-034,053,055]
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
```
bin etc lib64 proc sbin sys var
boot {{ site.remote.homedir | replace: "/", "" }} mnt root scratch tmp working
dev lib opt run srv usr
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
> ## Explore a Worker Node
>
> Finally, let's look at the resources available on the worker nodes where your
> jobs will actually run. Try running this command to see the name, CPUs and
> memory available on the worker nodes:
>
> ```
> {{ site.remote.prompt }} sinfo -n {{ site.remote.node }} -o "%n %c %m"
> ```
> {: .language-bash}
{: .challenge}
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<!-- CTCMS does not use Modules -->
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
No Modulefiles Currently Loaded.
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
```
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
{{ site.remote.prompt }} which python3
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
/usr/bin/python3
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
{{ site.remote.prompt }} ls /usr/bin/py*
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
```
py3clean pydoc3.5 python2 python3-config
py3compile pygettext python2.7 python3-futurize
py3versions pygettext2.7 python2.7-config python3m
pybuild pygettext3 python2-config python3m-config
pyclean pygettext3.5 python3 python3-pasteurize
pycompile pygobject-codegen-2.0 python3.5 python-config
pydoc pygtk-codegen-2.0 python3.5-config pyversions
pydoc2.7 pygtk-demo python3.5m
pydoc3 python python3.5m-config
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
```
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<!-- CTCMS does not use modules -->
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<!-- CTCMS does not use modules -->
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
```
{{ site.remote.bash_shebang }}
{{ site.sched.comment }} {{ site.sched.flag.name }} parallel-job
{{ site.sched.comment }} {{ site.sched.flag.queue }} {{ site.sched.queue.testing }}
{{ site.sched.comment }} -N 1
{{ site.sched.comment }} -n 8
# Execute the task
mpiexec amdahl
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
```
{{ site.remote.bash_shebang }}
{{ site.sched.comment }} {{ site.sched.flag.name }} parallel-job
{{ site.sched.comment }} {{ site.sched.flag.queue }} {{ site.sched.queue.testing }}
{{ site.sched.comment }} -N 1
{{ site.sched.comment }} -n 4
# Execute the task
mpiexec amdahl
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
```
{{ site.remote.bash_shebang }}
{{ site.sched.comment }} {{ site.sched.flag.name }} solo-job
{{ site.sched.comment }} {{ site.sched.flag.queue }} {{ site.sched.queue.testing }}
{{ site.sched.comment }} -N 1
{{ site.sched.comment }} -n 1
# Execute the task
amdahl
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
```
JobID JobName Partition AllocCPUS State Exit
------------ ---------- ---------- ---------- ---------- ----
212339 hostname {{ site.sched.queue.debug }} 2 COMPLETED
212340 hostname {{ site.sched.queue.debug }} 2 COMPLETED
212341 env {{ site.sched.queue.debug }} 2 COMPLETED
212342 mpirun {{ site.sched.queue.testing }} 2 COMPLETED
212343 mpirun {{ site.sched.queue.testing }} 2 COMPLETED
212344 amdahl {{ site.sched.queue.testing }} 2 COMPLETED
212345 amdahl {{ site.sched.queue.testing }} 2 COMPLETED
212346 bash {{ site.sched.queue.testing }} 2 COMPLETED
212346.0 bash 2 COMPLETED
212346.1 amdahl 2 COMPLETED
212347 amdahl {{ site.sched.queue.testing }} 2 FAILED
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
* **Hostname**: Where did your job run?
* **MaxRSS**: What was the maximum amount of memory used?
* **Elapsed**: How long did the job take?
* **State**: What is the job currently doing/what happened to it?
* **MaxDiskRead**: Amount of data read from disk.
* **MaxDiskWrite**: Amount of data written to disk.
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
```
top - 15:47:18 up 21 days, 6:25, 2 users, load average: 0.02, 0.04, 0.04
Tasks: 223 total, 1 running, 222 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 99.6 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32950812 total, 1594456 free, 502696 used, 30853660 buff/cache
KiB Swap: 64002952 total, 64002952 free, 0 used. 31913980 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1693 jeff 20 0 4270580 346944 171372 S 29.8 2.1 9:31.89 gnome-shell
3140 jeff 20 0 3142044 928972 389716 S 27.5 5.7 13:30.29 Web Content
3057 jeff 20 0 3115900 521368 231288 S 18.9 3.2 10:27.71 firefox
6007 jeff 20 0 813992 112336 75592 S 4.3 0.7 0:28.25 tilix
1742 jeff 20 0 975080 164508 130624 S 2.0 1.0 3:29.83 Xwayland
1 root 20 0 230484 11924 7544 S 0.3 0.1 0:06.08 systemd
68 root 20 0 0 0 0 I 0.3 0.0 0:01.25 kworker/4:1
2913 jeff 20 0 965620 47892 37432 S 0.3 0.3 0:11.76 code
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
```
total used free shared buff/cache available
Mem: 31G 501M 1.5G 64M 29G 30G
Swap: 61G 0B 61G
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
Submitted batch job 36855
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
```
JOBID PARTITION NAME ST TIME NODES NODELIST(REASON)
212201 {{ site.sched.queue.debug }} example- R 0:05 1 r002
```
{: .output}

We can see all the details of our job, most importantly that it is in the `R`
or `RUNNING` state. Sometimes our jobs might need to wait in a queue
(`PENDING`) or have an error (`E`).
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
> Jobs on an HPC system might run for days or even weeks. We probably have
> better things to do than constantly check on the status of our job with
> `{{ site.sched.status }}`. Looking at the manual page for
> `{{ site.sched.submit.name }}`, can you set up our test job to send you an email
> when it finishes?
>
> > ## Hint
> >
> > You can use the *manual pages* for {{ site.sched.name }} utilities to find
> > more about their capabilities. On the command line, these are accessed
> > through the `man` utility: run `man <program-name>`. You can find the same
> > information online by searching > "man <program-name>".
> >
> > ```
> > {{ site.remote.prompt }} man {{ site.sched.submit.name }}
> > ```
> > {: .language-bash}
> {: .solution}
{: .challenge}
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
```
JOBID PARTITION NAME ST TIME NODES NODELIST(REASON)
212202 {{ site.sched.queue.debug }} hello-wo R 0:02 1 r002
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
* `--ntasks=<ntasks>` or `-n <ntasks>`: How many CPU cores does your job need,
in total?

* `--time <days-hours:minutes:seconds>` or `-t <days-hours:minutes:seconds>`:
How much real-world time (walltime) will your job take to run? The `<days>`
part can be omitted.

* `--mem=<megabytes>`: How much memory on a node does your job need in
megabytes? You can also specify gigabytes using by adding a little "g"
afterwards (example: `--mem=5g`)

* `--nodes=<nnodes>` or `-N <nnodes>`: How many separate machines does your job
need to run on? Note that if you set `ntasks` to a number greater than what
one machine can offer, {{ site.sched.name }} will set this value
automatically.
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
> ## Job environment variables
>
> When {{ site.sched.name }} runs a job, it sets a number of environment
> variables for the job. One of these will let us check what directory our job
> script was submitted from. The `SLURM_SUBMIT_DIR` variable is set to the
> directory from which our job was submitted. Using the `SLURM_SUBMIT_DIR`
> variable, modify your job so that it prints out the location from which the
> job was submitted.
>
> > ## Solution
> >
> > ```
> > {{ site.remote.prompt }} nano example-job.sh
> > {{ site.remote.prompt }} cat example-job.sh
> > ```
> > {: .language-bash}
> >
> > ```
> > {{ site.remote.bash_shebang }}
> > {{ site.sched.comment }} {{ site.sched.flag.partition }}
> > {{ site.sched.comment }} {{ site.sched.flag.time }} 00:00:20
> >
> > echo -n "This script is running on "
> > hostname
> >
> > echo "This job was launched in the following directory:"
> > echo ${SLURM_SUBMIT_DIR}
> > ```
> > {: .output}
> {: .solution}
{: .challenge}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
{{ site.remote.prompt }} cat slurm-38193.out
```
{: .language-bash}
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
```
This job is running on:
{{ site.sched.node }}
slurmstepd: error: *** JOB 38193 ON {{ site.sched.node }} CANCELLED AT
2017-07-02T16:35:48 DUE TO TIME LIMIT ***
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
```
Submitted batch job 212203
JOBID PARTITION NAME ST TIME NODES NODELIST(REASON)
212203 {{ site.sched.queue.debug }} hello-wo R 0:03 1 r002
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```
{: .output}
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
> ## Cancelling multiple jobs
>
> We can also all of our jobs at once using the `-u` option. This will delete
> all jobs for a specific user (in this case, yourself). Note that you can only
> delete your own jobs.
>
> Try submitting multiple jobs and then cancelling them all.
>
> > ## Solution
> >
> > First, submit a trio of jobs:
> >
> > ```
> > {{ site.remote.prompt }} {{ site.sched.submit.name }} {% if site.sched.submit.options != '' %}{{ site.sched.submit.options }} {% endif %}example-job.sh
> > {{ site.remote.prompt }} {{ site.sched.submit.name }} {% if site.sched.submit.options != '' %}{{ site.sched.submit.options }} {% endif %}example-job.sh
> > {{ site.remote.prompt }} {{ site.sched.submit.name }} {% if site.sched.submit.options != '' %}{{ site.sched.submit.options }} {% endif %}example-job.sh
> > ```
> > {: .language-bash}
> >
> > Then, cancel them all:
> >
> > ```
> > {{ site.remote.prompt }} {{ site.sched.del }} -u yourUsername
> > ```
> > {: .language-bash}
> {: .solution}
{: .challenge}
Loading

0 comments on commit 6cd2636

Please sign in to comment.