How to use Compute Canada
- Create a CCDB account. Ask your supervisor (Liam/Glen, CCRI fju-421-02) to add you to their account using your account ID.
- Once your account has been approved, you should be able to SSH into the Compute Canada cluster. Assuming you're logging in to Narval, the command is:
ssh {your-username}@narval.computecanada.ca
For information on the computing clusters, you can search for the hostname on this wiki.
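If you connect often, you can add an entry to your local ~/.ssh/config so you only have to type ssh narval. This is a minimal sketch; the alias narval and the username are placeholders to adapt:
Host narval
    HostName narval.computecanada.ca
    User {your-username}
After that, ssh narval (and scp narval:... and friends) will work from your machine.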
- Now set up Globus so you can transfer files. Start by creating an account (probably using your Google account, since UdeM/Mila aren't recognized by Globus). Then go here and log in. Download Globus Connect Personal and set it up on your machine. Give it a few minutes, and then refresh the web app: your endpoint should show up (you can search for it in the search bar). Finally, search for a CCDB machine. It'll ask you to log in a billion times, but once you're fully logged in, you're done. You can now transfer files from CCDB to your machine and vice versa.
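Globus is the recommended way to move large datasets, but for a handful of small files a plain scp or rsync from your own machine also works. A sketch, where the file and folder names are placeholders:
scp my_script.py {your-username}@narval.computecanada.ca:~/projects/def-lpaull/{your-username}/
rsync -avz ./my_code/ {your-username}@narval.computecanada.ca:~/projects/def-lpaull/{your-username}/my_code/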
- Files are stored in a certain way on Compute Canada. You can learn more about it here. It's recommended that you store all your files according to the following:
  - Code at: ~/projects/def-lpaull/{your-username}
  - Datasets at: ~/projects/def-lpaull/Datasets
  - Singularity images at: ~/projects/def-lpaull/Singularity
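For example, after your first login you can create your personal code folder with the command below (a sketch, assuming the group folder already exists and you have write access to it). On the cluster, $USER expands to your Compute Canada username, as in the example job script at the end of this page.
mkdir -p ~/projects/def-lpaull/$USER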
- Go over Your First Job to submit your first job. Then go over Containerization.
- Your supervisor can have multiple accounts. Both Liam and Glen have default accounts beginning with def- (def-gberseth and def-lpaull). Those accounts usually are not associated with any project-specific allocations. Your supervisor might also have an account that begins with rrg-. Those accounts result from winning resource allocation competitions, and should be used as much as possible.
- As a general rule, def accounts do not have a GPU allocation, meaning you will wait a long time for your job to run on a CPU/GPU node. rrg accounts can have GPU, CPU, or a mix of both as allocations. If your supervisor only has a GPU allocation for their rrg account, you must request a GPU when running a job, as CPU-only jobs will result in error messages.
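In practice, the choice of account only shows up in the --account line of your job scripts. A sketch of the relevant header lines; the exact rrg- account name is a placeholder, ask your supervisor for it:
#SBATCH --account=def-lpaull       # default account: typically no GPU allocation, longer waits
##SBATCH --account=rrg-<account>   # project allocation: prefer it when your supervisor has one (extra # disables the line)
#SBATCH --gres=gpu:1               # request a GPU when running under a GPU allocation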
- Congratulations! You've successfully submitted your first jobs! You can learn more about Compute Canada on the main wiki page.
Your First Job
The first thing to do is to connect to the server:
ssh <your_username>@narval.computecanada.ca
If you can't connect at this point, check the server status.
Though it is not recommended for performance reasons, you can log in with a GUI by using the X11 forwarding option ssh -Y to connect to nodes that enable this feature.
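For example (a sketch; xclock is just a small X11 test program, assuming it is installed on the login node):
ssh -Y <your_username>@narval.computecanada.ca
xclock   # a clock window should open on your local machine if X11 forwarding works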
The execution of one program is called a job, and you will need to write a job script to interact with the HPC. The HPC uses a job scheduler called SLURM, which reads the job script to decide when and where the job will run. A serial job is not parallelized and is the simplest type of job.
- Create a simple job script simple_job.sh that will output a sentence:
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
echo 'Hello HPC world !'
sleep 5s
Lines starting with #SBATCH specify what options you want to give to SLURM.
You can add more options, like the job name (--job-name=test) or the output file name (--output=test-%J.out).
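For example, the same simple_job.sh with those two extra options would look like this (a sketch):
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
#SBATCH --job-name=test          # name shown in the queue
#SBATCH --output=test-%J.out     # output file name pattern
echo 'Hello HPC world !'
sleep 5s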
- Transfer this file with Globus to the server. As mentioned earlier, you should use your <username> folder inside def-<advisor>.
- Submit the job script with SLURM:
sbatch simple_job.sh
- To check the status of the job in the queue (time remaining, finish status, etc.), you can type:
squeue -u <user_name>
- When it is done, the output will be available in a file called slurm-<id_of_job>.out.
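Putting the whole round trip together (a sketch; replace the placeholders with your own values):
sbatch simple_job.sh            # prints something like "Submitted batch job <id_of_job>"
squeue -u <user_name>           # watch the job wait and then run
cat slurm-<id_of_job>.out       # read the output once the job is done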
Containerization
This section is adapted from this.
Since you can't access the internet from the CCDB compute nodes, you need to containerize all your dependencies.
Docker is a common and powerful tool to bundle or "containerize" an application into a virtual environment. This helps you deploy your work easily, without worrying about the environment where it will be used. You can't use Docker on HPCs because you need admin rights to run it, but Singularity doesn't need sudo rights.
Singularity is a "super docker" system. A single image contains everything required to run your script on Compute Canada. It is now known as Apptainer.
- Go find an appropriate Apptainer release.
- If it's a .deb or .rpm or other app image, you can just run it. Otherwise, build from source:
export VERSION=1.0.2 && # adjust this as necessary \
    wget https://github.com/apptainer/apptainer/releases/download/v${VERSION}/apptainer-${VERSION}.tar.gz && \
    tar -xzf apptainer-${VERSION}.tar.gz && \
    cd apptainer
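The source build then typically continues with the steps below; check the INSTALL.md shipped with the release you downloaded for the exact requirements (a recent Go toolchain is needed):
./mconfig && \
    make -C ./builddir && \
    sudo make -C ./builddir install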
- Create a single Python script par_job.py inside your home ~/ that will output the numbers from a to b, one every 10 s:
import sys
import time

# print every integer from a to b (inclusive), waiting 10 seconds between prints
for i in range(int(sys.argv[1]), int(sys.argv[2]) + 1):
    print(i)
    time.sleep(10)
You can test it with python3 par_job.py 1 10 and it should output all the numbers from 1 to 10 over about 100 s.
- Build a Docker image for your script. Dockerfile:
FROM python:3.9-buster
WORKDIR workdir
COPY par_job.py .
ENTRYPOINT ["python3", "par_job.py"]
Build:
docker build -t parjob .
Trial run:
docker run -it parjob 1 10
- Transform your Docker image into an Apptainer image:
sudo APPTAINER_NOHTTPS=1 apptainer build parjob.sif docker-daemon://parjob:latest
sudo chmod 777 parjob.sif
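If your image is published on a registry such as Docker Hub, you can also build the .sif directly from the registry instead of going through your local Docker daemon. A sketch, where python39.sif is an arbitrary output name:
apptainer build python39.sif docker://python:3.9-buster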
- Run par_job.py inside the Apptainer image:
singularity run parjob.sif 1 10
Notice that this only works if you're in the same directory where you built the Singularity image: running singularity automatically mounts your current directory, so it finds the par_job.py script there. Really, you should use:
singularity exec parjob.sif python3 /workdir/par_job.py 1 10
Similarly, you can run bash to explore the contents of the image:
singularity exec parjob.sif bash
- Transfer the Singularity image from your computer to Narval, into ~/projects/def-lpaull/<user_name>/, and create in your home folder a file params that will be used later:
1 10
11 20
21 30
31 40
41 50
51 60
61 70
71 80
81 90
91 100
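Rather than typing it by hand, you can generate this params file with a one-liner (a sketch; run it in your home folder on Narval):
for i in $(seq 1 10 91); do echo "$i $((i + 9))"; done > ~/params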
- We will submit a whole batch of jobs with just one script, simple_ar_job.sh. This will allow us to run our application in parallel across many nodes on Compute Canada. DON'T FORGET TO CHANGE <user_name>!
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --account=def-lpaull
#SBATCH --array=1-10

module load singularity/3.8

# read the line of params matching this array task into a bash array, e.g. PARAMS=(1 10)
PARAMS=($(cat params | head -n $SLURM_ARRAY_TASK_ID | tail -n 1))
echo ${PARAMS[@]}
singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/output parjob.sif python3 /workdir/par_job.py ${PARAMS[0]} ${PARAMS[1]}
The line #SBATCH --array=1-10 tells SLURM that this is a job array and that you want to run 10 parallel jobs. Using --array=1-10%2 says that no more than 2 jobs should run in parallel, and --array=1-10:2 is equivalent to --array=1,3,5,7,9. The line PARAMS=($(cat params | head -n $SLURM_ARRAY_TASK_ID | tail -n 1)) reads the parameters that you want to pass to the Python script from the file params (one line per array task). The option -B ~/projects/def-lpaull/<user_name>/:/output in the singularity exec command mounts the host directory ~/projects/def-lpaull/<user_name>/ to the directory /output inside the container. In a real job, you could bind it inside /workdir/ so that your script can write to it.
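For example, binding your project folder to a hypothetical output subdirectory inside /workdir could look like the sketch below. Note that Singularity has to create the /workdir/output mount point inside the container, which generally works on Compute Canada but may print a warning:
singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/workdir/output parjob.sif \
    python3 /workdir/par_job.py ${PARAMS[0]} ${PARAMS[1]}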
- Now you can submit the script to SLURM!
sbatch simple_ar_job.sh
- When your jobs are running, check the processes for one job on one of the nodes by running:
srun --jobid <job_id> --pty htop -u <user_name>
srun allows you to run something (in our case htop) in parallel with your job, on the same node.
- When the jobs are finished, check the logs in all the slurm-<jobid>.out files. You can also check how efficiently a finished job used its allocated resources with:
seff <job_id>
It is possible to have Slack send you notifications when a job starts, finishes, etc. First, create an email address in Slack (in Preferences, under the messages and media section). Then you can use that email address to let SLURM send you notifications in Slack (they will be sent by the Slackbot). Just insert the following in your .sh job script:
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
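Finally, here is a complete example of a job script for a GPU training run that loads the CUDA modules and runs the training inside a Singularity image: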
#!/bin/bash
#SBATCH --account=def-lpaull # Use an 'rrg-' account if available, for a better chance of getting a GPU
#SBATCH --time=5:00:00 # Time you think your experiment will take. The job gets killed if this time is exceeded. Shorter experiments usually get priority in the queue.
#SBATCH --job-name=aaaa # Job name that will appear when you check the queue from an ssh'd terminal
#SBATCH --ntasks=1 # Number of tasks per node. Generally keep as 1.
#SBATCH --cpus-per-task=8 # CPU cores/threads. 3.5 cores / gpu is standard.
#SBATCH --gres=gpu:a100:2 # Number of GPUs (per node)
#SBATCH --mem=64Gb # RAM allowed per node
#SBATCH --output=%j.out # STDOUT
# Load modules
module load cuda/11.1.1 cudnn/8.2.0
module load singularity/3.8
# Run training
cd $SLURM_TMPDIR/continual-semantic-segmentation
singularity exec --nv --home $SLURM_TMPDIR /home/$USER/projects/def-lpaull/$USER/singularity_image.sif python train.py --configurations
Maintainers: Ali Harakeh, Charlie Gauthier