How to use Compute Canada
- Create a CCDB account. Ask your supervisor (Liam/Glen, CCRI fju-421-02) to add you to their account using your account ID.
- Once your account has been approved, you should be able to SSH into the Compute Canada cluster. Assuming you're logging in to Narval, the command is:
ssh {your-username}@narval.computecanada.ca
For information on the computing clusters, you can search for the hostname on this wiki.
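If you connect often, you can add an entry to your local ~/.ssh/config so you only have to type ssh narval. This is a minimal sketch; the alias narval and the username are placeholders to adapt:
Host narval
    HostName narval.computecanada.ca
    User {your-username}
After that, ssh narval (and scp narval:... and friends) will work from your machine.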
- Now set up Globus so you can transfer files. Start by creating an account (probably using your Google account, since UdeM/Mila aren't recognized by Globus). Then go here and log in. Download Globus Connect Personal and set it up on your machine. Give it a few minutes, and then refresh the web app: your endpoint should show up (you can search for it in the search bar). Finally, search for a CCDB machine. It'll ask you to log in a billion times, but once you're fully logged in, you're done. You can now transfer files from CCDB to your machine and vice versa.
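Globus is the recommended way to move large datasets, but for a handful of small files a plain scp or rsync from your own machine also works. A sketch, where the file and folder names are placeholders:
scp my_script.py {your-username}@narval.computecanada.ca:~/projects/def-lpaull/{your-username}/
rsync -avz ./my_code/ {your-username}@narval.computecanada.ca:~/projects/def-lpaull/{your-username}/my_code/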
- Files are stored in a certain way on Compute Canada. You can learn more about it here. It's recommended that you store all your files according to the following:
  - Code at: ~/projects/def-lpaull/{your-username}
  - Datasets at: ~/projects/def-lpaull/Datasets
  - Singularity images at: ~/projects/def-lpaull/Singularity
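For example, after your first login you can create your personal code folder with the command below (a sketch, assuming the group folder already exists and you have write access to it). On the cluster, $USER expands to your Compute Canada username, as in the example job script at the end of this page.
mkdir -p ~/projects/def-lpaull/$USER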
- Go over Your First Job to submit your first job. Then go over Containerization.
- Your supervisor can have multiple accounts. Both Liam and Glen have default accounts beginning with def- (def-gberseth and def-lpaull). Those accounts usually are not associated with any project-specific allocations. Your supervisor might also have an account that begins with rrg-. Those accounts result from winning resource allocation competitions, and should be used as much as possible.
- As a general rule, def accounts do not have a GPU allocation, meaning you will wait a long time for your job to run on a CPU/GPU node. rrg accounts can have GPU, CPU, or a mix of both as allocations. If your supervisor only has a GPU allocation for their rrg account, you must request a GPU when running a job, as CPU-only jobs will result in error messages.
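In practice, the choice of account only shows up in the --account line of your job scripts. A sketch of the relevant header lines; the exact rrg- account name is a placeholder, ask your supervisor for it:
#SBATCH --account=def-lpaull       # default account: typically no GPU allocation, longer waits
##SBATCH --account=rrg-<account>   # project allocation: prefer it when your supervisor has one (extra # disables the line)
#SBATCH --gres=gpu:1               # request a GPU when running under a GPU allocation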
- Congratulations! You've successfully submitted your first jobs! You can learn more about Compute Canada on the main wiki page.
Your First Job
The first thing to do is to connect to the server:
ssh <your_username>@narval.computecanada.ca
If you can't connect at this point, check the server status.
Though it is not recommended for performance reasons, you can log in with a GUI by using the X11 forwarding option ssh -Y to connect to nodes that enable this feature.
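For example (a sketch; xclock is just a small X11 test program, assuming it is installed on the login node):
ssh -Y <your_username>@narval.computecanada.ca
xclock   # a clock window should open on your local machine if X11 forwarding works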
The execution of one program is called a job, and you will need to write a job script to interact with the HPC. The HPC uses a job scheduler called SLURM, which reads the job script to decide when and where the job will run. A serial job is not parallelized and is the simplest type of job.
- Create a simple job script simple_job.sh that will output a sentence:
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
echo 'Hello HPC world !'
sleep 5s
Lines starting with #SBATCH specify what options you want to give to SLURM.
You can add more options, like the job name (--job-name=test) or the output file name (--output=test-%J.out).
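For example, the same simple_job.sh with those two extra options would look like this (a sketch):
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
#SBATCH --job-name=test          # name shown in the queue
#SBATCH --output=test-%J.out     # output file name pattern
echo 'Hello HPC world !'
sleep 5s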
- Transfer this file with Globus to the server. As mentioned earlier, you should use your <username> folder inside def-<advisor>.
- Submit the job script with SLURM:
sbatch simple_job.sh
- To check the status of the job in the queue (time remaining, finish status, etc.), you can type:
squeue -u <user_name>
- When it is done, the output will be available in a file called slurm-<id_of_job>.out.
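Putting the whole round trip together (a sketch; replace the placeholders with your own values):
sbatch simple_job.sh            # prints something like "Submitted batch job <id_of_job>"
squeue -u <user_name>           # watch the job wait and then run
cat slurm-<id_of_job>.out       # read the output once the job is done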
Containerization
This section is adapted from this.
Since you can't access the internet from the CCDB compute nodes, you need to containerize all your dependencies.
Docker is a common and powerful tool to bundle or "containerize" an application into a virtual environment. This helps you deploy your work easily, without worrying about the environment where it will be used. You can't use Docker on HPCs because you need admin rights to run it, but Singularity doesn't need sudo rights.
Singularity is a "super docker" system. A single image contains everything required to run your script on Compute Canada. It is now known as Apptainer.
- Go find an appropriate Apptainer release.
- If it's a .deb or .rpm or other app image, you can just run it. Otherwise, build from source:
export VERSION=1.0.2 && # adjust this as necessary \
    wget https://github.com/apptainer/apptainer/releases/download/v${VERSION}/apptainer-${VERSION}.tar.gz && \
    tar -xzf apptainer-${VERSION}.tar.gz && \
    cd apptainer
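The source build then typically continues with the steps below; check the INSTALL.md shipped with the release you downloaded for the exact requirements (a recent Go toolchain is needed):
./mconfig && \
    make -C ./builddir && \
    sudo make -C ./builddir install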
- Create a single Python script par_job.py inside your home ~/ that will output the numbers from a to b, one every 10 s:
import sys
import time

# print every integer from a to b (inclusive), waiting 10 seconds between prints
for i in range(int(sys.argv[1]), int(sys.argv[2]) + 1):
    print(i)
    time.sleep(10)
You can test it with python3 par_job.py 1 10 and it should output all the numbers from 1 to 10 over about 100 s.
- Build a Docker image for your script. Dockerfile:
FROM python:3.9-buster
WORKDIR workdir
COPY par_job.py .
ENTRYPOINT ["python3", "par_job.py"]
Build:
docker build -t parjob .
Trial run:
docker run -it parjob 1 10
- Transform your Docker image into an Apptainer image:
sudo APPTAINER_NOHTTPS=1 apptainer build parjob.sif docker-daemon://parjob:latest
sudo chmod 777 parjob.sif
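If your image is published on a registry such as Docker Hub, you can also build the .sif directly from the registry instead of going through your local Docker daemon. A sketch, where python39.sif is an arbitrary output name:
apptainer build python39.sif docker://python:3.9-buster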
- Run par_job.py inside the Apptainer image:
singularity run parjob.sif 1 10
Notice that this only works if you're in the same directory where you built the Singularity image: running singularity automatically mounts your current directory, so it finds the par_job.py script there. Really, you should use:
singularity exec parjob.sif python3 /workdir/par_job.py 1 10
Similarly, you can run bash to explore the contents of the image:
singularity exec parjob.sif bash
- Transfer the Singularity image from your computer to Narval, into ~/projects/def-lpaull/<user_name>/, and create in your home folder a file params that will be used later:
1 10
11 20
21 30
31 40
41 50
51 60
61 70
71 80
81 90
91 100
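Rather than typing it by hand, you can generate this params file with a one-liner (a sketch; run it in your home folder on Narval):
for i in $(seq 1 10 91); do echo "$i $((i + 9))"; done > ~/params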
- We will submit a whole batch of jobs with just one script, simple_ar_job.sh. This will allow us to run our application in parallel across many nodes on Compute Canada. DON'T FORGET TO CHANGE <user_name>!
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --account=def-lpaull
#SBATCH --array=1-10

module load singularity/3.8

# read the line of params matching this array task into a bash array, e.g. PARAMS=(1 10)
PARAMS=($(cat params | head -n $SLURM_ARRAY_TASK_ID | tail -n 1))
echo ${PARAMS[@]}
singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/output parjob.sif python3 /workdir/par_job.py ${PARAMS[0]} ${PARAMS[1]}
The line #SBATCH --array=1-10 tells SLURM that this is a job array and that you want to run 10 parallel jobs. Using --array=1-10%2 says that no more than 2 jobs should run in parallel, and --array=1-10:2 is equivalent to --array=1,3,5,7,9. The line PARAMS=($(cat params | head -n $SLURM_ARRAY_TASK_ID | tail -n 1)) reads the parameters that you want to pass to the Python script from the file params (one line per array task). The option -B ~/projects/def-lpaull/<user_name>/:/output in the singularity exec command mounts the host directory ~/projects/def-lpaull/<user_name>/ to the directory /output inside the container. In a real job, you could bind it inside /workdir/ so that your script can write to it.
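For example, binding your project folder to a hypothetical output subdirectory inside /workdir could look like the sketch below. Note that Singularity has to create the /workdir/output mount point inside the container, which generally works on Compute Canada but may print a warning:
singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/workdir/output parjob.sif \
    python3 /workdir/par_job.py ${PARAMS[0]} ${PARAMS[1]}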
- Now you can submit the script to SLURM!
sbatch simple_ar_job.sh
- When your jobs are running, check the processes for one job on one of the nodes by running:
srun --jobid <job_id> --pty htop -u <user_name>
srun allows you to run something (in our case htop) in parallel with your job, on the same node.
- When the jobs are finished, check the logs in all the slurm-<jobid>.out files. You can also check how efficiently a finished job used its allocated resources with:
seff <job_id>
It is possible to have Slack send you notifications when a job starts, finishes, etc. First, create an email address in Slack (in Preferences, under the messages and media section). Then you can use that email address to let SLURM send you notifications in Slack (they will be sent by the Slackbot). Just insert the following in your .sh job script:
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
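Finally, here is a complete example of a job script for a GPU training run that loads the CUDA modules and runs the training inside a Singularity image: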
#!/bin/bash
#SBATCH --account=def-lpaull # Use an 'rrg-' account if available, for a better chance of getting a GPU
#SBATCH --time=5:00:00 # Time you think your experiment will take. The job gets killed if this time is exceeded. Shorter experiments usually get priority in the queue.
#SBATCH --job-name=aaaa # Job name that will appear when you check the queue from an ssh'd terminal
#SBATCH --ntasks=1 # Number of tasks per node. Generally keep as 1.
#SBATCH --cpus-per-task=8 # CPU cores/threads. 3.5 cores / gpu is standard.
#SBATCH --gres=gpu:a100:2 # Number of GPUs (per node)
#SBATCH --mem=64Gb # RAM allowed per node
#SBATCH --output=%j.out # STDOUT
# Load modules
module load cuda/11.1.1 cudnn/8.2.0
module load singularity/3.8
# Run training
cd $SLURM_TMPDIR/continual-semantic-segmentation
singularity exec --nv --home $SLURM_TMPDIR /home/$USER/projects/def-lpaull/$USER/singularity_image.sif python train.py --configurations
Maintainers: Ali Harakeh, Charlie Gauthier