
How to use Compute Canada


Getting Started

  1. Create a CCDB account. Ask your supervisor (Liam/Glen CCRI fju-421-02) to add you to their account using your account ID.

  2. Once your account has been approved, you should be able to SSH into the Compute Canada clusters. Assuming you're logging in to Narval, the command is: ssh {your-username}@narval.computecanada.ca. For information on the computing clusters, you can search for the hostname on this wiki.

  3. Now set up Globus so you can transfer files. Start by creating an account (probably use your Google account, since UdeM/Mila aren't recognized by Globus). Then, go here and log in. Download Globus Connect Personal and set it up on your machine. Give it a few minutes, and then refresh the web app: your endpoint should show up (you can search for it in the search bar).


    Finally, search for a CCDB machine.


    It'll ask you to log in a billion times, but once you're fully logged in, you're done. You can now transfer files from CCDB to your machine and vice versa.

  4. Files are stored in a certain way on Compute Canada. You can learn more about it here. It’s recommended that you store all your files according to the following:

    • Code at: ~/projects/def-lpaull/{your-username}.
    • Datasets at: ~/projects/def-lpaull/Datasets.
    • Singularity images at: ~/projects/def-lpaull/Singularity.
  5. Go over Your First Job to submit your first job. Then go over Containerization.

  6. Your supervisor can have multiple accounts. Both Liam and Glen have default accounts beginning with def- (def-gberseth and def-lpaull). Those accounts are usually not associated with any project-specific allocations. Your supervisor might also have an account that begins with rrg-. Those accounts result from winning resource allocation competitions and should be used as much as possible.

  7. As a general rule, def accounts do not have a GPU allocation, meaning you will wait a long time for your job to run on a CPU/GPU node. rrg accounts can have a GPU allocation, a CPU allocation, or a mix of both. If your supervisor only has a GPU allocation for their rrg account, you must request a GPU when running a job, as CPU-only jobs will result in error messages (see the sketch at the end of this list).

  8. Congratulations! You’ve successfully submitted your first jobs! You can learn more about Compute Canada on the main wiki page.
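As a concrete illustration of points 6 and 7, switching from a default to a RAC allocation only changes the #SBATCH --account line, and a GPU-only rrg allocation also needs an explicit GPU request. Below is a minimal sketch; the account name rrg-xxxxx is a placeholder for your supervisor's actual account.

#!/bin/bash
#SBATCH --account=rrg-xxxxx   # placeholder: use your supervisor's actual rrg- account
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1          # required when the rrg allocation is GPU-only (point 7)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

nvidia-smi   # quick sanity check that a GPU was actually allocated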

Your First Job (tutorial)

Connecting to the Compute Canada server

The first thing to do is to connect to their server.

ssh <your_username>@narval.computecanada.ca

If you can't connect at this point, check the server status.

Though it is not recommended for performance reasons, you can log in with a GUI by using the X11 forwarding option (ssh -Y) to connect to nodes that enable this feature.
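To avoid retyping the full hostname, you can also add an entry to the ~/.ssh/config file on your own machine. A minimal sketch, where the alias narval and the username are placeholders:

Host narval
    HostName narval.computecanada.ca
    User your-username
    # Equivalent of ssh -Y; only needed if you want X11 forwarding
    ForwardX11 yes
    ForwardX11Trusted yes

With this in place, ssh narval is enough to connect.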

Submit a simple serial job

The execution of one program is called a job, and you will need to write a job script to interact with the HPC clusters. The HPC uses a job scheduler called SLURM, which decides when and where a job will run based on that script. A serial job is not parallelized and is the simplest type of job.

  1. Create a simple job script simple_job.sh that will output a sentence
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
echo 'Hello HPC world !'
sleep 5s

The #SBATCH lines specify the options you want to pass to SLURM. You can add more information, such as the job name (--job-name=test) or the output file (--output=test-%J.out); see the sketch after this list.

  2. Transfer this file with Globus to the server. As mentioned earlier, you should use your <username> folder inside def-<advisor>.

  3. Submit the job script with SLURM

sbatch simple_job.sh
  4. To check the status of the job in the queue (time remaining, finish status, etc.), you can type
squeue -u <user_name>
  5. When it is done, the output will be available in a file called slurm-<id_of_job>.out
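For reference, here is a sketch of the same simple_job.sh with the optional #SBATCH options mentioned above; the job name and output pattern are only examples.

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-lpaull
#SBATCH --job-name=test        # name shown by squeue
#SBATCH --output=test-%J.out   # %J is replaced by the job ID
echo 'Hello HPC world !'
sleep 5s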

Containerization

(This section is adapted from this.)

Since you can't access the internet from the CCDB compute nodes, you need to containerize all your dependencies.

Docker is a common and powerful tool to bundle or "containerize" an application into a virtual environment. This helps you deploy your work easily, without worrying about the environment where it will be used. You can't use Docker on the HPCs because you need admin rights to run it, but Singularity doesn't need sudo rights.

Singularity is a "super docker" system. A single image contains everything required to run your script on Compute Canada. It is now known as Apptainer.
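Note that the clusters already provide Singularity/Apptainer as modules, so a local installation is only needed to build images on your own machine. To see what is available on a cluster (module names and versions vary per cluster, so treat these as examples):

module spider singularity    # list the Singularity/Apptainer modules and versions
module load singularity/3.8  # the version used in the job scripts below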

Installing Apptainer

  1. Find an appropriate release on the Apptainer GitHub releases page.
  2. If there is a .deb, .rpm, or other package for your system, you can simply install it. Otherwise, download and extract the source, then build it (see the sketch below):
    export VERSION=1.0.2  # adjust this as necessary
    wget https://github.com/apptainer/apptainer/releases/download/v${VERSION}/apptainer-${VERSION}.tar.gz && \
    tar -xzf apptainer-${VERSION}.tar.gz && \
    cd apptainer-${VERSION}
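    Once extracted, the remaining build steps look roughly like the following (a sketch: building from source requires Go and a C compiler, and the exact steps can differ between releases, so check the INSTALL.md shipped with your release):

    ./mconfig && \
    make -C builddir && \
    sudo make -C builddir install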
    

Submit a simple Apptainer job

  1. Create a simple Python script par_job.py inside your home directory ~/ that will output the numbers from a to b, one every 10 s

    import sys
    import time
    
    # print every integer from a (inclusive) to b (inclusive), pausing 10 s between prints
    for i in range(int(sys.argv[1]), int(sys.argv[2]) + 1):
        print(i)
        time.sleep(10)
    

    You can test it with python3 par_job.py 1 10 and it should output all the numbers from 1 to 10 after 100s

  2. Build a Docker image for your script. Dockerfile:

    FROM python:3.9-buster
    WORKDIR /workdir
    COPY par_job.py .
    ENTRYPOINT ["python3", "par_job.py"]
    

    Build:

    docker build -t parjob .
    

    Trial run:

    docker run -it parjob 1 10
    
  3. Transform your Docker image into an Apptainer image

    sudo APPTAINER_NOHTTPS=1 apptainer build parjob.sif docker-daemon://parjob:latest
    sudo chmod 777 parjob.sif
    
  4. Run par_job.py inside the Apptainer image

    singularity run parjob.sif 1 10
    

    Notice that this only works if you're in the same directory where you built the Singularity image: running singularity automatically mounts your current directory, so it finds the par_job.py script there. Really, you should use

    singularity exec parjob.sif python3 /workdir/par_job.py 1 10
    

    Similarly, you can run bash to explore the contents of the image.

    singularity exec parjob.sif bash
    
  5. Transfer the Singularity image from your computer to Narval under ~/projects/def-lpaull/<user_name>/, and create a file called params in your home folder that will be used later

    1 10
    11 20
    21 30
    31 40
    41 50
    51 60
    61 70
    71 80
    81 90
    91 100
    
  6. We will submit a whole batch of jobs with just one script, simple_ar_job.sh. This allows us to run our application in parallel across many nodes on Compute Canada. DON'T FORGET TO CHANGE <user_name>!

    #!/bin/bash
    #SBATCH --time=00:20:00
    #SBATCH --account=def-lpaull
    #SBATCH --array=1-10
        
    module load singularity/3.8
    PARAMS=($(head -n $SLURM_ARRAY_TASK_ID params | tail -n 1))
    echo "${PARAMS[@]}"
    
    singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/output parjob.sif python3 /workdir/par_job.py ${PARAMS[0]} ${PARAMS[1]}
    

    The line #SBATCH --array=1-10 tells SLURM that this is a job array and that you want to run 10 parallel jobs. Using --array=1-10%2 means that no more than 2 jobs run in parallel at a time; --array=1-10:2 is equivalent to --array=1,3,5,7,9.

    PARAMS=($(head -n $SLURM_ARRAY_TASK_ID params | tail -n 1)) reads line number $SLURM_ARRAY_TASK_ID from the file params and splits it into the two parameters that are passed to the Python script.

    The option singularity --quiet exec -B ~/projects/def-lpaull/<user_name>/:/output mounts the host directory ~/projects/def-lpaull/<user_name>/ onto the container directory /output. In a real job, you could bind it inside /workdir/ so that your script can write to it.

  7. Now you can submit the script to SLURM!

    sbatch simple_ar_job.sh
    
  8. When your jobs are running, you can check the processes for one job on one of the nodes by running

    srun --jobid <job_id> --pty htop -u <user_name>
    

    srun lets you run a command (in our case htop) inside an existing job allocation, i.e. on the node where that job is running.

  9. When the jobs are finished, check the logs in the slurm-<jobid>_<taskid>.out files and the resource usage of each job with seff (a few more monitoring commands are sketched after this list)

    seff <job_id>
    

    It is possible to have Slack send you notifications when a job starts, finishes, etc. First create an email address in Slack (in Preferences, under the messages and media section). Then you can use the provided email address to let SLURM send you notifications in Slack (they will be sent by the Slackbot). Just insert the following in your .sh job script:

    #SBATCH --mail-user=<your-slack-email-address>
    #SBATCH --mail-type=BEGIN
    #SBATCH --mail-type=END
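    Beyond squeue and seff, sacct is useful for checking every task of the array after it has run. A sketch, where <job_id> is the ID of the array job:

    # one line per array task: state, runtime and peak memory usage
    sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS

    # the output of each task ends up in slurm-<job_id>_<task_id>.out
    cat slurm-<job_id>_*.out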
    

Example Job Script:

#!/bin/bash
#SBATCH --account=def-lpaull # Use 'rrg' account for better chance to run GPU
#SBATCH --time=5:00:00  # Time you think your experiment will take. Experiment gets killed if this time is exceeded. Shorter experiments usually get priority in queue.
#SBATCH --job-name=aaaa #Job name that will appear from ssh'd terminal
#SBATCH --ntasks=1 # Number of tasks per node. Generally keep as 1.
#SBATCH --cpus-per-task=8           # CPU cores/threads. 3.5 cores / gpu is standard.
#SBATCH --gres=gpu:a100:2                  # Number of GPUs (per node)
#SBATCH --mem=64G                  # RAM allowed per node
#SBATCH --output=%j.out   # STDOUT

# Load modules
module load cuda/11.1.1 cudnn/8.2.0
module load singularity/3.8

# Run training
cd $SLURM_TMPDIR/continual-semantic-segmentation
singularity exec --nv --home $SLURM_TMPDIR /home/$USER/projects/def-lpaull/$USER/singularity_image.sif python train.py --configurations
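Note that this script assumes the training code is already present under $SLURM_TMPDIR. A common pattern is to stage code and data onto node-local storage at the start of the job, since $SLURM_TMPDIR is typically much faster than the shared project filesystem. A sketch with hypothetical paths (these lines would go before the cd above):

# Hypothetical staging step: copy the code and extract a dataset archive to node-local storage
cp -r /home/$USER/projects/def-lpaull/$USER/continual-semantic-segmentation $SLURM_TMPDIR/
tar -xf /home/$USER/projects/def-lpaull/Datasets/my_dataset.tar -C $SLURM_TMPDIR/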

Maintainers: Ali Harakeh, Charlie Gauthier