
Compute Canada servers


Compute Canada has several general purpose clusters and a large parallel supercomputer that you can register for. You will need to contact Morgan, as you need the Compute Canada Role Identifier of your sponsor to be able to register. In my experience it then took a few hours for my account to be confirmed. Note that the username you sign up with is the one you will use to connect to the servers.

Clusters available

Beluga
Home storage: 50 GB per user
Scratch storage: 20 TB per user
Project storage: 1 TB per group
Node characteristics:
172 nodes: 40 cores / 92 GB
516 nodes: 40 cores / 186 GB
12 nodes: 40 cores / 752 GB
172 nodes: 40 cores / 186 GB
Job characteristics: 1 hour minimum, 7 days maximum, 1000 jobs per user

Cedar
Home storage: 526 TB (not allocated)
Scratch storage: 5.4 PB (not allocated)
Project storage: 23 PB (adjustable)
Node characteristics:
576 nodes: 32 cores / 125 GB
96 nodes: 32 cores / 250 GB
24 nodes: 32 cores / 502 GB
24 nodes: 32 cores / 1510 GB
4 nodes: 32 cores / 3022 GB
114 nodes: 24 cores / 125 GB
32 nodes: 24 cores / 250 GB
192 nodes: 32 cores / 187 GB
640 nodes: 48 cores / 187 GB
768 nodes: 48 cores / 187 GB
Job characteristics: 1 hour minimum, 7 days maximum, 1000 jobs per user

Graham
Home storage: 64 TB (not allocated)
Scratch storage: 3.6 PB (not allocated)
Project storage: 16 PB (adjustable)
Node characteristics:
902 nodes: 32 cores / 125 GB
24 nodes: 32 cores / 502 GB
56 nodes: 32 cores / 250 GB
3 nodes: 64 cores / 3022 GB
160 nodes: 32 cores / 124 GB
7 nodes: 28 cores / 178 GB
2 nodes: 40 cores / 377 GB
6 nodes: 16 cores / 192 GB
30 nodes: 44 cores / 192 GB
72 nodes: 44 cores / 192 GB
Job characteristics: 1 hour minimum, 7 days maximum, 1000 jobs per user

Niagara
Home storage: 200 TB
Scratch storage: 12.5 PB
Project storage: 3.5 PB
Other storage: Burst buffer 232 TB; Archive 20 PB
Node characteristics:
2024 nodes: 40 cores / 202 GB
Job characteristics: 1 hour minimum, 1000 jobs per user

This is just a short summary; there are further details on each on the Compute Canada wiki pages: Beluga, Cedar, Graham & Niagara. Note that Niagara access doesn't come by default, and you will need to request access if you need it (this can be done through My Account/Request access to other clusters).
There are also Cloud sites, but I'm not covering those here.

Each of these uses Slurm for job scheduling.

General notes

The logon for each of these is:
[email protected]
[email protected]
[email protected]
[email protected]

Your password will be the same one that you use to log in to the website.
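
For example, connecting to Graham from your own machine looks something like this (replace rwright with your own username):

ssh [email protected]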

You can list all of the jobs that are either running or scheduled to run using squeue, or only the jobs that you have scheduled to run using sq. The output will look something like this:

(base) [rwright@gra-login2 scratch]$ sq
          JOBID     USER              ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON) 
       44651061  rwright     def-mlangill_cpu  PGPC_0034.job  PD 1-00:00:00     1   40        N/A     25G  (Priority)

So this shows you a range of information, including the job ID, whether it is running or not (ST=status, PD=pending), what you have set it to run on and why it is or isn't running currently.

You can also list only pending or running jobs with sq -t PENDING or sq -t RUNNING, and get detailed information on a job by running scontrol show job -dd $JOBID

You can cancel a job by running scancel $JOBID, e.g. scancel 44651061
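
Putting these together, a typical check on the pending job shown above might look something like this:

sq -t PENDING                     # list only your pending jobs
scontrol show job -dd 44651061    # detailed information on that job
scancel 44651061                  # cancel it if it is no longer needed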

Setting up and installing things

You can install things as you would on Vulcan, but to avoid having to specify the path to tools that may not be globally installed, I have found the easiest way is to install things within conda environments, which you can then activate within a job script. Note that you will probably need to reconnect before conda or any newly installed packages are ready to use.
You can run tests to check that things are installed properly (without submitting jobs), but these have a time limit of 5 minutes and will be automatically cancelled if they run for longer than this.
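
As a rough sketch, setting up the kneaddata environment used in the job script below could look something like this (the bioconda channel is an assumption; install your tools from whichever channel, or with pip, as appropriate):

conda create -n kneaddata
conda activate kneaddata
conda install -c bioconda kneaddata
# log out and reconnect before using the new environment in a job script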

Submitting jobs

Job files

The easiest way to submit jobs is by creating a job file that looks something like this:

#!/bin/bash
#SBATCH --job-name=PGPC_0036.job
#SBATCH --output=/home/rwright/scratch/out/PGPC_0036.out
#SBATCH --error=/home/rwright/scratch/out/PGPC_0036.err
#SBATCH --mem=25G
#SBATCH --time=0-24:00
#SBATCH --cpus-per-task=40
#SBATCH [email protected]
#SBATCH --mail-type=ALL
source /home/rwright/.bashrc
conda activate kneaddata
python run_single_participant.py PGPC_0036

In this case, I've named the file PGPC_0036.job and it can be submitted to run using sbatch PGPC_0036.job. In this file we have set the job name, the location for output and error files, the memory needed, the maximum time, the number of CPUs (threads) that I want to be able to use, the email address to send updates to, which updates to send, and then the commands to be run. At a minimum, you need to set the time to run and the command to be run.

  • Output: this file will look like the terminal output would for the commands that you run
  • Error: this file gives error messages for any commands that failed to run
  • Email: using ALL means that you will be sent emails when your job starts running and when it ends (including whether it completed successfully or failed due to an error). You will also receive an email if you cancel your job. You don't need to have this, but I have found it especially useful for quickly finding out whether jobs were successful.
  • Commands to run: you can see that I've sourced the bash profile for my user, activated the conda environment that has all of the packages I will use installed, and then run my python script. To ensure that everything runs properly, it is easiest to give the full path to any files called by your scripts or the tools used. If you are running lots of similar jobs, you might want to write a script that creates and submits these job files for you; a sketch of this is shown after this list.
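
As a rough sketch, a wrapper like the one below would write and submit one job file per sample; the sample IDs are placeholders and the resource settings are copied from the example above, so adjust both for your own data:

#!/bin/bash
# Write and submit one job file per sample (sample IDs are placeholders)
for SAMPLE in PGPC_0034 PGPC_0035 PGPC_0036; do
    cat > ${SAMPLE}.job << EOF
#!/bin/bash
#SBATCH --job-name=${SAMPLE}.job
#SBATCH --output=/home/rwright/scratch/out/${SAMPLE}.out
#SBATCH --error=/home/rwright/scratch/out/${SAMPLE}.err
#SBATCH --mem=25G
#SBATCH --time=0-24:00
#SBATCH --cpus-per-task=40
source /home/rwright/.bashrc
conda activate kneaddata
python run_single_participant.py ${SAMPLE}
EOF
    sbatch ${SAMPLE}.job
done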

You can check the output of a completed job using seff $JOBID, which may look something like:

(base) [rwright@gra-login2 scratch]$ seff 44273374
Job ID: 44273374
Cluster: graham
User/Group: rwright/rwright
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 13-06:18:07
CPU Efficiency: 58.48% of 22-16:17:20 core-walltime
Job Wall-clock time: 13:36:26
Memory Utilized: 136.10 GB
Memory Efficiency: 68.05% of 200.00 GB

So this shows me that my job completed, that it ran on 1 node with 40 cores, the CPU time it used and how efficiently it used it, the time it took to run (wall-clock time), the maximum amount of memory it used and what fraction of the memory I requested that was. If you aren't sure how much memory or time to request for a bunch of jobs, you can run one job with a much higher time/memory allowance than you expect to need and then check the output. You can also try to tailor the amount of memory you request to what is available on each node - the more nodes on that cluster that are capable of carrying out your job, the quicker it is likely to be run.

You can get more detailed information with: sacct -j $JOBID
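
For example (the format fields here are just a selection of the many that sacct can report):

sacct -j 44273374 --format=JobID,JobName,Elapsed,MaxRSS,State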

Running jobs from command line

You can also run the same thing by submitting all of this from the command line:

sbatch --job-name=job_name.job --output=/home/rwright/scratch/out/job_name.out --error=/home/rwright/scratch/out/job_name.err --mem=25G --time=0-24:00 job_script.sh

Running interactive jobs

You can run jobs interactively with salloc by running something like:

salloc --time=1:0:0 --ntasks=20 --mem-per-cpu 50G
salloc: Pending job allocation 15038192
salloc: job 15038192 queued and waiting for resources

When the resources for running this are available, you will be able to use the terminal to run things within the restrictions that you've set (rather than the 5 minute maximum mentioned above). You can quit this again by using: exit

More information

There is more information on running/submitting jobs on Compute Canada here, and there is a range of other tutorials on using Slurm elsewhere, e.g. here.
