Compute Canada servers
Compute Canada has several general-purpose clusters and a large parallel supercomputer that you can register for. You will need to contact Morgan, as you need the Compute Canada Role Identifier of your sponsor in order to register. In my experience it then took a few hours for my account to be confirmed. Note that the username you sign up with is the one you will use to connect to the servers.
Server | Beluga | Cedar | Graham | Niagara |
---|---|---|---|---|
Home storage | 50 GB per user | 526 TB (not allocated) | 64 TB (not allocated) | 200 TB |
Scratch storage | 20 TB per user | 5.4 PB (not allocated) | 3.6 PB (not allocated) | 12.5 PB |
Project storage | 1 TB per group | 23 PB (adjustable) | 16 PB (adjustable) | 3.5 PB |
Other storage | | | | Burst buffer 232 TB<br>Archive 20 PB |
Node characteristics (nodes: cores / memory per node) | 172: 40 cores / 92 GB<br>516: 40 cores / 186 GB<br>12: 40 cores / 752 GB<br>172: 40 cores / 186 GB | 576: 32 cores / 125 GB<br>96: 32 cores / 250 GB<br>24: 32 cores / 502 GB<br>24: 32 cores / 1510 GB<br>4: 32 cores / 3022 GB<br>114: 24 cores / 125 GB<br>32: 24 cores / 250 GB<br>192: 32 cores / 187 GB<br>640: 48 cores / 187 GB<br>768: 48 cores / 187 GB | 902: 32 cores / 125 GB<br>24: 32 cores / 502 GB<br>56: 32 cores / 250 GB<br>3: 64 cores / 3022 GB<br>160: 32 cores / 124 GB<br>7: 28 cores / 178 GB<br>2: 40 cores / 377 GB<br>6: 16 cores / 192 GB<br>30: 44 cores / 192 GB<br>72: 44 cores / 192 GB | 2024: 40 cores / 202 GB |
Job characteristics | 1 hour minimum<br>7 days maximum<br>1000 jobs per user | | | 1 hour minimum<br>1000 jobs per user |
This is just a short summary; there are further details on each on the Compute Canada wiki pages: Beluga, Cedar, Graham & Niagara. Note that Niagara access isn't granted by default, and you will need to request it if you need it (this can be done through My Account/Request access to other clusters).
There are also Cloud sites, but I'm not covering those here.
Each of these uses Slurm for job scheduling.
The login address for each of these is:

- username@beluga.computecanada.ca
- username@cedar.computecanada.ca
- username@graham.computecanada.ca
- username@niagara.computecanada.ca

Your password is the same one that you use to log in to the Compute Canada website.
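For example, to connect to Graham you would run something like the following (replace rwright with your own username):

```bash
# Connect to the Graham login node
ssh rwright@graham.computecanada.ca
```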
You can list all of the jobs that are either running or scheduled to run using `squeue`, or only the jobs that you have scheduled using `sq`. The output will look something like this:
(base) [rwright@gra-login2 scratch]$ sq
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
44651061 rwright def-mlangill_cpu PGPC_0034.job PD 1-00:00:00 1 40 N/A 25G (Priority)
This shows you a range of information, including the job ID, whether it is running or not (ST = status; PD = pending), the resources you have requested and the reason it is or isn't currently running.
You can also list only running or pending jobs with `sq -t RUNNING` or `sq -t PENDING`, and get detailed information on a job by running `scontrol show job -dd $JOBID`. You can cancel a job by running `scancel $JOBID`, e.g. `scancel 44651061`.
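As a rough sketch of how these commands fit together (using the job ID from the example above; the last command uses scancel's standard -u/-t filters to cancel everything you have queued):

```bash
# List only your pending jobs
sq -t PENDING

# Show the full details for one job
scontrol show job -dd 44651061

# Cancel that job
scancel 44651061

# Or cancel all of your own pending jobs in one go
scancel -u $USER -t PENDING
```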
You can install things as you would on Vulcan, but to avoid needing to specify the path for tools that may not be globally installed, I have found the easiest way is to install things within conda environments, which you can then activate within a job script. Note that you will probably need to reconnect (log out and back in) before conda or any newly installed packages are ready to use.
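As a minimal sketch, setting up an environment like the kneaddata one used in the job script below might look like this (the channels and package name here are just an example; adjust for whatever tools you need):

```bash
# Create a conda environment and install the tools you need into it
conda create -n kneaddata -c bioconda -c conda-forge kneaddata

# Activate it so the tools are on your PATH
conda activate kneaddata
```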
You can run quick tests on the login node to check that things are installed properly (without submitting jobs), but these have a time limit of 5 minutes and will automatically be cancelled if they run longer than this.
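For example, a quick check that stays well under that limit might just be activating the environment and confirming the tool responds (assuming the kneaddata environment above):

```bash
conda activate kneaddata
kneaddata --help   # should print the usage message if the install worked
```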
The easiest way to submit jobs is by creating a job file that looks something like this:
#!/bin/bash
#SBATCH --job-name=PGPC_0036.job
#SBATCH --output=/home/rwright/scratch/out/PGPC_0036.out
#SBATCH --error=/home/rwright/scratch/out/PGPC_0036.err
#SBATCH --mem=25G
#SBATCH --time=0-24:00
#SBATCH --cpus-per-task=40
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
source /home/rwright/.bashrc
conda activate kneaddata
python run_single_participant.py PGPC_0036
In this case, I've named the file PGPC_0036.job and it can be submitted to run using `sbatch PGPC_0036.job`.
So in this file we have set the job name, the location for output and error files, the memory needed, the maximum time, the CPUs (threads) that I want to be able to use, the email address to send updates to, which updates to send and then the commands to be run. At a minimum, you need to set the time to run and the command to be run.
- Output: this file will look like the terminal output would for the commands that you run
- Error: this file gives error messages for any commands that were unable to be run
- Email: Using ALL means that you will be sent emails when your job starts running and when it ends (including whether it completed successfully or failed due to an error). You will also receive an email if you cancel your job. You don't need to include this, but I have found it especially useful for quickly finding out whether my jobs were successful.
- Commands to run: You can see that I've set the bash profile for my user, activated the conda environment that has all of the packages I will use installed, and then run my python script. To ensure that everything runs properly, it is easiest to give the full path to any files called by your scripts or tools. If you are running lots of similar jobs then you might want to write a script that creates and submits these job files for you, as sketched below.
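For example, a minimal sketch of such a script (the sample IDs, the jobs directory and the full path to the python script are hypothetical, following the pattern of the job file above):

```bash
#!/bin/bash
# Create and submit one job file per sample (sample IDs below are just examples)
for sample in PGPC_0036 PGPC_0037 PGPC_0038; do
    job_file=/home/rwright/scratch/jobs/${sample}.job   # hypothetical directory for the job files
    cat > ${job_file} << EOF
#!/bin/bash
#SBATCH --job-name=${sample}.job
#SBATCH --output=/home/rwright/scratch/out/${sample}.out
#SBATCH --error=/home/rwright/scratch/out/${sample}.err
#SBATCH --mem=25G
#SBATCH --time=0-24:00
#SBATCH --cpus-per-task=40
source /home/rwright/.bashrc
conda activate kneaddata
python /home/rwright/scratch/run_single_participant.py ${sample}
EOF
    sbatch ${job_file}
done
```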
You can check the output of a completed job using `seff $JOBID`, which may look something like:
(base) [rwright@gra-login2 scratch]$ seff 44273374
Job ID: 44273374
Cluster: graham
User/Group: rwright/rwright
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 13-06:18:07
CPU Efficiency: 58.48% of 22-16:17:20 core-walltime
Job Wall-clock time: 13:36:26
Memory Utilized: 136.10 GB
Memory Efficiency: 68.05% of 200.00 GB
This shows me that my job completed, that it ran on 1 node with 40 cores, the CPU time it used and how efficiently it used it, the time it took to run (wall-clock time), the maximum amount of memory it used and what proportion of the memory I requested this was. If you aren't sure how much memory or time to request for a bunch of jobs, you can give one job a much higher time/memory allowance than you expect it to need and then check the output. You can also try to tailor the amount of memory that you request to what is available on each node: the more nodes on that cluster that are capable of running your job, the quicker it is likely to be run.
You can get more detailed information with `sacct -j $JOBID`.
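If you only want a few of the accounting fields, you can pick them with sacct's standard --format option, e.g. (job ID from the example above; the field names are standard Slurm ones):

```bash
# Show selected accounting fields for a finished job
sacct -j 44273374 --format=JobID,JobName,State,Elapsed,MaxRSS,AllocCPUS
```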
Rather than creating a job file, you can also submit the same job directly from the command line:
sbatch --job-name=job_name.job --output=/home/rwright/scratch/out/job_name.out --error=/home/rwright/scratch/out/job_name.err --mem=25G --time=0-24:00 job_script.sh
You can run jobs interactively with `salloc` by running something like:
salloc --time=1:0:0 --ntasks=20 --mem-per-cpu 50G
salloc: Pending job allocation 15038192
salloc: job 15038192 queued and waiting for resources
When the resources for running this are available, you will be able to use the terminal to run things within the limits that you've set (rather than the 5-minute limit for tests on the login node).
You can quit this again by using:
exit
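Putting that together, an interactive session might look something like this (a sketch that reuses the conda environment and script from the job file above, with a smaller resource request than the example):

```bash
# Request an interactive allocation: 1 hour, 4 tasks, 4 GB per CPU
salloc --time=1:0:0 --ntasks=4 --mem-per-cpu=4G

# Once the allocation starts you get a shell on the compute node; run things directly
conda activate kneaddata
python run_single_participant.py PGPC_0036

# Give the allocation back when you're done
exit
```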
There is more information on running/submitting jobs on Compute Canada here, and there are a range of other tutorials on using Slurm elsewhere, e.g. here.