
JASMIN GPU Guide


JASMIN contains two GPU partitions: a testing partition of 3 GPU nodes (two nodes with 2 Nvidia V100 GPUs each and one node with 4 V100s; each V100 has 32GB of VRAM), and a larger partition which is still being opened up for general use. Jobs are submitted through the SLURM queue manager, and the GPUs can be used in both interactive mode (a terminal with access to the GPU) and batch mode (a predefined script submitted to run on the GPU). Currently, you are required to contact JASMIN support in order to use the GPUs.

The most important software on the nodes is:

  • CUDA support, including relatively up-to-date GPU drivers

  • cuDNN (Deep Neural Network Library)

  • NVIDIA container runtime (nvidia-docker), which allows containers to access GPUs

  • Singularity 3.4.1 - a container manager for clusters that supports GPU containers

The simplest way to run programs on the GPU nodes is to submit a job to the lotus_gpu queue and then, within the job, activate a conda environment installed in your home directory and run your program. Alternatively, you can open a Singularity container with the dependencies you need inside the job and run your program there, but this is more complex.

CUDA Compatibility and Machine Learning Libraries

As of the time of writing, ls -l /usr/local | grep cuda from a GPU node shows the list of installed CUDA versions, and that cuda maps to version 10.1. However, this is misleading, since you can use more recent versions by installing packages that bundle their own CUDA runtime, such as PyTorch. nvidia-smi shows, in the top right corner, the latest CUDA version compatible with the GPU drivers on the cluster. This means that when installing PyTorch, you want the latest release whose cudatoolkit version is no newer than the CUDA version reported by nvidia-smi.
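To check this yourself from a GPU node, the two commands mentioned above can be run directly (a minimal sketch; the exact output depends on the node):

ls -l /usr/local | grep cuda   # list installed CUDA toolkits and the default cuda symlink
nvidia-smi                     # "CUDA Version" in the top right is the newest version the drivers support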

At the time of writing, nvidia-smi shows CUDA 11.6 in the top right (because this is the version supported by the GPU drivers on the cluster), so to install the latest PyTorch you would use conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch, since 11.3 is the latest CUDA version that PyTorch builds are available for. To verify the installation, run python -c "import torch;print(torch.version.cuda)"; this should print the CUDA version PyTorch is using ("11.3" in this example).

Note that the JAX machine learning framework does not come bundled with CUDA, so you may need to install the version of JAX (and jaxlib) compatible with the default CUDA on the JASMIN GPU nodes (10.1 as of the time of writing, as described above).
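Whichever version you install, a quick way to confirm that JAX can see the GPU (a hedged check; it assumes JAX is already installed in your active environment) is:

python -c "import jax; print(jax.devices())"

If GPU support is working, this should list a GPU device rather than only CPU devices.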

Interactive mode guide

Once you have access to the lotus_gpu queue, you can launch an interactive terminal on a GPU node with the command

salloc --gres=gpu:1 --partition=lotus_gpu --account=lotus_gpu

and then run

srun --pty /bin/bash

You are now on a GPU node (use nvidia-smi to see the GPU available to you). The full JASMIN filesystem is available from here (home directory, group workspaces, CEDA archive, etc.). If you followed the instructions in the Conda configuration tutorial, conda should activate automatically, and you can run your programs on the GPU from here as you would expect.
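Putting this together, a typical interactive session might look like the following (a sketch; myenv and train.py are placeholders for your own conda environment and script):

salloc --gres=gpu:1 --partition=lotus_gpu --account=lotus_gpu
srun --pty /bin/bash
nvidia-smi            # check the GPU you have been allocated
conda activate myenv  # activate your environment
python train.py       # run your program on the GPU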

Batch mode guide

You can submit jobs directly from the command line by giving sbatch the command that runs your program along with some flags to customise the job options. However, the preferred way is to submit a script containing the SLURM options followed by the sequence of commands that run your program. An example script is shown below:

#!/bin/bash
#SBATCH --partition=lotus_gpu
#SBATCH --account=lotus_gpu
#SBATCH --gres=gpu:1 # Request a number of GPUs (here, one)
#SBATCH --time=12:00:00 # Set a runtime for the job in HH:MM:SS
#SBATCH --mem=32000 # Set the amount of memory for the job in MB.

conda activate myenv
srun python src/main.py

Note that wrapping your Python script in srun within the batch script ensures that the parallel environment is properly set up.

A detailed list of options can be found here, including options for specifying the job name, output file, etc. IMPORTANT: the GPU queue currently has a maximum runtime of 168 hours. Also note that the default memory allocation is 8192MB, so in either interactive or batch mode your program will be killed if it uses more than this; setting --mem=32000 is a good start.
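Assuming the script above is saved as gpu_job.sh (a placeholder name), you would submit and monitor it with the standard SLURM commands:

sbatch gpu_job.sh   # submit the job; prints the job ID
squeue -u $USER     # check the status of your queued and running jobs
scancel <job-id>    # cancel a job if necessary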

Libraries that integrate with SLURM

This is a non-exhaustive list of Python libraries designed for GPU work which have SLURM integrations.

Submitit and Hydra

Submitit is a Python library by Facebook AI Research that lets you submit SLURM jobs from Python code, and therefore submit many jobs programmatically. This can be useful for launching large batches of jobs, checkpointing jobs just before they time out and automatically re-queuing them, etc. The Hydra library, which is an argparse replacement for Python, integrates with Submitit, giving easy access to all the SLURM parameters as Python arguments.

Pytorch Lightning

Pytorch Lightning is a wrapper library around Pytorch which greatly reduces the boilerplate for many common tasks, such as parallelising code over multiple GPUs. As described here, Lightning can automatically distribute your program across multiple nodes in a GPU cluster. It can also automatically checkpoint just before your job time is about to expire, requeue the job, and resume from the checkpoint. Lightning also has a Python interface to SLURM somewhat like Submitit, so this is another way to queue multiple jobs from within Python.

Singularity

Singularity is a way of running your programs inside containers on a compute cluster. It has some nice features to be aware of:

  • The home directory is automatically mounted, so unlike Docker you aren't working in the isolated container's filesystem by default. Other directories like the AI4ER group workspace can be mounted manually.
  • Singularity has its own container library, but most importantly it can pull containers from Dockerhub and use them as Singularity containers. This means you will usually want to use standard Docker containers with a Python installation, such as the latest Anaconda, Pytorch, or Tensorflow images.
  • Singularity is designed to easily run jobs across multiple nodes, using MPI. The usual way this is done is to use the MPI installation from outside the container (on the GPU cluster), to automatically distribute tasks inside the container according to the number of nodes and GPUs requested in the SLURM queue.

More details on the benefits of using Singularity for GPU computing are here and here. The Singularity documentation can be found here.

Retrieving Singularity Containers

Pulling a container simply saves it to your home directory, but for large containers it is sometimes necessary to download them from a GPU node rather than from the sci server. Use the command

singularity pull <container-name>.sif docker://<docker-container-tag>

Replace docker:// with shub:// to pull from the Singularity container hub.
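For example, to fetch the official PyTorch image from Dockerhub (an illustrative tag; any Docker image containing the Python stack you need will work), you could run:

singularity pull pytorch-latest.sif docker://pytorch/pytorch:latest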

Running Singularity Containers

There are then two important Singularity commands: shell and exec. Both commands take two important flags: --nv and -B. The first enables NVIDIA GPU support inside the container - this is VERY IMPORTANT, so you should always use it. The second lets you mount a directory into the container.

shell runs your container and drops you into an interactive shell inside it. For example:

singularity shell --nv -B /gws/nop4/j04/ai4er pytorch-latest.sif

The above command opens a shell inside the pytorch-latest container pulled from Dockerhub, enables GPU support, and mounts the AI4ER group workspace at the same path (/gws/nop4/j04/ai4er) inside the container. You can mount it at a simpler path such as /ai4er by separating the two paths with a colon after the -B flag, e.g. -B /gws/nop4/j04/ai4er:/ai4er.

exec will launch a container and then execute the command appended to it. This is the standard way of running programs in containers without using an interactive shell. For example:

singularity exec --nv pytorch-latest.sif python main.py --mode=train --learning_rate=1e-3 --batch_size=256 --data_dir=/data/imagenet

The above command will run the container with GPU support and execute the python command inside the container with the specified arguments. You may, of course, need to install dependencies that are not in your container image - this can be done by writing a bash script that installs the dependencies and then runs your program, and calling that script at the end of singularity exec (see the sketch below). A singularity exec command like this is what you would use at the end of your SLURM submission script to run your program.
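As a rough sketch of that pattern (run.sh and the package names are hypothetical; adapt them to your project), the wrapper script might look like:

#!/bin/bash
# run.sh - install extra dependencies into the (mounted) home directory, then run the program
pip install --user einops wandb   # example packages missing from the container image
python main.py --mode=train

which you would then call with:

singularity exec --nv pytorch-latest.sif bash run.sh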

The above commands can be used in interactive mode and in batch mode (inside a script).