Machine learning benchmarks

Collection of various machine learning benchmarks together with Slurm scripts for CSC's supercomputers.

The benchmarks themselves (Python code) can be found in the benchmarks directory. Main run scripts are in the root directory as *.sh files. The Slurm settings have been separated into their own scripts in the slurm directory.

Typical usage would be to first select a benchmark (e.g., PyTorch synthetic) and then appropriate Slurm settings (e.g., 4 GPUs on Mahti, single node, no MPI). The command would then be:

sbatch slurm/mahti-gpu4.sh pytorch-synthetic.sh

Available run scripts

Slurm run scripts can be found in the slurm directory. They are named [puhti|mahti]-[cpu|gpu]N.sh, where N is the number of CPUs or GPUs reserved.

All scripts are single-node, single MPI task unless the name ends with -mpi.sh. Scripts with the -mpi.sh ending launch a separate MPI task for each GPU, assuming 4 GPUs per node. For example, mahti-gpu8-mpi.sh reserves two nodes with 4 GPUs (and thus 4 MPI tasks) per node, giving a total of 8 GPUs (and 8 MPI tasks).

Available benchmarks

| Benchmark             | Script name              | Data               |
|-----------------------|--------------------------|--------------------|
| PyTorch synthetic     | pytorch-synthetic.sh     | synthetic          |
| PyTorch DDP           | pytorch-ddp.sh           | synthetic/ImageNet |
| PyTorch DDP Lightning | pytorch-ddp-lightning.sh | synthetic/ImageNet |
| PyTorch DeepSpeed     | pytorch-deepspeed.sh     | synthetic/ImageNet |
| run_clm               | pytorch-clm.sh           | WikiText-2         |
| TensorFlow CNN        | tensorflow-cnn.sh        | synthetic/ImageNet |

The different benchmarks are described below in more detail.

PyTorch synthetic

Originally based on Horovod's example script of the same name. Note that the original script used a single fixed random batch which was fed to the network over and over. Some systems and setups are able to optimize this scenario, giving unrealistically good results. We have modified the script to generate a new random batch for each step.
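
As a sketch of the difference, the training loop now draws a fresh random batch on every step; the shapes and hyperparameters below are illustrative, not the benchmark's actual settings:

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The original Horovod script created one random batch before the loop
# and reused it on every step, which some setups can over-optimize.
for step in range(100):
    # Modified behaviour: a fresh random batch on every step
    data = torch.randn(32, 3, 224, 224, device="cuda")
    target = torch.randint(0, 1000, (32,), device="cuda")
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
```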

Runs with the "resnet50" model by default, but also supports "inception_v3" and other models from torchvision.models.

Run example with single GPU:

sbatch slurm/mahti-gpu1.sh pytorch-synthetic.sh

Run example with 4 GPUs. Note that you can also pass additional arguments to the Python script:

sbatch slurm/mahti-gpu4.sh pytorch-synthetic.sh --batch-size=32

Using 8 GPUs (i.e., 2 nodes) with Horovod and MPI (not supported in newer PyTorch installations):

sbatch slurm/mahti-gpu8-mpi.sh pytorch-synthetic.sh
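
The Horovod variant follows the usual pattern of one MPI process per GPU; a minimal sketch of that general pattern (not the benchmark's exact code, and, as noted above, Horovod is not supported in newer PyTorch installations):

```python
import horovod.torch as hvd
import torch
import torchvision.models as models

hvd.init()                               # one MPI process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin each process to its GPU

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradients are averaged across all processes
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
# Start all processes from identical weights
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```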

PyTorch DDP

PyTorch benchmark using Distributed Data Parallel for handling multiple GPUs.

[Figure: PyTorch DDP results chart]
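
However the processes are launched (e.g., torchrun or mp.spawn), each one ends up running roughly the following setup; this is a sketch of the general DDP pattern, not the benchmark's exact code, and the environment variables are assumed to be set by the launcher:

```python
import os
import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
# LOCAL_RANK are assumed to be set by the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = models.resnet50().cuda()
model = DDP(model, device_ids=[local_rank])
# ...train as usual; DDP all-reduces gradients across all processes
```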

Run example with 4 GPUs on Puhti using synthetic data:

sbatch slurm/puhti-gpu4.sh pytorch-ddp.sh

Run example with 8 GPUs (on 2 nodes) using real ImageNet data:

sbatch slurm/puhti-gpu8.sh pytorch-ddp.sh --data

Run example with 8 GPUs (2 nodes) with fp16:

sbatch slurm/puhti-gpu8.sh pytorch-ddp.sh --fp16
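
In PyTorch, fp16 training of this kind is typically implemented with torch.cuda.amp mixed precision; a sketch of the general pattern (illustrative only, not necessarily how the benchmark implements --fp16):

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid fp16 underflow

for step in range(100):
    data = torch.randn(32, 3, 224, 224, device="cuda")
    target = torch.randint(0, 1000, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = criterion(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```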

PyTorch DDP with Lightning

PyTorch Lightning example using DDP. Runs with the "resnet50" model by default, but also supports "inception_v3" and other models from torchvision.models.

[Figure: PyTorch DDP Lightning results chart]

DDP in Lightning (as of PyTorch 1.13) needs to be run as a single task per GPU:

sbatch slurm/puhti-gpu4-mpi.sh pytorch-ddp-lightning.sh  # single node
sbatch slurm/puhti-gpu8-mpi.sh pytorch-ddp-lightning.sh  # two nodes

The script supports the --data option to use real ImageNet data instead of synthetic data, and --fp16 to enable 16-bit precision for some operations.
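
For reference, a Lightning Trainer configured for this kind of launch looks roughly like the following; the values are assumptions matching a puhti-gpu8-mpi.sh run (2 nodes, 4 GPUs each), not copied from the benchmark:

```python
import pytorch_lightning as pl

# Illustrative Trainer setup for one Slurm task per GPU across 2 nodes;
# all values here are assumptions, not the benchmark's actual settings.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # GPUs (and tasks) per node
    num_nodes=2,
    strategy="ddp",
    precision=16,     # what a --fp16 style option would enable
    max_epochs=1,
)
# trainer.fit(model, datamodule)  # model: a pl.LightningModule
```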

PyTorch DeepSpeed

[Figure: PyTorch DeepSpeed results chart]

DeepSpeed example, 4 GPUs with synthetic data (note: one node = one task):

sbatch slurm/puhti-gpu4.sh pytorch-deepspeed.sh

8 GPUs, 2 nodes with ImageNet data (note: one GPU = one task):

sbatch slurm/puhti-gpu8-mpi.sh pytorch-deepspeed.sh --data
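
DeepSpeed wraps the model and optimizer in a single deepspeed.initialize call driven by a config; a minimal sketch of that general pattern (the config values here are illustrative assumptions, not the benchmark's actual settings):

```python
import deepspeed
import torchvision.models as models

# Illustrative DeepSpeed config; the real benchmark's settings differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
}

model = models.resnet50()
# initialize() sets up distributed training and wraps the model;
# rank and world size come from the launcher environment.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then uses model_engine(...), model_engine.backward(loss)
# and model_engine.step() instead of the plain PyTorch calls.
```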

run_clm

Fine-tunes a GPT-like model on WikiText-2, using the run_clm.py script directly from the Hugging Face language modeling examples.

[Figure: PyTorch run_clm results chart]

Run example with a full node of GPUs (in this case 8 GPUs on LUMI):

sbatch slurm/lumi-gpu8.sh pytorch-clm.sh

Run example with two full nodes of GPUs (in this case 16 GPUs on LUMI):

sbatch slurm/lumi-gpu16.sh pytorch-clm.sh

TensorFlow CNN

Uses tf_cnn_benchmarks.py directly from TensorFlow's GitHub (included here as a git submodule).

Run example:

sbatch slurm/mahti-gpu1.sh tensorflow-cnn.sh

Horovod:

sbatch slurm/mahti-gpu8-mpi.sh tensorflow-cnn.sh

With real data:

sbatch slurm/mahti-gpu1.sh tensorflow-cnn.sh --data

Horovod with real data:

sbatch slurm/mahti-gpu8-mpi.sh tensorflow-cnn.sh --data