gpu: Add accelerate example
simo-tuomisto committed Oct 10, 2024
1 parent db97eaa commit e3014c5
Showing 5 changed files with 98 additions and 5 deletions.
6 changes: 6 additions & 0 deletions content/examples/accelerate_cuda.def
@@ -0,0 +1,6 @@
Bootstrap: docker
From: pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

%post

# Install the Python packages needed by the accelerate training example
pip install accelerate evaluate datasets scipy scikit-learn transformers
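
A quick sanity check after the image is built (a sketch: the ``accelerate_cuda.sif`` name comes from the build step later in this commit) is to confirm that the added packages import inside the container:

.. code-block:: console

   $ apptainer exec accelerate_cuda.sif python -c 'import accelerate, evaluate, datasets, transformers; print(accelerate.__version__)'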
16 changes: 16 additions & 0 deletions content/examples/run_accelerate_parallel.sh
@@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --mem=32G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=12
#SBATCH --time=00:10:00
#SBATCH --output=accelerate_run_parallel.out

# Give each worker an equal share of the CPU cores (here: 12 CPUs / 2 GPUs = 6 threads per process)
export OMP_NUM_THREADS=$(( $SLURM_CPUS_PER_TASK / $SLURM_GPUS_ON_NODE ))

# Launch one worker per GPU inside the container; --nv exposes the host's GPUs
apptainer exec --nv accelerate_cuda.sif \
    torchrun \
    --nproc_per_node $SLURM_GPUS_ON_NODE \
    ./nlp_example.py \
    --mixed_precision fp16
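
For reference, the same training run could likely also be started with Accelerate's own launcher instead of ``torchrun``. This is a sketch, assuming ``accelerate launch``'s ``--multi_gpu`` and ``--num_processes`` flags map onto the ``torchrun`` options above:

.. code-block:: console

   $ apptainer exec --nv accelerate_cuda.sif \
       accelerate launch \
       --multi_gpu \
       --num_processes $SLURM_GPUS_ON_NODE \
       ./nlp_example.py --mixed_precision fp16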
2 changes: 0 additions & 2 deletions content/examples/run_lammps_indent.sh
@@ -13,7 +13,5 @@ cd indent
# Load OpenMPI module
module load openmpi

export PMIX_MCA_gds=hash

# Run simulation
srun apptainer run ../lammps-openmpi.sif -in in.indent
57 changes: 55 additions & 2 deletions content/gpus.rst
@@ -22,9 +22,11 @@ When using NVIDIA's GPUs that use the CUDA-framework the flag is ``--nv``.

As an example, let's get a CUDA-enabled PyTorch-image:

.. code-block:: console
:download:`accelerate_cuda.def </examples/accelerate_cuda.def>`:

.. literalinclude:: /examples/accelerate_cuda.def
   :language: singularity

$ apptainer pull pytorch-cuda.sif docker://docker.io/pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

Now when we launch the image, we can give the image GPU access with

@@ -85,6 +87,57 @@ Now when we launch the image, we can give the image GPU access with
$ apptainer exec --rocm pytorch-rocm.sif python -c 'import torch; print(torch.cuda.is_available())'
True
Example container: Model training with accelerate
*************************************************

`Accelerate <https://huggingface.co/docs/accelerate/en/index>`__
is a library designed for running distributed PyTorch code.

Let's create a container that can run a simple training example
that utilizes multiple GPUs.

The container starts from an existing image with PyTorch installed
and installs a few missing Python packages:

:download:`accelerate_cuda.def </examples/accelerate_cuda.def>`:

.. literalinclude:: /examples/accelerate_cuda.def
   :language: singularity

The submission script that launches the container looks like this:

:download:`run_accelerate_parallel.sh </examples/run_accelerate_parallel.sh>`:

.. literalinclude:: /examples/run_accelerate_parallel.sh
   :language: slurm
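
The script assumes that ``nlp_example.py`` is present in the working directory. It is one of the stock examples shipped with the Accelerate repository and can be fetched beforehand, for example (raw URL assumed from the upstream repository layout):

.. code-block:: console

   $ wget https://raw.githubusercontent.com/huggingface/accelerate/main/examples/nlp_example.py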

.. tabs::

.. tab:: Triton (Aalto)

To build the image:

.. code-block:: console

   $ srun --mem=32G --cpus-per-task=4 --time=01:00:00 apptainer build accelerate_cuda.sif accelerate_cuda.def

To run the example:

.. code-block:: console

   $ sbatch run_accelerate_parallel.sh
   $ cat accelerate_run_parallel.out
   Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
   You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
   You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
   epoch 0: {'accuracy': 0.7598039215686274, 'f1': 0.8032128514056225}
   epoch 1: {'accuracy': 0.8480392156862745, 'f1': 0.8931034482758621}
   epoch 2: {'accuracy': 0.8406862745098039, 'f1': 0.888507718696398}
Review of this session
**********************

.. admonition:: Key points to remember

- Code inside the container image needs to support GPU calculations.
22 changes: 21 additions & 1 deletion content/mpi.rst
@@ -382,14 +382,18 @@ on materials modeling.
Let's build a container with LAMMPS in it:


:download:`lammps-openmpi.def </examples/lammps-openmpi.def>`:

.. literalinclude:: /examples/lammps-openmpi.def
   :language: singularity

Let's also create a submission script that runs a LAMMPS example
where an indenter pushes against a material:

:download:`run_lammps_indent.sh </examples/run_lammps_indent.sh>`:

.. literalinclude:: /examples/run_lammps_indent.sh
   :language: singularity
   :language: slurm

Now this exact same container can be run on both Triton and Puhti, which have
OpenMPI installed, because both clusters use Slurm and InfiniBand
@@ -399,6 +403,14 @@ interconnects.
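
Before submitting, it can be worth checking that the MPI stack inside the container is compatible with the one on the host; a minimal sketch (version output will vary by cluster):

.. code-block:: console

   $ module load openmpi
   $ mpirun --version
   $ apptainer exec lammps-openmpi.sif mpirun --version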

.. tab:: Triton (Aalto)

To build the image:

.. code-block:: console

   $ srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build lammps-openmpi.sif lammps-openmpi.def

To run the example:

.. code-block:: console

   $ export PMIX_MCA_gds=hash
@@ -436,6 +448,14 @@ interconnects.
.. tab:: Puhti (CSC)

To build the image:

.. code-block:: console

   $ apptainer build lammps-openmpi.sif lammps-openmpi.def

To run the example:

.. code-block:: console

   $ export PMIX_MCA_gds=hash
