gpu: Add accelerate example
simo-tuomisto committed Oct 10, 2024
1 parent db97eaa commit e3014c5
Showing 5 changed files with 98 additions and 5 deletions.
6 changes: 6 additions & 0 deletions content/examples/accelerate_cuda.def
@@ -0,0 +1,6 @@
Bootstrap: docker
From: pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

%post

# Install the Python packages needed by the accelerate training example
pip install accelerate evaluate datasets scipy scikit-learn transformers
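
A quick sanity check after the image is built (a sketch: the ``accelerate_cuda.sif`` name comes from the build step later in this commit) is to confirm that the added packages import inside the container:

.. code-block:: console

   $ apptainer exec accelerate_cuda.sif python -c 'import accelerate, evaluate, datasets, transformers; print(accelerate.__version__)'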
16 changes: 16 additions & 0 deletions content/examples/run_accelerate_parallel.sh
@@ -0,0 +1,16 @@
#!/bin/bash
#SBATCH --mem=32G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=12
#SBATCH --time=00:10:00
#SBATCH --output=accelerate_run_parallel.out

# Give each worker an equal share of the CPU cores (here: 12 CPUs / 2 GPUs = 6 threads per process)
export OMP_NUM_THREADS=$(( $SLURM_CPUS_PER_TASK / $SLURM_GPUS_ON_NODE ))

# Launch one worker per GPU inside the container; --nv exposes the host's GPUs
apptainer exec --nv accelerate_cuda.sif \
    torchrun \
    --nproc_per_node $SLURM_GPUS_ON_NODE \
    ./nlp_example.py \
    --mixed_precision fp16
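
For reference, the same training run could likely also be started with Accelerate's own launcher instead of ``torchrun``. This is a sketch, assuming ``accelerate launch``'s ``--multi_gpu`` and ``--num_processes`` flags map onto the ``torchrun`` options above:

.. code-block:: console

   $ apptainer exec --nv accelerate_cuda.sif \
       accelerate launch \
       --multi_gpu \
       --num_processes $SLURM_GPUS_ON_NODE \
       ./nlp_example.py --mixed_precision fp16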
2 changes: 0 additions & 2 deletions content/examples/run_lammps_indent.sh
@@ -13,7 +13,5 @@ cd indent
# Load OpenMPI module
module load openmpi

export PMIX_MCA_gds=hash

# Run simulation
srun apptainer run ../lammps-openmpi.sif -in in.indent
57 changes: 55 additions & 2 deletions content/gpus.rst
@@ -22,9 +22,11 @@ When using NVIDIA's GPUs that use the CUDA-framework the flag is ``--nv``.

As an example, let's get a CUDA-enabled PyTorch-image:

.. code-block:: console
:download:`accelerate_cuda.def </examples/accelerate_cuda.def>`:

.. literalinclude:: /examples/accelerate_cuda.def
   :language: singularity

$ apptainer pull pytorch-cuda.sif docker://docker.io/pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

Now when we launch the image, we can give the image GPU access with

@@ -85,6 +87,57 @@ Now when we launch the image, we can give the image GPU access with
$ apptainer exec --rocm pytorch-rocm.sif python -c 'import torch; print(torch.cuda.is_available())'
True
Example container: Model training with accelerate
*************************************************

`Accelerate <https://huggingface.co/docs/accelerate/en/index>`__
is a library designed for running distributed PyTorch code.

Let's create a container that can run a simple training example
that utilizes multiple GPUs.

The container starts from an existing image with PyTorch installed
and installs a few missing Python packages:

:download:`accelerate_cuda.def </examples/accelerate_cuda.def>`:

.. literalinclude:: /examples/accelerate_cuda.def
   :language: singularity

The submission script that launches the container looks like this:

:download:`run_accelerate_parallel.sh </examples/run_accelerate_parallel.sh>`:

.. literalinclude:: /examples/run_accelerate_parallel.sh
   :language: slurm
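
The script assumes that ``nlp_example.py`` is present in the working directory. It is one of the stock examples shipped with the Accelerate repository and can be fetched beforehand, for example (raw URL assumed from the upstream repository layout):

.. code-block:: console

   $ wget https://raw.githubusercontent.com/huggingface/accelerate/main/examples/nlp_example.py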

.. tabs::

.. tab:: Triton (Aalto)

To build the image:

.. code-block:: console

   $ srun --mem=32G --cpus-per-task=4 --time=01:00:00 apptainer build accelerate_cuda.sif accelerate_cuda.def

To run the example:

.. code-block:: console

   $ sbatch run_accelerate_parallel.sh
   $ cat accelerate_run_parallel.out
   Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
   You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
   You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
   epoch 0: {'accuracy': 0.7598039215686274, 'f1': 0.8032128514056225}
   epoch 1: {'accuracy': 0.8480392156862745, 'f1': 0.8931034482758621}
   epoch 2: {'accuracy': 0.8406862745098039, 'f1': 0.888507718696398}
Review of this session
**********************

.. admonition:: Key points to remember

- Code inside the container image needs to support GPU calculations.
22 changes: 21 additions & 1 deletion content/mpi.rst
@@ -382,14 +382,18 @@ on materials modeling.
Let's build a container with LAMMPS in it:


:download:`lammps-openmpi.def </examples/lammps-openmpi.def>`:

.. literalinclude:: /examples/lammps-openmpi.def
   :language: singularity

Let's also create a submission script that runs a LAMMPS example
where an indenter pushes against a material:

:download:`run_lammps_indent.sh </examples/run_lammps_indent.sh>`:

.. literalinclude:: /examples/run_lammps_indent.sh
   :language: singularity
   :language: slurm

Now this exact same container can be run on both Triton and Puhti, which have
OpenMPI installed, because both clusters use Slurm and InfiniBand
@@ -399,6 +403,14 @@ interconnects.
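
Before submitting, it can be worth checking that the MPI stack inside the container is compatible with the one on the host; a minimal sketch (version output will vary by cluster):

.. code-block:: console

   $ module load openmpi
   $ mpirun --version
   $ apptainer exec lammps-openmpi.sif mpirun --version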

.. tab:: Triton (Aalto)

To build the image:

.. code-block:: console

   $ srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build lammps-openmpi.sif lammps-openmpi.def

To run the example:

.. code-block:: console

   $ export PMIX_MCA_gds=hash
@@ -436,6 +448,14 @@ interconnects.
.. tab:: Puhti (CSC)

To build the image:

.. code-block:: console

   $ apptainer build lammps-openmpi.sif lammps-openmpi.def

To run the example:

.. code-block:: console

   $ export PMIX_MCA_gds=hash
