Docs: Lonestar6 GPUs (TACC) (#3673)
Add documentation on how to run on Lonestar6 GPUs (TACC).
ax3l authored Sep 23, 2024
1 parent 27a7248 commit d0c3040
Showing 5 changed files with 408 additions and 0 deletions.
1 change: 1 addition & 0 deletions Docs/source/install/hpc.rst
@@ -43,6 +43,7 @@ This section documents quick-start guides for a selection of supercomputers that
   hpc/lassen
   hpc/lawrencium
   hpc/leonardo
   hpc/lonestar6
   hpc/lumi
   hpc/lxplus
   hpc/ookami
139 changes: 139 additions & 0 deletions Docs/source/install/hpc/lonestar6.rst
@@ -0,0 +1,139 @@
.. _building-lonestar6:

Lonestar6 (TACC)
================

The `Lonestar6 cluster <https://portal.tacc.utexas.edu/user-guides/lonestar6>`_ is located at `TACC <https://www.tacc.utexas.edu>`__.


Introduction
------------

If you are new to this system, **please see the following resources**:

* `TACC user guide <https://portal.tacc.utexas.edu/user-guides/>`__
* Batch system: `Slurm <https://portal.tacc.utexas.edu/user-guides/lonestar6#job-management>`__
* `Jupyter service <https://tacc.github.io/ctls2017/docs/intro_to_python/intro_to_python_011_jupyter.html>`__
* `Filesystem directories <https://portal.tacc.utexas.edu/user-guides/lonestar6#managing-files-on-lonestar6>`__:

  * ``$HOME``: per-user home directory, backed up (10 GB)
  * ``$WORK``: per-user production directory, not backed up, not purged, Lustre (1 TB)
  * ``$SCRATCH``: per-user production directory, not backed up, files not accessed for 10 days are purged, Lustre (no quota, 8 PB total); a quick check of these paths is sketched below
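
A minimal sketch for checking these paths and your Lustre usage from a login node (``lfs quota`` is the standard Lustre usage query):

.. code-block:: bash

   echo "HOME:    $HOME"
   echo "WORK:    $WORK"
   echo "SCRATCH: $SCRATCH"

   # Lustre usage and quota for your user on the $WORK filesystem
   lfs quota -u $USER $WORK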


Installation
------------

Use the following commands to download the WarpX source code and switch to the correct branch:

.. code-block:: bash

   git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx

We use system software modules; environment hints and further dependencies are added via the file ``$HOME/lonestar6_warpx_a100.profile``.
Create it now:

.. code-block:: bash

   cp $HOME/src/warpx/Tools/machines/lonestar6-tacc/lonestar6_warpx_a100.profile.example $HOME/lonestar6_warpx_a100.profile

.. dropdown:: Script Details
   :color: light
   :icon: info
   :animate: fade-in-slide-down

   .. literalinclude:: ../../../../Tools/machines/lonestar6-tacc/lonestar6_warpx_a100.profile.example
      :language: bash

Edit the 2nd line of this script, which sets the ``export proj=""`` variable.
For example, if you are a member of the project ``abcde``, then run ``nano $HOME/lonestar6_warpx_a100.profile`` and edit line 2 to read:

.. code-block:: bash

   export proj="abcde"

Exit the ``nano`` editor with ``Ctrl`` + ``O`` (save) and then ``Ctrl`` + ``X`` (exit).
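
Alternatively, the project can be set without opening an editor; a minimal sketch using ``sed`` (replace ``abcde`` with your project name):

.. code-block:: bash

   # overwrite the 'export proj=...' line of the profile, then verify it
   sed -i 's/^export proj=.*/export proj="abcde"/' $HOME/lonestar6_warpx_a100.profile
   grep "export proj" $HOME/lonestar6_warpx_a100.profile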

.. important::

   Now, and as the first step on future logins to Lonestar6, activate these environment settings:

   .. code-block:: bash

      source $HOME/lonestar6_warpx_a100.profile

Finally, since Lonestar6 does not yet provide software modules for some of our dependencies, install them once:

.. code-block:: bash

   bash $HOME/src/warpx/Tools/machines/lonestar6-tacc/install_a100_dependencies.sh
   source ${SW_DIR}/venvs/warpx-a100/bin/activate

.. dropdown:: Script Details
   :color: light
   :icon: info
   :animate: fade-in-slide-down

   .. literalinclude:: ../../../../Tools/machines/lonestar6-tacc/install_a100_dependencies.sh
      :language: bash


.. _building-lonestar6-compilation:

Compilation
-----------

Use the following :ref:`cmake commands <building-cmake>` to compile the application executable:

.. code-block:: bash

   cd $HOME/src/warpx
   rm -rf build_gpu

   cmake -S . -B build_gpu -DWarpX_COMPUTE=CUDA -DWarpX_FFT=ON -DWarpX_HEFFTE=ON -DWarpX_QED_TABLE_GEN=ON -DWarpX_DIMS="1;2;RZ;3"
   cmake --build build_gpu -j 16

The WarpX application executables are now in ``$HOME/src/warpx/build_gpu/bin/``.
Additionally, the following commands will install WarpX as a Python module:

.. code-block:: bash

   cd $HOME/src/warpx
   rm -rf build_gpu_py

   cmake -S . -B build_gpu_py -DWarpX_COMPUTE=CUDA -DWarpX_FFT=ON -DWarpX_HEFFTE=ON -DWarpX_QED_TABLE_GEN=ON -DWarpX_APP=OFF -DWarpX_PYTHON=ON -DWarpX_DIMS="1;2;RZ;3"
   cmake --build build_gpu_py -j 16 --target pip_install

Now, you can :ref:`submit Lonestar6 compute jobs <running-cpp-lonestar6>` for WarpX :ref:`Python (PICMI) scripts <usage-picmi>` (:ref:`example scripts <usage-examples>`).
Or, you can use the WarpX executables to submit Lonestar6 jobs (:ref:`example inputs <usage-examples>`).
For executables, you can reference their location in your :ref:`job script <running-cpp-lonestar6>` or copy them to a location in ``$WORK`` or ``$SCRATCH``.
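
For example, a run directory in ``$WORK`` could be prepared as follows (a minimal sketch; the run directory name is hypothetical and the exact executable file name in ``build_gpu/bin/`` depends on your build options):

.. code-block:: bash

   # hypothetical run directory on the $WORK filesystem
   mkdir -p $WORK/runs/my_first_run
   cd $WORK/runs/my_first_run

   # copy a 3D executable (exact name depends on build options), your inputs file,
   # and the job script template shipped with WarpX
   cp $HOME/src/warpx/build_gpu/bin/warpx.3d.* .
   cp <path to your inputs file> .
   cp $HOME/src/warpx/Tools/machines/lonestar6-tacc/lonestar6_a100.sbatch .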


.. _running-cpp-lonestar6:

Running
-------

.. _running-cpp-lonestar6-A100-GPUs:

A100 GPUs (40 GB)
^^^^^^^^^^^^^^^^^

`84 GPU nodes, each with 2 A100 GPUs (40 GB) <https://portal.tacc.utexas.edu/user-guides/lonestar6#system-gpu>`__.

The batch script below can be used to run a WarpX simulation on multiple nodes (change ``-N`` accordingly) on the Lonestar6 supercomputer at TACC.
Replace descriptions between chevrons ``<>`` with relevant values; for instance, ``<input file>`` could be ``plasma_mirror_inputs``.
Note that we run one MPI rank per GPU.


.. literalinclude:: ../../../../Tools/machines/lonestar6-tacc/lonestar6_a100.sbatch
   :language: bash
   :caption: You can copy this file from ``Tools/machines/lonestar6-tacc/lonestar6_a100.sbatch``.

To run a simulation, copy the lines above to a file ``lonestar6_a100.sbatch`` and run

.. code-block:: bash

   sbatch lonestar6_a100.sbatch

to submit the job.
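
Once submitted, the job can be monitored with standard Slurm commands; a minimal sketch:

.. code-block:: bash

   squeue -u $USER       # list your pending and running jobs
   tail -f output.txt    # follow the simulation output written by the job script above
   scancel <jobid>       # cancel a job if needed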
168 changes: 168 additions & 0 deletions Tools/machines/lonestar6-tacc/install_a100_dependencies.sh
@@ -0,0 +1,168 @@
#!/bin/bash
#
# Copyright 2023 The WarpX Community
#
# This file is part of WarpX.
#
# Author: Axel Huebl
# License: BSD-3-Clause-LBNL

# Exit on first error encountered #############################################
#
set -eu -o pipefail


# Check: ######################################################################
#
# Was lonestar6_warpx_a100.profile sourced and configured correctly?
if [ -z ${proj-} ]; then echo "WARNING: The 'proj' variable is not yet set in your lonestar6_warpx_a100.profile file! Please edit its line 2 to continue!"; exit 1; fi


# Remove old dependencies #####################################################
#
SW_DIR="${WORK}/sw/lonestar6/sw/lonestar6/a100"
rm -rf ${SW_DIR}
mkdir -p ${SW_DIR}

# remove common user mistakes in python, located in .local instead of a venv
python3 -m pip uninstall -qq -y pywarpx
python3 -m pip uninstall -qq -y warpx
python3 -m pip uninstall -qqq -y mpi4py 2>/dev/null || true


# General extra dependencies ##################################################
#

# tmpfs build directory: avoids issues often seen with $HOME and is faster
build_dir=$(mktemp -d)

# c-blosc (I/O compression)
if [ -d $HOME/src/c-blosc ]
then
cd $HOME/src/c-blosc
git fetch --prune
git checkout v1.21.1
cd -
else
git clone -b v1.21.1 https://github.com/Blosc/c-blosc.git $HOME/src/c-blosc
fi
rm -rf $HOME/src/c-blosc-a100-build
cmake -S $HOME/src/c-blosc -B ${build_dir}/c-blosc-a100-build -DBUILD_TESTS=OFF -DBUILD_BENCHMARKS=OFF -DDEACTIVATE_AVX2=OFF -DCMAKE_INSTALL_PREFIX=${SW_DIR}/c-blosc-1.21.1
cmake --build ${build_dir}/c-blosc-a100-build --target install --parallel 16
rm -rf ${build_dir}/c-blosc-a100-build

# ADIOS2
if [ -d $HOME/src/adios2 ]
then
cd $HOME/src/adios2
git fetch --prune
git checkout v2.8.3
cd -
else
git clone -b v2.8.3 https://github.com/ornladios/ADIOS2.git $HOME/src/adios2
fi
rm -rf $HOME/src/adios2-a100-build
cmake -S $HOME/src/adios2 -B ${build_dir}/adios2-a100-build -DADIOS2_USE_Blosc=ON -DADIOS2_USE_Fortran=OFF -DADIOS2_USE_Python=OFF -DADIOS2_USE_ZeroMQ=OFF -DCMAKE_INSTALL_PREFIX=${SW_DIR}/adios2-2.8.3
cmake --build ${build_dir}/adios2-a100-build --target install -j 16
rm -rf ${build_dir}/adios2-a100-build

# BLAS++ (for PSATD+RZ)
if [ -d $HOME/src/blaspp ]
then
cd $HOME/src/blaspp
git fetch --prune
git checkout v2024.05.31
cd -
else
git clone -b v2024.05.31 https://github.com/icl-utk-edu/blaspp.git $HOME/src/blaspp
fi
rm -rf $HOME/src/blaspp-a100-build
cmake -S $HOME/src/blaspp -B ${build_dir}/blaspp-a100-build -Duse_openmp=OFF -Dgpu_backend=cuda -DCMAKE_CXX_STANDARD=17 -DCMAKE_INSTALL_PREFIX=${SW_DIR}/blaspp-2024.05.31
cmake --build ${build_dir}/blaspp-a100-build --target install --parallel 16
rm -rf ${build_dir}/blaspp-a100-build

# LAPACK++ (for PSATD+RZ)
if [ -d $HOME/src/lapackpp ]
then
cd $HOME/src/lapackpp
git fetch --prune
git checkout v2024.05.31
cd -
else
git clone -b v2024.05.31 https://github.com/icl-utk-edu/lapackpp.git $HOME/src/lapackpp
fi
rm -rf $HOME/src/lapackpp-a100-build
CXXFLAGS="-DLAPACK_FORTRAN_ADD_" cmake -S $HOME/src/lapackpp -B ${build_dir}/lapackpp-a100-build -DCMAKE_CXX_STANDARD=17 -Dbuild_tests=OFF -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON -DCMAKE_INSTALL_PREFIX=${SW_DIR}/lapackpp-2024.05.31
cmake --build ${build_dir}/lapackpp-a100-build --target install --parallel 16
rm -rf ${build_dir}/lapackpp-a100-build

# heFFTe
if [ -d $HOME/src/heffte ]
then
cd $HOME/src/heffte
git fetch --prune
git checkout v2.4.0
cd -
else
git clone -b v2.4.0 https://github.com/icl-utk-edu/heffte.git ${HOME}/src/heffte
fi
rm -rf ${HOME}/src/heffte-a100-build
cmake \
-S ${HOME}/src/heffte \
-B ${build_dir}/heffte-a100-build \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON \
-DCMAKE_INSTALL_PREFIX=${SW_DIR}/heffte-2.4.0 \
-DHeffte_DISABLE_GPU_AWARE_MPI=OFF \
-DHeffte_ENABLE_AVX=OFF \
-DHeffte_ENABLE_AVX512=OFF \
-DHeffte_ENABLE_FFTW=OFF \
-DHeffte_ENABLE_CUDA=ON \
-DHeffte_ENABLE_ROCM=OFF \
-DHeffte_ENABLE_ONEAPI=OFF \
-DHeffte_ENABLE_MKL=OFF \
-DHeffte_ENABLE_DOXYGEN=OFF \
-DHeffte_SEQUENTIAL_TESTING=OFF \
-DHeffte_ENABLE_TESTING=OFF \
-DHeffte_ENABLE_TRACING=OFF \
-DHeffte_ENABLE_PYTHON=OFF \
-DHeffte_ENABLE_FORTRAN=OFF \
-DHeffte_ENABLE_SWIG=OFF \
-DHeffte_ENABLE_MAGMA=OFF
cmake --build ${build_dir}/heffte-a100-build --target install --parallel 16
rm -rf ${build_dir}/heffte-a100-build


# Python ######################################################################
#
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade virtualenv
python3 -m pip cache purge
rm -rf ${SW_DIR}/venvs/warpx-a100
python3 -m venv ${SW_DIR}/venvs/warpx-a100
source ${SW_DIR}/venvs/warpx-a100/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m pip install --upgrade packaging
python3 -m pip install --upgrade wheel
python3 -m pip install --upgrade setuptools
python3 -m pip install --upgrade cython
python3 -m pip install --upgrade numpy
python3 -m pip install --upgrade pandas
python3 -m pip install --upgrade scipy
python3 -m pip install --upgrade mpi4py --no-cache-dir --no-build-isolation --no-binary mpi4py
python3 -m pip install --upgrade openpmd-api
python3 -m pip install --upgrade matplotlib
python3 -m pip install --upgrade yt
# install or update WarpX dependencies
python3 -m pip install --upgrade -r $HOME/src/warpx/requirements.txt
#python3 -m pip install --upgrade cupy-cuda12x # CUDA 12 compatible wheel
# optimas (based on libEnsemble & ax->botorch->gpytorch->pytorch)
#python3 -m pip install --upgrade torch # CUDA 12 compatible wheel
#python3 -m pip install --upgrade optimas[all]


# remove build temporary directory
rm -rf ${build_dir}
41 changes: 41 additions & 0 deletions Tools/machines/lonestar6-tacc/lonestar6_a100.sbatch
@@ -0,0 +1,41 @@
#!/bin/bash -l

# Copyright 2021-2022 Axel Huebl, Kevin Gott
#
# This file is part of WarpX.
#
# License: BSD-3-Clause-LBNL

#SBATCH -t 00:10:00
#SBATCH -N 2
#SBATCH -J WarpX
#SBATCH -A <proj>
#SBATCH -p gpu-a100
#SBATCH --exclusive
#SBATCH --gpu-bind=none
#SBATCH --gpus-per-node=4
#SBATCH -o WarpX.o%j
#SBATCH -e WarpX.e%j

# executable & inputs file or python interpreter & PICMI script here
EXE=./warpx
INPUTS=inputs_small

# pin to closest NIC to GPU
export MPICH_OFI_NIC_POLICY=GPU

# threads for OpenMP and threaded compressors per MPI rank
export SRUN_CPUS_PER_TASK=32

# depends on https://github.com/ECP-WarpX/WarpX/issues/2009
#GPU_AWARE_MPI="amrex.the_arena_is_managed=0 amrex.use_gpu_aware_mpi=1"
GPU_AWARE_MPI=""

# CUDA visible devices are ordered inverse to local task IDs
# Reference: nvidia-smi topo -m
srun --cpu-bind=cores bash -c "
export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
${EXE} ${INPUTS} ${GPU_AWARE_MPI}" \
> output.txt