Telling your MPI process manager to bind and/or map processes to your hardware can improve performance. See the "Hints for Performance Tuning" section of the PETSc/TAO User's Manual.
First, it is a good idea to install hwloc, including the program lstopo, which can display your machine's "hardware topology". (A hwloc package may be available from your package manager.) Then do
$ lstopo
to get a graphical view of the layout of sockets, cores, threads, and cache memory on your system.
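On a machine without a graphical display, a text summary serves the same purpose. The following is a minimal sketch, assuming a Linux system: it uses hwloc's lstopo-no-graphics if that program is installed, otherwise lscpu (from util-linux), and otherwise it only counts the logical processors listed in /proc/cpuinfo. The function name show_topology is illustrative.

```shell
#!/bin/sh
# Print a text summary of the hardware topology with whichever tool is present.
show_topology() {
    if command -v lstopo-no-graphics >/dev/null 2>&1; then
        lstopo-no-graphics     # hwloc: text rendering of the topology tree
    elif command -v lscpu >/dev/null 2>&1; then
        lscpu                  # util-linux: sockets, cores, threads, caches, NUMA nodes
    else
        # Last resort (Linux only): count the logical processors.
        echo "logical processors: $(grep -c '^processor' /proc/cpuinfo)"
    fi
}

show_topology
```

The counts of sockets, NUMA nodes, and cores per socket reported here are what the -map-by and -bind-to options used later refer to.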
Next, try running the streams benchmark. This can be done with or without process-placement options:
$ cd $PETSC_DIR
$ export PETSC_ARCH=linux-c-opt # use a --with-debugging=0 build
$ make streams
$ make streams MPI_BINDING="-map-by numa -bind-to core"
The results typically suggest that a memory-bandwidth-limited computation will already saturate the memory bandwidth even for small numbers of processes; see the "Hints" Manual section. While streams does almost no computation compared to its memory transfers, even the numerical solution of a PDE is often memory-bandwidth-limited.
For multi-socket compute nodes, consider this example using a code from Chapter 6. Compare the timing of these two runs:
$ cd ~/p4pdes/c/ch6/
$ export PETSC_ARCH=linux-c-opt # use a --with-debugging=0 build
$ make fish
$ mpiexec -n P ./fish -da_refine L -pc_type mg -pc_mg_levels L
$ mpiexec -n P -map-by socket -bind-to core ./fish -da_refine L -pc_type mg -pc_mg_levels L
To get the timing, append -log_view |grep "Time (sec):" to each command. Here P is at most the number of physical cores on your node, and L is large enough to give many seconds of run time but small enough to fit in memory (and your patience). For example, try L=9. For larger P values you may need to set -pc_mg_levels L-1 or -pc_mg_levels L-2 to further reduce the depth of the multigrid (V) cycles so that parallel DMDA-based multigrid can solve the coarse grid problem. (See Chapters 6--8 regarding parallel multigrid.)
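The comparison of the two runs can be scripted. The sketch below is one way to do it, not the only one: it assumes that the "Time (sec):" line of the -log_view output has the maximum wall-clock time as its third whitespace-separated field, and it uses P=4 and L=9 purely as example values. The helper names extract_time and time_run are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read -log_view output on stdin and print the third
# field of the "Time (sec):" line (assumed to be the max time over ranks).
extract_time() {
    awk '/Time \(sec\):/ { print $3; exit }'
}

# Run ./fish on P processes at refinement level L and print the time.
# Usage: time_run P L [mpiexec placement options...]
time_run() {
    P=$1; L=$2; shift 2
    mpiexec -n "$P" "$@" ./fish -da_refine "$L" -pc_type mg -pc_mg_levels "$L" \
        -log_view | extract_time
}

# Only attempt the runs if fish has been built in the current directory.
if [ -x ./fish ]; then
    echo "default placement:        $(time_run 4 9) s"
    echo "map by socket, bind core: $(time_run 4 9 -map-by socket -bind-to core) s"
fi
```

Run it from the ch6 directory after make fish. Repeating each run a few times gives a sense of the run-to-run variation before drawing conclusions.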
Alternative mappings/bindings are:
-map-by numa -bind-to core
-map-by core -bind-to hwthread
Experimentation is in order.
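One way to organize such experiments is a loop over the placement options. The following is a sketch under stated assumptions: the variables P, L, and RUN and the function name try_placements are illustrative, and the defaults P=4 and L=9 are example values only. By default it just prints the commands; setting RUN=1 executes them and shows the timing line.

```shell
#!/bin/sh
# Illustrative settings; override from the environment, e.g. "P=8 RUN=1 sh script.sh".
P=${P:-4}
L=${L:-9}
RUN=${RUN:-0}

# Print (and, if RUN=1, execute) one fish run per placement option.
try_placements() {
    for opts in "" \
                "-map-by socket -bind-to core" \
                "-map-by numa -bind-to core" \
                "-map-by core -bind-to hwthread"; do
        cmd="mpiexec -n $P $opts ./fish -da_refine $L -pc_type mg -pc_mg_levels $L -log_view"
        echo "== $cmd"
        if [ "$RUN" = 1 ]; then
            # $cmd is deliberately unquoted so that the shell splits it into words.
            $cmd | grep "Time (sec):"
        fi
    done
}

try_placements
```

The empty first entry is the run with the process manager's default placement, which is the baseline the other runs are compared against.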
For a single-node, multi-socket machine (e.g. some workstations), here is a recommended setting for performance:
$ mpiexec -n P -map-by numa -bind-to core ./fish -da_refine L -pc_type mg -pc_mg_levels L
Why does such mapping and binding improve performance? And why are these not the default settings for mpiexec? Read the PETSc/TAO User's Manual, and see the processor affinity Wikipedia page for why processor affinity can reduce cache problems. See also the MPICH Developer Documentation or the Open MPI mpirun man page.
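To see what a given binding actually does, each launched process can report the CPUs it is allowed to run on. The following is a minimal sketch, assuming Linux, where the kernel exposes this as the Cpus_allowed_list field of /proc/self/status. If that field is unavailable, it falls back to nproc (from GNU coreutils), which prints the number of CPUs usable by the current process. The function name show_affinity is illustrative.

```shell
#!/bin/sh
# Report which CPUs the calling process may run on (Linux only).
show_affinity() {
    grep -s Cpus_allowed_list /proc/self/status || echo "usable CPUs: $(nproc)"
}

show_affinity

# Under a process manager, each rank reports its own CPU list, for example:
#   mpiexec -n 4 -map-by numa -bind-to core grep Cpus_allowed_list /proc/self/status
```

Without binding, every rank typically reports all CPUs, which means the operating system is free to migrate it between cores. With -bind-to core, each rank should report only the hardware threads of a single core.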