process placement again: emphasize streams, recommend "-map-by numa -bind-to core"

Showing 1 changed file with 27 additions and 11 deletions.

Hardware topology: process placement for performance
----------------------------------------------------

Telling your MPI process manager to bind and/or map processes to your hardware will improve performance. See the ["Hints for Performance Tuning" section of the PETSc/TAO User's Manual](https://petsc.org/release/docs/manual/performance/).

### Understanding your hardware

First, it is a good idea to install [`hwloc`](https://www.open-mpi.org/projects/hwloc/), including the program [`lstopo`](https://www.open-mpi.org/projects/hwloc/lstopo/), which can display your machine's "hardware topology". (A `hwloc` package may be available from your package manager.) Then do

    $ lstopo

to get a graphical view of the layout of sockets, cores, threads, and cache memory on your system.
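
If you are working over `ssh` or otherwise without a graphical display, `lstopo` can also print the topology as text. This is only a sketch; the output formats available depend on how your `hwloc` package was built, so check `lstopo --help`:

    $ lstopo --of console     # text rendering of the topology in the terminal
    $ lstopo topo.png         # or write the graphical view to an image file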

Next, try running the [streams benchmark](https://www.cs.virginia.edu/stream/ref.html). This can be done with or without process-placement options:

    $ cd $PETSC_DIR
    $ export PETSC_ARCH=linux-c-opt    # use a --with-debugging=0 build
    $ make streams
    $ make streams MPI_BINDING="-map-by numa -bind-to core"
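
To compare several placements in one pass, a small shell loop over `MPI_BINDING` values can be used. This is only a sketch (assuming a `bash`-compatible shell), reusing the same `make streams` target shown above:

    $ for b in "" "-map-by numa -bind-to core" "-map-by core -bind-to hwthread"; do
    >     echo "=== MPI_BINDING=$b ==="       # label each run with its binding
    >     make streams MPI_BINDING="$b"
    > done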

The results typically show that a memory-bandwidth-limited computation saturates the available memory bandwidth even at small numbers of processes; see the ["Hints" Manual section](https://petsc.org/release/docs/manual/performance/). While [streams](https://www.cs.virginia.edu/stream/ref.html) does almost no computation compared to its memory transfers, even the numerical solution of a PDE is often memory-bandwidth-limited.

### Multisocket example

For multisocket compute nodes, consider this example using a code from Chapter 6. Build `fish` and then compare the timing of the two `mpiexec` runs:

    $ cd ~/p4pdes/c/ch6/
    $ export PETSC_ARCH=linux-c-opt    # use a --with-debugging=0 build
    $ make fish
    $ mpiexec -n P ./fish -da_refine L -pc_type mg -pc_mg_levels L
    $ mpiexec -n P -map-by socket -bind-to core ./fish -da_refine L -pc_type mg -pc_mg_levels L

Generate the timing by adding `-log_view | grep "Time (sec):"`. Here `P` is at most the number of physical cores on your node and `L` is large enough to give many seconds of run time, but small enough to fit in memory (and your patience). For example, try `L=9`. For larger `P` values you may need to set `-pc_mg_levels L-1` or `-pc_mg_levels L-2` to further reduce the depth of the multigrid (V) cycles so that parallel DMDA-based multigrid can solve the coarse-grid problem. (See Chapters 6--8 regarding parallel multigrid.)

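For instance, with the hypothetical concrete values `P=8` and `L=9` (adjust both to your machine), the comparison looks like this:

    $ mpiexec -n 8 ./fish -da_refine 9 -pc_type mg -pc_mg_levels 9 -log_view | grep "Time (sec):"
    $ mpiexec -n 8 -map-by socket -bind-to core ./fish -da_refine 9 -pc_type mg -pc_mg_levels 9 -log_view | grep "Time (sec):"
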
Alternative mappings/bindings are:

* `-map-by numa -bind-to core`
* `-map-by core -bind-to hwthread`

Experimentation is in order; one way to check the resulting placement is sketched below.

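To confirm where ranks actually land, Open MPI's `mpiexec` accepts a `--report-bindings` flag which prints each rank's binding at startup. (This flag is Open MPI-specific; with an MPICH-based `mpiexec`, consult your process manager's documentation for an equivalent.) A sketch:

    $ mpiexec -n P --report-bindings -map-by numa -bind-to core ./fish -da_refine L -pc_type mg -pc_mg_levels L
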
### Single node example

For a single-node, multisocket machine (e.g. some workstations), here is a recommended setting for performance:

    $ mpiexec -n P -map-by numa -bind-to core ./fish -da_refine L -pc_type mg -pc_mg_levels L

### More information

_Why_ does such mapping and binding improve performance? Also, _why_ are these not default settings for `mpiexec`? Read the [PETSc/TAO User's Manual](https://petsc.org/release/docs/manual/), but also see how processor affinity can effectively reduce cache problems at the [processor affinity Wikipedia page](https://en.wikipedia.org/wiki/Processor_affinity). See also the [MPICH Developer Documentation](https://github.com/pmodels/mpich/blob/main/doc/wiki/Index.md) or the [Open MPI mpirun man page](https://www.open-mpi.org/doc/current/man1/mpirun.1.php).