Benchmarking
There are benchmarks for HAM-Offload and Intel LEO. The LEO benchmarks are not built by default and must be built explicitly via:
$ b2 toolset=intel variant=release -j8 benchmark_intel_leo
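The HAM-Offload benchmarks are part of the default build, so under the same assumptions about toolset and options as above, a plain build without a named target should produce them (this invocation is a sketch based on the b2 call above, not a separately documented target):
$ b2 toolset=intel variant=release -j8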
Scripts for automated benchmarking and generating figures exist and will be available soon.
When benchmarking, pinning processes and threads to hardware threads is important for reproducible results. For multi-socket hosts with Xeon Phi accelerators, the choice of hardware thread on both sides matters as well. The best communication path is between a CPU and the accelerator directly connected to its PCIe root complex. On a system with two CPUs and one accelerator, one of the CPUs will therefore achieve better communication performance with the accelerator (especially regarding latency). In general, the mapping between CPUs and accelerators should reflect the topology of the PCIe interconnect. The fields physical id and core id in /proc/cpuinfo show how hardware threads, cores, and CPUs map to each other. The CPU (NUMA node) to which a Xeon Phi is connected can be found in /sys/class/mic/mic<number>/device/numa_node. On the Xeon Phi, the OS core should be avoided. It is the last physical core, whose 4 hardware threads map to the first and the last three logical cores, e.g. 0, 241, 242, 243 on a 7xxx Xeon Phi with 61 physical cores.
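The following commands sketch how to read that topology information; the fields and the mic0 device path are the ones named above, while the exact output format depends on the kernel and MPSS version:
$ grep -E 'processor|physical id|core id' /proc/cpuinfo
$ cat /sys/class/mic/mic0/device/numa_node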
In the following example, we assume that mic0 is connected to the second 8-core CPU of the host system, so logical core 8 (the first core of the second CPU under a typical contiguous numbering) is used to pin the host process.
For measuring the kernel offloading overhead via (Intel) MPI, run the following:
$ mpirun -n 1 -host localhost -env I_MPI_PIN_PROCESSOR_LIST=8 bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_mpi -c -r 1000000 : -n 1 -host mic0 -env I_MPI_PIN_PROCESSOR_LIST=0 bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_mpi
For measuring the kernel offloading overhead via SCIF:
$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 --ham-cpu-affinity 8 -c -r 1000000 &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1 --ham-cpu-affinity 1
Result (times are in ns):
HAM-Offload function call runtime:
name average median min max variance std_error relative_std_error conf95_error relative_conf95_error count
call: 1.786359e+03 1.678500e+03 1.626000e+03 3.181820e+05 1.260685e+05 1.122802e-01 6.285422e-05 2.200692e-01 1.231943e-04 10000000
For comparison, the same measurement pinned to the wrong CPU (--ham-cpu-affinity 0 on the host), i.e. the CPU that is not attached to the accelerator's PCIe root complex:
$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 --ham-cpu-affinity 0 -c -r 1000000 &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1 --ham-cpu-affinity 1
Result (times are in ns):
HAM-Offload function call runtime:
name average median min max variance std_error relative_std_error conf95_error relative_conf95_error count
call: 2.968820e+03 2.881500e+03 2.086000e+03 2.955695e+06 9.997552e+05 3.161890e-01 1.065033e-04 6.197305e-01 2.087464e-04 10000000
For the help screen, run:
$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 -h &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1
Supported options:
-h [ --help ] Shows this message
-f [ --filename ] arg filename(-prefix) for results
-r [ --runs ] arg (=1000) number of identical inner runs for which the
average time will be computed
  --warmup-runs arg (=1)        number of additional warmup runs
                                 before times are measured
-s [ --size ] arg (=1048576) size of transferred data in byte (multiple of 4)
-a [ --allocate ] benchmark memory allocation/deallocation on
target
-i [ --copy-in ] benchmark data copy to target
-o [ --copy-out ] benchmark data copy from target
-c [ --call ] benchmark function call on target
-m [ --call-mul ] benchmark function call (multiplication) on
target
-y [ --async ] perform benchmark function calls asynchronously
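As a sketch of how these options combine with the setup above, a copy-in measurement (-i) with an explicit transfer size could look like the following; the size and run count are arbitrary illustration values, and the paths and affinities follow the SCIF example above:
$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 --ham-cpu-affinity 8 -i -s 67108864 -r 1000 &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1 --ham-cpu-affinity 1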