% Performance counters analysis for Hyper-Threading
% Beatrice Bevilacqua, Anxhelo Xhebraj
% March 2019

Performance counters analysis for Hyper-Threading
=================================================


Performance Counters Frameworks
-------------------------------
The growing complexity of modern architectures means that achieving peak
performance requires a deeper knowledge of the underlying hardware.
Following this trend, new interfaces have been made available to
developers for spotting performance bottlenecks in their applications,
such as Performance Monitoring Units (PMUs).

PMUs enable developers to observe and count events in the CPU such as
branch mispredictions, cache misses and other finer-grained details across
the whole pipeline. Although powerful, such information remains burdensome
to work with given the diversity of the events, making it difficult to
truly identify optimization opportunities.
Depending on the processor family, typically about 4 counters can be read
simultaneously at any time through Model Specific Registers (MSRs). In order
to read more than 4 events, various tools multiplex such registers
in a *time-sharing* fashion.
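
For instance, with the `msr` kernel module loaded, such registers can be
read directly from user space with the `rdmsr` utility from `msr-tools`.
A minimal sketch, assuming `msr-tools` is installed; register `0x10` (the
architectural timestamp counter) is used purely as an illustration:

```sh
# Load the msr kernel module, which exposes /dev/cpu/*/msr (root required).
sudo modprobe msr

# Read MSR 0x10 (IA32_TIME_STAMP_COUNTER) on CPU 0, twice:
# the difference approximates the elapsed reference cycles.
sudo rdmsr -p 0 0x10
sleep 1
sudo rdmsr -p 0 0x10
```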

Many tools for performance analysis based on PMUs have been developed,
ranging from *raw* event counts to more sophisticated and aggregated
measures:

* `msr`: direct access to the device files `/dev/cpu/*/msr`
* [PAPI] : the Performance Application Programming Interface, which
  offers a set of APIs for using performance counters.
  Supports multiple architectures and multiplexing.
* [likwid] : A suite of applications and libraries for analysing
  High Performance Computing applications. It
  contains out-of-the-box utilities to work with MPI,
  power profiling and architecture topology.
* [Intel VTune Amplifier] : Application for performance analysis on
  Intel architectures. Gives insights into possible bottlenecks
  by annotating the application's source code and provides
  possible solutions.
* [perf] : In a similar vein to Intel VTune Amplifier, shows which
  functions are most critical to the application. Additionally, it
  provides higher-level information such as I/O and networking.
  It can also expose raw hardware performance counters, but
  its main goal is abstracting over them (see the example after this list).
* [pmu-tools] : A collection of tools for profile collection
  and performance analysis on Intel CPUs, built on top of Linux perf.
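
As a small illustration of the raw-versus-abstracted distinction, `perf`
can count both symbolic and raw PMU events. A sketch, where the binary
name and the raw event code are placeholders:

```sh
# Symbolic events abstracted by perf (portable across CPUs).
perf stat -e cycles,instructions,cache-misses ./myapp

# Raw PMU event, encoded on x86 as (umask << 8) | event_select;
# 0x01c2 (umask 0x01, event 0xC2) is shown purely as an illustration.
perf stat -e r01c2 ./myapp
```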


`likwid`
--------
Given that the goal of this document is to analyse system behaviour
through performance counters and provide insights into possible new
scheduling strategies for Hyper-Threading systems, we chose the `likwid`
applications and libraries for our task. The choice was driven especially
by the useful benchmarks in the `likwid` repository for stressing the FPU
and other core subsystems. Additionally, Intel VTune Amplifier was used to
profile the benchmarks in order to characterize their workload.

`likwid-perfctr -e` lists all the events available on the current
architecture, while `likwid-perfctr -a` shows the pre-configured event
sets, called performance groups, each bundling useful pre-selected events
and derived metrics. Multiple modes of execution are available, as
documented in the `likwid` wiki. Of main interest are **wrapper mode**
and **timeline mode**. The former produces a summary of the events at the
end of the run, while the latter outputs performance metrics at a
frequency specified through the `-t` flag.
When multiple groups need to be monitored, multiplexing is performed at
the granularity set through the `-t` flag in timeline mode (`-T` in
wrapper mode), and each output line reports the id of the group read at a
given timestep together with its values.
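
The sketch below shows the two modes; the core list, group names,
intervals and the `./myapp` binary are illustrative:

```sh
# Wrapper mode: pin the app to cores 0-3 and print a summary of the
# FLOPS_SP performance group when it terminates.
likwid-perfctr -C 0-3 -g FLOPS_SP ./myapp

# Timeline mode: print the group's metrics every 100 ms while running.
likwid-perfctr -C 0-3 -g FLOPS_SP -t 100ms ./myapp

# Two groups multiplexed in wrapper mode, switching every 500 ms.
likwid-perfctr -C 0-3 -g FLOPS_SP -g L3 -T 500ms ./myapp
```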

Regarding timeline measurements at fine granularity, the `likwid` wiki
notes the following caveat:

> Tests have shown that for measurements below 100 milliseconds, the
> periodically printed results are not valid results anymore (they are higher
> than expected) but the behavior of the results is still valid. E.g. if you
> try to resolve the burst memory transfers, you need results for small
> intervals. The memory bandwidth for each measurement may be higher than
> expected (could even be higher than the theoretical maximum of the machine)
> but the burst and non-burst traffic is clearly identifiable by highs and
> lows of the memory bandwidth results.


Benchmarks
----------

The benchmarks available in `likwid` can be run through the `likwid-bench`
command. For an overview of the available benchmarks, run `likwid-bench -a`.
All benchmarks perform operations over one-dimensional arrays. The benchmarks
used in our setting are:

* `ddot_sp`: Single-precision dot product of two vectors, only scalar
operations
* `copy`: Double-precision vector copy, only scalar operations
* `ddot_sp_avx`: Single-precision dot product of two vectors, optimized for AVX
* `sum_int`: Custom benchmark similar to `sum` but working on integers

All benchmarks are run under multiple configurations of thread count (with or
without Hyper-Threading), processor frequency (with TurboBoost disabled) and
working set size. The latter is needed in order to emulate *core-bound*
executions (working set fitting in cache) and *memory-bound* ones.
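
The sketch below shows how the working set size selects the regime; the
sizes and thread counts are illustrative:

```sh
# Core-bound: a 20 kB working set fits comfortably in cache;
# 2 threads allocated on socket 0.
likwid-bench -t ddot_sp -w S0:20kB:2

# Memory-bound: a 2 GB working set far exceeds the last-level cache.
likwid-bench -t ddot_sp -w S0:2GB:2
```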


Details
-------

The tests were run on a Dell XPS 15 9570 with an i7-8750H. With TurboBoost disabled,
the available frequencies range from 1.0 to 2.2 GHz. The machine has one socket
with 6 physical cores and 12 logical cores (with Hyper-Threading enabled).
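
For reference, the topology and the fixed frequencies can be inspected and
set with the companion `likwid` tools; the frequency value below is one of
the available steps:

```sh
# Print sockets, cores, hardware threads and the cache hierarchy.
likwid-topology -g

# Pin all cores to 2.2 GHz for reproducible measurements (root required).
likwid-setFrequencies -f 2.2
```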

[PAPI]: http://icl.utk.edu/papi/

[likwid]: https://github.com/RRZE-HPC/likwid

[Intel VTune Amplifier]: https://software.intel.com/en-us/vtune

[perf]: http://www.brendangregg.com/perf.html

[pmu-tools]: https://github.com/andikleen/pmu-tools
