
Debugging memory issues

This section gives a brief overview of the tools that help to recognize and identify problems that are caused by excessive memory usage. All programs mentioned here are expected to work in a CMSSW environment.

The memory footprint comes in many flavors, the most notable of which are VSIZE (virtual size) and RSS (resident set size). In simple terms, one can think of RSS as the memory that is "intrinsic" to a given process, while VSIZE additionally includes the overhead associated with loading all shared libraries and binaries that the program is linked against. In a way, VSIZE overestimates the "true" memory usage, while RSS gives a more realistic estimate of the memory footprint. For this reason, many cluster and grid schedulers are configured to monitor the RSS memory of a program and to kill the process instantly if its RSS consumption exceeds a certain threshold. The default limit in most schedulers (such as SLURM or the grid) is 2 GB.
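For a quick check without any extra tools, both numbers are exposed by the Linux kernel in /proc/<pid>/status; the field names below are standard on Linux, and $$ simply stands for the PID of the current shell (replace it with any process ID):

# VmSize corresponds to VSIZE, VmRSS to RSS (both reported in kB)
grep -E 'VmSize|VmRSS' /proc/$$/status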

htop

Simply calling htop brings the user to a table that lists the running processes and threads on the host machine. The table dedicates one row per process (or per thread if the program is multi-threaded), while each column gives more details about the process (or thread):

  • PID or process ID, a unique identifier of a given process;
  • PRI or priority, equal to niceness + 20;
  • NI or niceness, a number between -20 and +20 (default 0) *;
  • VIRT or virtual size memory;
  • RES or RSS memory;
  • SHR or shared memory;
  • S or status of the process;
  • CPU%;
  • MEM%;
  • TIME+ or accumulated CPU time spent by the process;
  • Command (modulo inlined environment variables, redirections and pipes).

htop is great for identifying problems on the host machine (especially if someone is misusing its resources), but not so great for debugging a single program:

  • the update interval may be too coarse to catch the actual peak memory usage;
  • the user has to search for the program that they are interested in, and might never be able to find it if the process is short-lived.

* Niceness basically tells the scheduler how aggressively the process may consume the CPU resources of the host machine: the higher the niceness, the lower the scheduling priority. You can override this setting with the renice command.
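To focus on a single program rather than on the whole machine, htop can be given an explicit list of PIDs; the program name and PID below are only placeholders:

# show only the rows belonging to a given program (comma-separated list of PIDs)
htop -p $(pgrep -d, -f my_program)
# lower the priority of an already running process (higher niceness = lower priority)
renice -n 10 -p 12345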

time

A better alternative to htop for finding peak memory usage is:

/usr/bin/time --verbose [your command with arguments]

After the program finishes, time lists key benchmarks of the process, including the peak RSS memory usage. This program is great for recognizing that there is a memory hog somewhere in your program, but it does not say much about its nature, e.g. whether the memory usage increases gradually over time or spikes suddenly.
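Since GNU time prints its report to stderr, the peak RSS can be extracted directly from the output; the grep pattern below relies on the standard wording of the verbose report, and my_program is a placeholder:

# the peak RSS is reported in the "Maximum resident set size" field (in kB)
/usr/bin/time --verbose ./my_program arg1 arg2 2>&1 | grep 'Maximum resident set size'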

prmon

In order to understand the memory profile a little bit better, one can use prmon:

/home/software/bin/prmon -i 1 -- [your command with arguments]

The above command records the memory usage once per second (-i 1) and, after the program terminates, saves the measurements to prmon.txt and prmon.json files (the file names can be changed with the -f and -j options, respectively). Unlike the time command, it is also possible to attach prmon to an already running program (that you yourself started) via the -p option.
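For instance, attaching to a running process might look like the following sketch (my_program is a placeholder; the options are the same ones described above):

# sample the memory usage of an already running process once per second
/home/software/bin/prmon -i 1 -p $(pgrep -f my_program) -f prmon_attached.txt -j prmon_attached.json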

The log file can be turned into a plot of the VSIZE, PSS, RSS and swap consumption, in units of GB, over the running time of the program with the following command:

/home/software/bin/prmon_plot.py --input prmon.txt --xvar wtime --yvar vmem,pss,rss,swap --yunit GB

Valgrind

The Valgrind bundle refers to a collection of profiling programs, each of which has a specific purpose and range of applications. The most useful of them are described here.

If you're trying to debug an FWLite application with any of the Valgrind tools, you need to make sure that debug symbols are enabled in all of your binaries and libraries. You can do that by rebuilding your project with gdb-friendly compilation flags:

scram b clean
USER_CXXFLAGS="-g -Og" scram b -j8

Note that these compilation flags are also relevant when debugging in gdb. Remember to rebuild your project with the default compilation flags (i.e. without USER_CXXFLAGS) when running your programs in normal operation.

When debugging a program that runs via cmsRun, you might have to use cmsRunGlibC instead.

massif

This program is very similar to prmon in that it needs an additional tool (massif-visualizer) to visualize the memory consumption over time. The main difference is that massif monitors only the heap memory and keeps track of the calls that allocate on the heap (so anything that goes through the new operator in C++). Because of this additional granularity, programs run under massif take longer to finish than under prmon (which presumably just sums up the memory maps).

Example:

valgrind --tool=massif --depth=40 --time-stamp=yes --time-unit=ms --threshold=0.1 \
  [your command with arguments]
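By default, the measurements are written to a file called massif.out.<pid> in the working directory. Besides massif-visualizer, Valgrind also ships a text-based viewer, ms_print, which is handy when no GUI is available:

# print a textual memory graph and the dominant allocation call chains
ms_print massif.out.<pid>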

callgrind

This is another program, not meant to monitor the memory usage per se, but rather a complementary tool that helps to understand how the call graph of a program is constructed. Similarly to massif and prmon, it also requires an extra program to visualize the call graphs.

Example:

valgrind --tool=callgrind [your command with arguments]
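The results are dumped to callgrind.out.<pid>, which can be summarized on the command line with callgrind_annotate (shipped with Valgrind) or explored interactively with kcachegrind, if it is available on the machine:

# list the functions with the largest instruction counts
callgrind_annotate callgrind.out.<pid> | head -n 50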

For instance, it has been useful in identifying a problem with the MEM integration. At first glance, the issue manifested itself as a memory leak, but closer inspection with callgrind revealed that the program had instead entered an infinite loop that kept allocating more and more memory because the integrator was improperly initialized.

memcheck

Finally, memcheck is a heavy-weight tool that produces a very detailed report on the memory usage of your process. In simple terms, it works by "sandboxing" the program, i.e. redirecting its allocation calls to its own implementations, with the aim of keeping track of every single allocation and free operation. If there are more allocations than free operations, this is interpreted as a memory leak. This level of extreme granularity can help to identify even the smallest of leaks. Because of this, memcheck also takes a very long time to run, and in certain applications (such as real-time applications) it is even unusable due to the lag it introduces.

Most of the shared libraries your program is linked against (ROOT, CMSSW) are not necessarily free from memory leaks, either. In fact, memory leaks can be so abundant in 3rd party libraries that they mask a legitimate issue in your own library or executable. For this reason, it is sensible to suppress reports of memory leaks that are, from a practical point of view, completely harmless. memcheck can be instructed to ignore leaks from certain libraries with a suppression file. CMSSW provides its own suppressions via the cmsvgsupp command.

Unlike the other tools mentioned thus far, memcheck neither has nor needs an additional tool to visualize the results. Instead, the user is expected to analyze the generated report themselves. Valgrind's documentation gives a good overview of how to interpret the output of memcheck.

Example:

valgrind --tool=memcheck `cmsvgsupp` \
--leak-check=yes                     \
--show-reachable=yes                 \
--num-callers=20                     \
--track-fds=yes                      \
--track-origins=yes                  \
--log-file="valgrind.log"            \
[your command with arguments]
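The most relevant part of the resulting valgrind.log is usually the leak summary near the end of the report; a quick way to inspect it (the section header is standard memcheck output):

# show the definitely/indirectly/possibly lost and still reachable totals
grep -A 6 'LEAK SUMMARY' valgrind.log

If additional suppressions are needed for your own use case, memcheck can also generate ready-to-use suppression blocks for every reported error via the --gen-suppressions=all option.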

igprof

igprof can be viewed as an alternative to memcheck or massif. It seems to be the standard tool when it comes to profiling submodules in CMSSW. See this Twiki for more.
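As a rough sketch of a memory profiling session (the cmsRun configuration file is a placeholder, and the flags below should be double-checked against the Twiki, which documents the full workflow):

# collect a memory profile (-mp) and write it to a compressed dump
igprof -d -mp -z -o igprof.mp.gz cmsRun my_cfg.py
# turn the dump into a human-readable text report
igprof-analyse -d -v -g igprof.mp.gz > igprof.report.txt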

C++

In the ttH analysis repository, we also have a small class, MemoryLogger, that records the VSIZE and RSS memory at the point where it is called. The class can also handle event loops by keeping track of how many times it has been called. A summary is printed out when its instance goes out of scope, i.e. when its destructor is called. Here's an example of how to use it:

#include "tthAnalysis/HiggsToTauTau/interface/MemoryLogger.h" // MemoryLogger

{
  MemoryLogger memLogger(2); // print first and last two instances in an event loop
  // ...
  memLogger.log(__LINE__); // record VSIZE and RSS at this line
  // ...
  for(...)
  {
    memLogger.log(__LINE__); // record VSIZE and RSS in the event loop
  }
} // memLogger prints out a summary before it is destroyed

This approach can be used if you already know roughly where to look for the leak, and want to use the logging statements to narrow down the offending piece of code. The downside of this class is that it is itself a source of memory consumption, especially if the memory is logged in an event loop spanning millions of events. A remedy is to either run over fewer events in the loop, or to instruct MemoryLogger to keep only the last measurement via its record_last() call.

Debugging runtime problems

This section lists references to profiling tools that help to better understand the internal workings of a program. Of the following examples, perf and Intel VTune may require elevated permissions or, at the very least, modified values in /proc, and thus can only be run in a native environment.

strace

strace is a tool for tracing system calls (aka syscalls), i.e. calls to the kernel. It is very useful for deconstructing an otherwise complicated program that creates lots of child processes. strace can either be attached to a running process via its PID, or it can run the program itself from start to finish. Example of the latter:

strace -s 9999 -f -e trace=execve <some command with arguments> &> out.log

In this example, strace follows execve syscalls, which tell the current process to replace itself with a new program. This call usually comes right after a fork, which is why the -f flag is there: it instructs strace to also follow forks. Finally, -s 9999 configures strace to show up to 9999 characters per syscall argument (the default is just 32 characters).

Given that many more syscalls handle the file system, networking, memory allocation, permissions and process control, all of which go through the kernel, strace can be very helpful in understanding how certain aspects of a program work when the source code is very complicated or simply unavailable.
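For completeness, attaching strace to an already running process could look like this (the PID is a placeholder; openat is the syscall behind most file-open operations on modern Linux):

# follow file-open syscalls of a running process and of any children it forks
strace -f -s 9999 -e trace=openat -p <PID>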

gdb

gdb or the GNU Debugger is a handy tool if you are struggling to find the cause of runtime errors such as segmentation violations, out-of-bounds array accesses etc, and your program is written in C or C++. It works by running the program under its control and keeping track of the call stack. If a runtime error occurs, a stacktrace/backtrace (i.e. the chain of calls at the point of failure) can be printed on screen using the bt command. It can only produce sensible output (such as function names, arguments and line numbers) if there are enough debug symbols present in the binary of your executable (and also in the shared libraries that are linked to the executable). The debug symbols can be enabled with the compilation flag -ggdb3, but it is sometimes useful to accompany it with a flag that optimizes for the debugging experience (-Og). If you're debugging a cmsRun or FWLite application, please follow the instructions given in the Valgrind section. Minimal example:

$ gdb YourProgram # notice that there are no arguments
> r arg1 arg2 ... # run the program with arguments (if any)
> bt              # after your program crashes
> q               # to quit the gdb session

The above is usually enough to debug simple problems. In more complicated settings, one can move between frames and print variables, set breakpoints and step into functions, attach gdb to an already running program etc; these features are not covered in detail here, but a few of the corresponding commands are sketched below.
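A minimal sketch of those commands, in the same style as above (the file name, line number and PID are placeholders):

$ gdb -p 12345            # attach to an already running process by its PID
> break MyAnalyzer.cc:42  # set a breakpoint at a file:line location
> continue                # resume the program until the breakpoint is hit
> next                    # step over the current line ('step' would step into calls)
> print someVariable      # print a variable visible in the current frame
> bt                      # show the backtrace
> frame 2                 # switch to frame #2 of the backtrace
> info locals             # list the local variables of the selected frame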

perf

perf is an advanced performance analyzer that helps, among many other things, to determine performance bottlenecks and so-called hotspots in a given program, count cache misses, and pull statistics about threads, open files etc. There are many resources available, like this one, that go into the details of how to use the program. Since it is a command line tool, there are many 3rd party programs that help with visualizing the results.
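A typical workflow, assuming the kernel settings mentioned above allow profiling (my_program is a placeholder), is to gather overall counters with perf stat and to record a sampled profile with call-graph information for later inspection:

# overall statistics of a full run: cycles, instructions, cache misses, ...
perf stat ./my_program arg1 arg2
# sample the program with call-graph information; the profile is written to perf.data
perf record -g -- ./my_program arg1 arg2
# browse the recorded profile interactively (hotspots, call chains)
perf report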

Intel's VTune

Intel VTune seems to do the same as what perf already does, but with a GUI. As an example, it has been used to run a hotspot analysis on Madgraph NLO event generation, isolating the smatrix_real Fortran subroutine and filtering on the thread that took the longest to run.

The program has a complementary command line interface that can be used to create call graphs from a hotspots analysis, like so:

# run the hotspots analysis, store the results in directory called 'results'
vtune --collect=hotspots --result-dir=results <your command with arguments>
# convert the output stored in directory called 'results' to gprof format and save it to 'results.txt'
vtune --report=gprof-cc --result-dir=results --format=text --report-output=results.txt
# visualize the results
gprof2dot -f axe results.txt | dot -T pdf -o callgraph.pdf

The same recipe has been used, for example, to produce a call graph for LO event generation with Madgraph.