Skip to content

Benchmark convolution implementations in C++ with Catch2 visualized with Vega-Lite

License

Notifications You must be signed in to change notification settings

JohT/convolution-benchmarks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Benchmark Convolution Implementations

This repository takes different C++ implementations of the convolution algorithm and provides fully automated benchmarks visualized as charts. It might also be used as a starting point or template for other C++ benchmark projects.

๐Ÿš€ Features

๐Ÿ“– Related Blog Articles

๐Ÿ“ˆ Results

BenchmarkCharts.md contains the benchmark results visualized as bar charts.

โš’๏ธ Tools

  • Needs CMake (e.g. with the Visual Studio Installer) to build the project.
  • Needs node-canvas dependencies to create SVG vector graphics files.
  • Needs nodejs to build the JavaScript based charts.
  • Recommends Ninja as "a small build system with a focus on speed".
  • Uses cpm as a "Setup-free CMake dependency management" for C++.
  • Uses Catch2 as Unit-Test and Benchmark-Framework for C++.
  • Uses vega-lite to visualize the benchmark results as a bar chart.
  • Uses GitHub Actions to fully automate build, benchmarks and visualization.
  • Uses Renovate to update the dependencies automatically.

โšก๏ธ Commands

โš™๏ธ Compiler Options for vectorization reports

The following compiler options are used to get detailed vectorization reports. They are defined in CompilerOptions.cmake.

CLang

These compile options are used as described in Auto-Vectorization in LLVM and Clang command line argument reference:

  • -Rpass=loop-vectorize identifies loops that were successfully vectorized.
  • -Rpass-missed=loop-vectorize identifies loops that failed vectorization and indicates if vectorization was specified.
  • -Rpass-analysis=loop-vectorize identifies the statements that caused vectorization to fail. If in addition -fsave-optimization-record is provided, multiple causes of vectorization failure may be listed (this behavior might change in the future).
  • -fsave-optimization-record generate a YAML optimization record file.

GCC

These compile options are used with GCC as described in GCC Developer Options:

  • -fopt-info-vec-optimized prints information when an optimization from vectorization passes is successfully.
  • -fopt-info-vec-missed prints information about missed optimization opportunities from vectorization passes.

MSVC

These compile options are used with MSVC as described in Auto-Vectorizer Reporting Level and Auto-Parallelization and Auto-Vectorization:

  • /Qvec-report:2 outputs an informational message for loops that are vectorized and for loops that are not vectorized, together with a reason code.

  • /arch:AVX specifies the architecture for code generation (here AVX).

๐Ÿ”ญ Online Tools

Godbolt Compiler Explorer

Godbolt Compiler Explorer is great tool to analyze small blocks of code and get insights in what the compiler does. Its ideal to try things out. The following two links show an easy yet fast convolution implementation, that works well with compiler optimization and vectorization.

Quick C++ Benchmark

Quick C++ Benchmark is another great tool that compares the performance of two code blocks with each other.

Copy & Paste the following code snipped into Quick C++ Benchmark. It shows the difference between the kernel and the input values as outer loop with a "direct form transposed" convolution implementation.

Code Snipped to compare the iteration over the kernel values in the inner vs. the outer loop
#include <vector>
#include <span>
#include <random>

template<typename T>
std::vector<T> randomNumbers(int n, T min, T max)
{
    std::vector<T> vectorWithRandomNumbers(n, T());
    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_real_distribution<T> distribution(min, max);
    for (T &value : vectorWithRandomNumbers)
    {
        value = distribution(gen);
    }
    return vectorWithRandomNumbers;
}

template<typename ValueType>
void convolutionKernelInner(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    const int inputLength = input.size();
    const int kernelLength = kernel.size();
    
    for (auto inputIndex = 0; inputIndex < inputLength; ++inputIndex)
    {
      const auto inputValue = input[inputIndex];
      for (auto kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
        {
            output[inputIndex + kernelIndex] += inputValue * kernel[kernelIndex];
        }
    }
}

template<typename ValueType>
void convolutionKernelOuter(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    const auto inputLength = input.size();
    const auto kernelLength = kernel.size();
    
    for (auto kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
    {
        const auto kernelValue = kernel[kernelIndex];
        for (auto inputIndex = 0; inputIndex < inputLength; ++inputIndex)
        {
            output[inputIndex + kernelIndex] += input[inputIndex] * kernelValue;
        }
    }
}

static void kernelInner(benchmark::State& state) {
  // Code before the loop is not measured
  const auto & input = randomNumbers(16384, -1.0F, 1.0F);
  const auto & kernel = randomNumbers(16, 0.0F, 1.0F);

  const auto convolutionLength = input.size() + kernel.size() - 1;
  auto output = std::vector<float>(convolutionLength, 0.0F);
    
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    convolutionKernelInner(std::span(input), std::span(kernel), std::span(output));
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(output);
  }
}
// Register the function as a benchmark
BENCHMARK(kernelInner);

static void kernelOuter(benchmark::State& state) {
  // Code before the loop is not measured
  const auto & input = randomNumbers(16384, -1.0F, 1.0F);
  const auto & kernel = randomNumbers(16, 0.0F, 1.0F);

  const auto convolutionLength = input.size() + kernel.size() - 1;
  auto output = std::vector<float>(convolutionLength, 0.0F);
    
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    convolutionKernelOuter(std::span(input), std::span(kernel), std::span(output));
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(output);
  }
}
// Register the function as a benchmark
BENCHMARK(kernelOuter);

๐Ÿ“œ History

This repository was initially intended to explore Single Instruction Multiple Data (SIMD) in C++. Since convolution is such an essential part of filtering in digital signal processing and a central part of convolutional neuronal networks (CNN), it seemed obvious to try that first.

A complex topic like this benefits from a "experiment - evaluate" approach to get started. Catch2 is used to assure, that the implementations are correct. It is also used to benchmark their performance. Compiler Options for vectorization reports get insights into what the compiler does. Online-Tools are used to exploratory experiment with implementations. Finally, to get a visual response, vega-lite is used to create bar charts of the benchmark results.

๐Ÿ”Ž References

About

Benchmark convolution implementations in C++ with Catch2 visualized with Vega-Lite

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages