Skip to content

Latest commit

 

History

History
1553 lines (1245 loc) · 48.2 KB

README.adoc

File metadata and controls

1553 lines (1245 loc) · 48.2 KB
Kokkidio Logo

Overview

Kokkidio is a header-only template library designed to provide interoperability between the linear algebra library Eigen and the performance portability framework Kokkos. Its aim is to allow the easy-to-read, succinct source code of Eigen to be compiled to fast machine code on all the platforms supported by Kokkos.

It consists of three interlocking elements:

  • An iteration range class (ParallelRange), which allows threads to efficiently access their assigned elements,

  • parallel dispatch functions (parallel_for and parallel_reduce) compatible with functors taking ParallelRange, and

  • the data structures ViewMap and DualViewMap, which combine an Eigen::Map and a Kokkos::View.

It allows you to write code like this:

ViewMap construction, copying data, and example computation (SAXPY) with Kokkidio

(taken from example/axpy.cpp)

using namespace Kokkidio;
float a {0.5};
int size {10};

using FloatArray = DualViewMap<Eigen::ArrayXf>;
/* You may have an existing Eigen object */
Eigen::ArrayXf x_existing {size};
/* No need to replace it. To make it accessible inside a Kokkos functor
 * (and thus also on GPUs), you can simply wrap it in a (Dual)ViewMap: */
FloatArray x {x_existing};

/* Of course, you can also construct (Dual)ViewMaps from sizes */
FloatArray y {size}, z {size};

/* You can use Kokkos functions on (Dual)ViewMaps, because their members
 * "MapView::view", and
 * "DualViewMap::view_<target>()"
 * return a Kokkos::View */
Kokkos::deep_copy( y.view_host(), 123 );

/* Likewise, you can use Eigen functions on (Dual)ViewMaps, as their members
 * "MapView::map", and
 * "DualViewMap::map_<target>()"
 * return an Eigen::Map.
 * Outside of a parallel dispatch, only the host side is accessible. */
x.map_host().setRandom();
y.map_host().setRandom();

/* Copying data between host and compute target is simple: */
x.copyToTarget(); // if the compute target is the host, this does nothing
y.copyToTarget();

/* This is how a parallel computation on the target is performed: */
parallel_for( size, KOKKOS_LAMBDA(ParallelRange<> rng){
	rng(z) = a * rng(x) + rng(y);
});
/* After the computation, you may copy the results back to the host */
z.copyToHost();

Motivation

While Eigen offers an expressive syntax and extensive functionality for linear algebra operations, its abstractions complicate the process of adapting it to parallel execution on GPUs. Kokkidio was conceived as a way to streamline this process by providing a lightweight, header-only library that bridges the gap between Eigen and the portability framework Kokkos.

Why Eigen?

Eigen is a linear algebra library, which allows writing any such operations in a readable, succinct manner, while compiling to highly optimised machine code. Instead of implementing matrix, vector, or coefficient-wise operations using loops, the 'Eigen way' is to operate on whole data structures using their overloaded operators or member functions. Here’s an example:

SAXPY in pure C++
std::size_t size {10};
double a {0.5};
std::vector<double> x {size}, y {size}, z {size};
/* fill vectors in some way ... */
/* then loop over them and perform the computation element-wise */
for (std::size_t i=0; i<size; ++i){
	z[i] = a * x[i] + y[i];
}
SAXPY in Eigen
Eigen::Index size {10};
float a {0.5};
Eigen::VectorXf x {size}, y {size}, z {size};
/* fill vectors in some way, e.g. using member func setRandom() ... */
/* The computation is expressed with the whole object: */
z = a * x + y;

In addition to the benefits of letting a well-tested library handle common operations, Eigen also performs extensive optimisations on the underlying loops. This notably includes explicit vectorisation for all common CPU vector extensions, such as the SSE and AVX families on x86, ARM NEON, and others.

Why Kokkos?

Kokkos is a portability and parallelism framework, which allows users to write code that compiles and runs on a wide range of hardware, especially both CPUs and GPUs. It provides data structures (Kokkos::View), as well as functions for parallel dispatch, and may be viewed as a common abstraction for various programming models and backends, e.g. OpenMP, including target offloading, OpenACC, CUDA, HIP, and SYCL. Here’s an example:

SAXPY with Kokkos
float a {0.5};
std::size_t dim1 {10};
/* for more details, see
 * https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#constructing-a-view
 */
using View = Kokkos::View<float*, Kokkos::DefaultExecutionSpace>;
View x {dim1}, y {dim1}, z {dim1};
/* fill arrays in some way,
 * e.g. using deep_copy or within a parallel dispatch ... */

/* and now do the computation in parallel */
Kokkos::parallel_for( dim1, KOKKOS_LAMBDA(std::size_t i){
	z(i) = a * x(i) + y(i);
});

While modern compilers already allow targeting multiple CPU architectures, due to ISAs functioning as a reliable standard (x86(-64), ARM, …​), GPU architectures and programming models are much less intercompatible. If you wrote your code using CUDA, but want to run it on an Intel PonteVecchio GPU, you will have to rewrite large parts of it.

Using Kokkos for new projects allows you to avoid that type of effort. For existing projects, it allows a step-by-step migration, because backend-specific code will still compile when using Kokkos on top of it.

The problems

There exist a number of problems when trying to use Eigen’s functionality on GPUs (e.g. with Kokkos), which Kokkidio aims to address.

  1. Data structures
    Since version 3.4, Eigen functions can be called from CUDA/HIP kernels, and its compatibility with SYCL is in a usable state since at least early 2024 and under active development. However, only fixed-size objects can be created on GPUs — dynamically-sized data structures are only available on and from the host. There is the Eigen::Map class, however, through which other data structures can be treated like Eigen objects, if they expose a data pointer and their memory layout and stride(s) are known.

  2. Loop abstraction and parallelism 1
    In the Kokkos example above, you can see the functor passed to Kokkos' parallel_for (the KOKKOS_LAMBDA in that case) containing the instructions for an individual loop iteration. By contrast, the loop is fully abstracted in the Eigen example. There is no user-facing loop iteration variable in an Eigen operation that could be used to parallelise operations.

    Of course, that loop still exists, deep within the Eigen library. While it could potentially be parallelised there, the interface of Eigen is not set up to facilitate anything like this. Therefore, sweeping changes to the whole library would be necessary, with all the compatibility and testing issues that brings.

  3. Parallelism 2: Chunk sizes, vectorisation, and readability
    When parallelising an Eigen operation as a library user, the ideal strategy for assigning work to threads is rather different between CPUs and GPUs. On a GPU, the large number of relatively weak cores leads to many small units of work performing better. By contrast, on a CPU, fewer, larger, and contiguous chunks per thread are preferable, to allow for vectorised operations.

    Here’s an example to illustrate this difference. We’re not using SAXPY, but instead multiply matrix columns, so that individual threads still use Eigen functionality.

    Multiply matrix columns

    (taken from colmult_eigen.cpp)

    int nRows {4}, nCols {1000};
    Eigen::MatrixXd a {nRows, nCols}, b;
    b.resizeLike(a);
    /* fill matrices in some way ... */
    double result; // let's sum up the results to not need another array
    /* One could do a nested loop and manually implement the dot product.
     * We skip that here, because for that you wouldn't use Eigen */
    
    /******************************************************************************/
    /* OPTION 1, better on GPUs */
    /******************************************************************************/
    /* Distribute individual column-multiplications,
     * as one might do on a GPU, if nCols >> nRows */
    result = 0;
    for (int i=0; i<nCols; ++i){
    	result += a.col(i).transpose() * b.col(i);
    	/* this is equivalent: */
    	// result += a.col(i).dot( b.col(i) );
    }
    
    /******************************************************************************/
    /* OPTION 2, better on CPUs */
    /******************************************************************************/
    /* Distribute blocks of the matrices to threads and let Eigen
     * handle the loop over columns, as may be preferable on a CPU.
     * This can be a lot faster, as it allows Eigen to vectorise the operation. */
    result = 0;
    int nCores {4}; // just for illustration
    int nColsPerCore {nCols / nCores}; // not handling remainders
    
    for (int i=0; i<nCores; ++i){
    	int firstCol {i * nColsPerCore};
    	result += (
    		a.middleCols(firstCol, nColsPerCore).transpose() *
    		b.middleCols(firstCol, nColsPerCore)
    	).trace(); // trace = sum of the diagonal
    }

    Not only do the loop bodies differ quite significantly, but Eigen’s advantage of short and readable source code is also somewhat diminished, when parallelising code in this fashion.

Approaching solutions

Kokkidio solves the first issue of Eigen’s data structures not being compatible with GPU programming models by wrapping a Kokkos::View in a class called ViewMap. It provides access to the contained Kokkos::View, and additionally uses Eigen’s Map feature to allow the contained data to be treated like an Eigen object — even on GPUs. More detail is provided in the section Data structures.

To address the issues regarding parallelism, vectorisation, and readability, Kokkidio provides an iteration range class called ParallelRange and parallel dispatch functions compatible with it. These are detailed in the next section.

Parallel dispatch

ParallelRange

The class template ParallelRange fulfils two functions: Firstly, given some total number or range of items to be processed, it contains the index or index range associated with the executing thread, and secondly, when applied to an Eigen or (Dual)ViewMap object, returns the data at that index or index range as an Eigen::Block expression.

Its template parameter target can take either of the two values of the Target enumeration, host (CPU) or device (e.g., GPU):

  • When target==device, then ParallelRange stores a single index. Applying it to an Eigen or (Dual)ViewMap object returns either a single element, if the object is one-dimensional, or a column expression, if the object is two-dimensional. By default, Eigen objects are column-major, which is the reason behind this choice.

  • When target==host, then ParallelRange stores a starting index and number of elements. Applying it to an Eigen object or (Dual)ViewMap then returns a contiguous block of elements, using Eigen::segment on 1D objects, and Eigen::middleCols on 2D objects.

(Ranges of) rows instead of columns are also available, but require the explicit use of a member function (ParallelRange::rowRange), rather than ParallelRange::operator().

Synopsis

Expand synopsis of ParallelRange
template<Target _target = DefaultTarget>
class ParallelRange : public EigenRange<_target> {
public:
	static constexpr Target target {_target};
	using Base = EigenRange<target>;
	static constexpr bool
		isDevice {target == Target::device},
		isHost   {target == Target::host};
	using MemberType = std::conditional_t<isHost, IndexRange<Index>, int>;
	using ChunkType = EigenRange<target>;
	using ChunkInfoType = ChunkInfo<target>;

private:
	MemberType m_rng;
	ChunkInfoType m_chunks;
public:
	KOKKOS_FUNCTION ParallelRange() = default;

	/* ParallelRange can be instantiated with:
	 * - an integer,
	 * - a Kokkidio::IndexRange, or
	 * - a Kokkos::RangePolicy.
	 */
	template<typename Policy>
	KOKKOS_FUNCTION ParallelRange( const Policy& );

/* inherited from EigenRange: */
	KOKKOS_FUNCTION const MemberType& get() const;
	KOKKOS_FUNCTION       MemberType& get();

	KOKKOS_FUNCTION IndexRange<Index> asIndexRange() const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> colRange( EigenObj&& obj ) const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> rowRange( EigenObj&& obj ) const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> range( EigenObj&& obj ) const;

	/* effectively the same as range(...) */
	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> operator() ( EigenObj&& obj ) const;

/* specific to ParallelRange */
	template<typename Func>
	KOKKOS_FUNCTION void for_each( Func&& func ) const;

	template<typename Func>
	KOKKOS_FUNCTION void for_each_chunk(Func&& func) const;

	KOKKOS_FUNCTION ChunkType make_chunk(Index i) const;
	KOKKOS_FUNCTION const ChunkInfo<target>& chunkInfo() const;
	KOKKOS_FUNCTION inline constexpr Index   chunkSize() const;
	KOKKOS_FUNCTION inline constexpr Index   nChunks  () const;
	KOKKOS_FUNCTION void setChunks(Index chunkSizeMax = chunk::defaultSize);
};


/* if you wish to call another function taking Eigen objects,
 * and wish to apply a range to each of the arguments, you can write
 * apply_range(someFunc, range, someFunc_arg1, someFunc_arg2, ...); */
template<Func, Target t, typename ... Ts>
void apply_range(Func&&, const ParallelRange<t>&, Ts&& ... args);

Parallel dispatch functions

Kokkidio provides drop-in replacements for Kokkos' parallel dispatch functions:

  • parallel_for, for general tasks, and

  • parallel_reduce, for reductions.

The main difference to their Kokkos equivalents is, that they allow passing a functor which takes a ParallelRange as its (first) argument, e.g.:

parallel_for(someSizeOrPolicy, KOKKOS_LAMBDA(ParallelRange<target> rng){
	/* do something with rng ... */
});

On device (e.g., GPU), this chains to Kokkos::parallel_[for|reduce], and constructs a ParallelRange<device> from a single element index. On host (CPU), this calls a Kokkidio-specific function emulating OpenMP-logic for distributing work items evenly to threads. The index range of work items consists of a start index and a number of items, from which a ParallelRange<host> is created. This, when applied to an Eigen or (Dual)ViewMap object, returns a contiguous Eigen::Block of data, corresponding to the index range.

If a functor is provided that does not take a ParallelRange as its first parameter, Kokkidio's parallel dispatch functions simply forward to their Kokkos equivalent (except for CPU execution with SYCL).

The first parameter (someSizeOrPolicy in the example above) can be an integer number, a Kokkos::RangePolicy, or a Kokkidio::IndexRange

A variant of these is parallel_[for|reduce]_chunks. See section (Chunk) buffers for details.

Examples

parallel_for

(shortened from SAXPY example in in overview)

using namespace Kokkidio;
float a {0.5};
int size {1000};

/* create and set the input (x, y) and output (z) arrays */
using FloatArray = DualViewMap<Eigen::ArrayXf>;
FloatArray x {size}, y {size}, z {size};

x.map_host().setRandom();
y.map_host().setRandom();

x.copyToTarget();
y.copyToTarget();

/* perform parallel computation (1D) */
parallel_for( size, KOKKOS_LAMBDA(ParallelRange<> rng){
	rng(z) = a * rng(x) + rng(y);
});

/* Copy the results back to host */
z.copyToHost();
parallel_reduce

(taken from colmult.cpp, see also Multiply matrix columns)

using namespace Kokkidio;
int nRows {4}, nCols {1000};

/* create and set the input matrices */
using MatrixView = DualViewMap<Eigen::MatrixXd>;
MatrixView a {nRows, nCols}, b;
b.resizeLike(a);

a.map_host().setRandom();
b.map_host().setRandom();

a.copyToTarget();
b.copyToTarget();

double result = 0;
/* perform parallel computation and reduction (2D -> column range) */
parallel_reduce(
	a.cols(),
	KOKKOS_LAMBDA(ParallelRange<> rng, double& sum){
		sum += ( rng(a).transpose() * rng(b) ).trace(); // trace = sum of the diagonal
		/* equivalent: sum of coefficient-wise products */
		sum += ( rng(a).array() * rng(b).array() ).sum();
	},
	redux::sum(result)
);

(Chunk) buffers

When intermediate results are needed in a computational loop, their implementation in (scalar) C++ is straightforward, as one would simply define a stack variable:

std::vector<double> arr {size};
for (double& a : arr){
	/* let's say we need the result of some expensive function multiple times */
	double helperVar = std::pow(a, 0.3);
	a = helperVar + 1 / helperVar;
}

However, Eigen offers no obvious way to define intermediate results as stack variables, as the loop is abstracted. Instead, we could use an array buffer:

Eigen::ArrayXd arr {size}, buf {size};
buf = arr.pow(0.3);
arr = buf + 1 / buf;

This may not be ideal, as a whole array is allocated, even though only a couple of values are needed, and additional memory accesses are introduced.

Kokkidio offers a ChunkBuffer structure, which alleviates this issue within a parallel dispatch. The ChunkBuffer requires an Eigen type (ColType) as its first template parameter. Here, users provide the fixed-size type closest to what they’d need in a single loop iteration, e.g., for the example above, an Eigen::Array<double, 1, 1> (a double array with a single element). This allows users to define whether they require Eigen::Array or Eigen::Matrix behaviour.

The second template parameter is the Target enumeration. On host, a ChunkBuffer contains one array per thread (using omp_get_max_threads), with the same number of rows as ColType, and the chunk size of the parallel loop as the number of columns. A reasonable default is chosen for the chunk size, but it can also optionally be specified via a Kokkos::RangePolicy. This improves the likelihood of the buffer data remaining in cache.

On device, a ChunkBuffer contains no data. Instead, it only wraps ColType. When accessing this buffer inside a parallel dispatch, the type gets instantiated — and because fixed-size types are allocated on the stack, this is equivalent to defining stack variables in a C-style loop.

Usage:

ViewMap<Eigen::ArrayXd> arr {size};

/* number of intermediate values = number of rows */
constexpr int numVals {1};
using Array1d = Eigen::Array<double, numVals, 1>;

/* create buffer */
auto chunkBuf = makeBuffer<Array1d, target>(size);

parallel_for_chunks<target>(size, KOKKOS_LAMBDA(EigenRange<target> chunk){
	/* access buffer inside parallel dispatch */
	auto buf = getBuffer(chunkBuf, chunk);

	/* use the EigenRange "chunk" like a ParallelRange */
	buf = chunk(arr).pow(0.3);
	chunk(arr) = buf + 1 / buf;
});

The dispatch function parallel_for_chunks is a shortcut for the following:

parallel_for<target>(size, KOKKOS_LAMBDA(ParallelRange<target> rng){
	rng.for_each_chunk( [&](auto chunk){
		/* same body */
	});
});

See EigenRange for the data type of the chunk parameter.

Data structures

ViewMap

The core of the ViewMap class (see file) are the two member functions map() and view(), which return an Eigen::Map, and a Kokkos::View respectively, and thus allow it to be used in either library’s functions.

ViewMap takes two template parameters:

  1. EigenType: The Eigen class to be used as the map type, e.g. Eigen::MatrixXd or Eigen::Array3i. The return type of map() behaves the same way as this type. Only dense types are currently supported.

  2. A Target enumeration value, which can be either host or device. This parameter is optional. Its default value matches Kokkos::DefaultExecutionSpace.

ViewMap can be instantiated either using an existing Eigen object, or using the same size parameters as you would for the Eigen type. Here’s what happens when you create a ViewMap:

  1. With an existing Eigen object:

    1. Instantiation on Target::host: No allocation is performed. An unmanaged Kokkos::View is created, using the existing object’s data pointer and sizes.

    2. Instantiation on Target::device: the Eigen object’s sizes are used to create a matching managed Kokkos::View on the device.

  2. With size parameters: A managed Kokkos::View is created using these sizes on Target. The same size parameters are allowed as for the respective Eigen type. This means, creating vector types (1D) requires only a single size parameter, and fixed size types can be created without them.

In all of the above cases, the data pointers of view() and map() contain the same address. Furthermore, when instantiating a ViewMap with a non-const, owning Eigen object (i.e. not itself an Eigen::Map), a non-owning pointer to the object is stored to allow resizing both the Kokkos::View and the Eigen object via ViewMap::resize().

Examples

Expand ViewMap examples

(taken from examples/ViewMap.cpp)

using namespace Kokkidio;
int nRows {10}, nCols {20};

/* existing Eigen object */
Eigen::ArrayXXd eigenArray {nRows, nCols};

/* Create ViewMap using a constructor or factory function.
 * Deduces Eigen type, and uses default target */
ViewMap mv1 {eigenArray};
auto mv2 = viewMap(eigenArray);

/* Create ViewMap using factory function for specific target,
 * while deducing Eigen type */
auto mv3 = viewMap<Target::host>(eigenArray);

/* Create ViewMap using size parameters.
 * ArrayXXd is dynamically sized in both dimensions,
 * so two parameters are required */
ViewMap<Eigen::ArrayXXd> mv4 {nRows, nCols};

/* ArrayXd is a column vector, so only rows are required */
ViewMap<Eigen::ArrayXd> mv5 {nRows};

/* Array3d is a fixed size type, so no parameters are required */
ViewMap<Eigen::Array3d> mv6;

/* set values on host, using Eigen's assignment operator on ViewMap::map() */
mv1.map() = 1;

/* set values on target, using Kokkos::deep_copy with ViewMap::view() */
Kokkos::deep_copy(mv2.view(), 2);

/* set values on target with parallel dispatch: */
/* with Kokkidio::ParallelRange */
parallel_for( mv3.cols(), KOKKOS_LAMBDA(ParallelRange<> rng){
	rng(mv3) = 3;
});

/* or just an integer, using the standard Kokkos-style */
parallel_for( mv4.size(), KOKKOS_LAMBDA(int i){
	mv4.data()[i] = 4;
});

Synopsis

Expand synopsis of ViewMap
template<typename _EigenType, Target targetArg = DefaultTarget>
class ViewMap {
public:
	static constexpr Target target { ExecutionTarget<targetArg> };
	using EigenType_host = _EigenType;
	/* EigenType_host and EigenType_target may differ in const-ness */
	using EigenType_target = std::conditional_t<target == Target::host,
		EigenType_host,
		std::remove_const_t<EigenType_host>
	>;

	using ThisType = ViewMap<EigenType_target, target>;

	using Scalar     = typename EigenType_target::Scalar;
	using MapType    = Eigen::Map<EigenType_host>;
	/* only types with a continuous memory layout are currently supported */
	static_assert( is_contiguous<EigenType_target>() );

	/* Translations of "target" into Kokkos spaces */
	using MemorySpace    = Kokkidio::MemorySpace   <target>;
	using ExecutionSpace = Kokkidio::ExecutionSpace<target>;
	/* The Kokkos::View data type is either fully dynamic or fully fixed-size,
	 * i.e. Scalar** or Scalar[nRows][nCols],
	 * and always uses LayoutLeft */
	using ViewType   = Kokkos::View<..., Kokkos::LayoutLeft, MemorySpace>;
	using HostMirror = typename ViewType::HostMirror;

public:

	/* constructors */
	ViewMap(); // default, allocation only for fixed size types
	ViewMap(Index size); // 1D types
	ViewMap(Index rows, Index cols); // 2D types
	ViewMap( _EigenType& hostObj ); // existing Eigen objects

	/* "resize" and constructors can only be called from host */
	void resize(Index rows, Index cols);

	/* get some info about type and status */
	KOKKOS_FUNCTION constexpr bool isManaged() const;
	KOKKOS_FUNCTION bool isAlloc() const;

	/* data pointer */
	KOKKOS_FUNCTION Scalar* data();
	KOKKOS_FUNCTION const Scalar* data() const;

	/* get Eigen::Map */
	KOKKOS_FUNCTION MapType map() const;

	/* and Kokkos::View */
	KOKKOS_FUNCTION ViewType view() const;

	/* sizes */
	KOKKOS_FUNCTION Index rows() const;
	KOKKOS_FUNCTION Index cols() const;
	KOKKOS_FUNCTION Index size() const;
};

/* detection */
template<typename T>
inline constexpr bool is_ViewMap_v = ...;


/* factory functions */

/* specify target, deduce EigenType */
template<Target target = DefaultTarget, typename EigenType>
ViewMap<EigenType, target> viewMap( EigenType& eigenObj );

/* specify EigenType, optionally specify target, fixed size */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap();

/* specify EigenType, optionally specify target, 1D */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap(Index vectorSize);

/* specify EigenType, optionally specify target, 2D */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap(Index rows, Index cols);

DualViewMap

DualViewMap (see file) is designed to facilitate easy data exchange between host and the compute Target. To this end, it provides the member functions copyToTarget() and copyToHost().

It takes the same template parameters as ViewMap, i.e. an Eigen type, and a Target value. While a ViewMap only exists on either host or device, DualViewMap always consists of two ViewMaps, of which one is located on host, and the other on the specified Target. If the Target is also host, then the two views are identical, and copyTo…​() operations are correspondingly skipped.

To access the ViewMaps, it provides the member functions get_host() and get_target(), as well as shortcuts to their map()/view() member functions in the form of map_host()/map_target() and view_host()/view_target().

Similar to ViewMap, it also allows to resize() its data, and does so on both host and the specified Target.

Examples

Expand DualViewMap examples
using namespace Kokkidio;
int nRows {10}, nCols {20};

/* existing Eigen object */
Eigen::ArrayXXd eigenArray {nRows, nCols};
/* By default, when initialising with an Eigen object,
 * the object's data is copied to the target.
 * This behaviour be changed with an optional parameter: DontCopyToTarget */
DualViewMap d1 {eigenArray};
auto d2 = dualViewMap(eigenArray, DontCopyToTarget);
/* Otherwise, a DualViewMap can be created in exactly the same ways as a
 * ViewMap, so please refer to ViewMap.cpp for more examples. */

/* with DualViewMap, you can set your values on host,
 * then copy them to the target: */
d2.map_host() = 123;
d2.copyToTarget();

auto print = [&](std::string_view descriptor){
	std::cout
		<< "d2, values on host, " << descriptor << ":\n"
		<< d2.map_host() << '\n';
};
print("before");

/* Now you can do some computations on the target,
 * then copy the values back */
parallel_for(d2.cols(), KOKKOS_LAMBDA(ParallelRange<> rng){
	rng(d2) += 1;
});
d2.copyToHost();

print("after");

Synopsis

Expand synopsis of DualViewMap
template<typename _EigenType, Target targetArg = DefaultTarget>
class DualViewMap {
public:
	static constexpr Target target { ExecutionTarget<targetArg> };
	using EigenType_host = _EigenType;

	using ThisType = DualViewMap<EigenType_host, target>;
	using ViewMap_host   = ViewMap<EigenType_host, Target::host>;
	using ViewMap_target = ViewMap<EigenType_host, target>;
	using EigenType_target = typename ViewMap_target::EigenType_target;
	using Scalar = typename ViewMap_target::Scalar;

	using ViewType_host   = typename ViewMap_host  ::ViewType;
	using ViewType_target = typename ViewMap_target::ViewType;
	using ExecutionSpace_target = typename ViewMap_target::ExecutionSpace;

	using MapType_host   = typename ViewMap_host  ::MapType;
	using MapType_target = typename ViewMap_target::MapType;

public:

	/* constructors */
	DualViewMap(); // default, allocation only for fixed size types
	DualViewMap(Index size); // 1D types
	DualViewMap(Index rows, Index cols); // 2D types
	DualViewMap(
		EigenType_host& hostObj,
		DualViewCopyOnInit copyToTarget = CopyToTarget
	);  // existing Eigen objects


	/* "assign", "resize" and constructors can only be called from host */
	void assign( EigenType_host& hostObj );
	void resize(Index rows, Index cols);

	/* get some info about type and status */
	KOKKOS_FUNCTION bool isAlloc_host() const;
	KOKKOS_FUNCTION bool isAlloc_target() const;

	/* get ViewMaps */
	KOKKOS_FUNCTION ViewMap_host   get_host  () const;
	KOKKOS_FUNCTION ViewMap_target get_target() const;

	template<Target _target>
	KOKKOS_FUNCTION auto get() const
		-> std::conditional<_target == target, ViewMap_target, ViewMap_host>;

	/* get Kokkos::Views */
	KOKKOS_FUNCTION ViewType_host   view_host  () const;
	KOKKOS_FUNCTION ViewType_target view_target() const;

	template<Target _target>
	KOKKOS_FUNCTION auto view() const
		-> std::conditional<_target == target, ViewType_target, ViewType_host>;

	/* shortcut to view_target */
	KOKKOS_FUNCTION ViewType_target view() const;

	/* get Eigen::Maps */
	KOKKOS_FUNCTION MapType_host   map_host  () const;
	KOKKOS_FUNCTION MapType_target map_target() const;

	template<Target _target>
	KOKKOS_FUNCTION auto map() const
		-> std::conditional<_target == target, MapType_target, MapType_host>;

	/* shortcut to map_target */
	KOKKOS_FUNCTION MapType_target map() const;

	/* sizes */
	KOKKOS_FUNCTION Index rows() const;
	KOKKOS_FUNCTION Index cols() const;
	KOKKOS_FUNCTION Index size() const;

	/* copy */
	void copyToTarget(bool async = false);
	void copyToHost(bool async = false);
};

/* detection */
template<typename T>
inline constexpr bool is_DualViewMap_v = ...;


/* factory functions */

/* specify target, deduce EigenType */
template<Target target = DefaultTarget, typename EigenType>
DualViewMap<EigenType, target> dualViewMap(
	EigenType& eigenObj,
	DualViewCopyOnInit copyToTarget = CopyToTarget
);

/* specify EigenType, optionally specify target, fixed size */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap();

/* specify EigenType, optionally specify target, 1D */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap(Index vectorSize);

/* specify EigenType, optionally specify target, 2D */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap(Index rows, Index cols);

Other

IndexRange

IndexRange is a small, but expressive helper class intended to unify the different ways in which ranges of discrete items are described in Eigen and Kokkos: Eigen’s block operations use the first index and the number of items, while Kokkos (e.g. RangePolicy) describe ranges using begin and end indices.

Use it to define the work range (first argument) in a parallel dispatch, if your work range does not start at zero.

Expand IndexRange synopsis
/* empty structs used in IndexRange constructors */
struct LimitIsEnd {};
struct LimitIsSize {};

template<typename _Integer>
class IndexRange {
public:
	using Integer = _Integer;
	/* synonyms to match Kokkos */
	using index_type = Integer;
	using value_type = Integer;

	/* the actual data member */
	struct Aggregate {
		Integer start, size;
	} values;

public:
	/* getter */
	KOKKOS_FUNCTION Integer start() const;
	KOKKOS_FUNCTION Integer size () const;
	KOKKOS_FUNCTION Integer begin() const; // synonym for "start"
	KOKKOS_FUNCTION Integer count() const; // synonym for "size"
	KOKKOS_FUNCTION Integer end  () const;

	/* setter */
	KOKKOS_FUNCTION void start(Integer);
	KOKKOS_FUNCTION void size (Integer);
	KOKKOS_FUNCTION void begin(Integer); // synonym for "start"
	KOKKOS_FUNCTION void count(Integer); // synonym for "size"
	KOKKOS_FUNCTION void end  (Integer);

	template<typename LimitType = LimitIsSize>
	KOKKOS_FUNCTION void set(Integer start, Integer limit);

	/* constructors */
	KOKKOS_FUNCTION IndexRange() = default;
	KOKKOS_FUNCTION IndexRange(Integer size); // sets "start" to zero

	/* can be used with
	 * 1) size as second arg -> third arg can be LimitIsSize{} or omitted
	 * 2) end  as second arg -> third arg must be LimitIsEnd{} */
	template<typename LimitType = LimitIsSize>
	KOKKOS_FUNCTION IndexRange(Integer start, Integer limit, LimitType = {});

	/* conversion from other integral types */
	template<typename OtherInt, SFINAE...>
	KOKKOS_FUNCTION IndexRange( const IndexRange<OtherInt>& other );
};

EigenRange

EigenRange contains a data member, whose type is IndexRange on host, and a single Index on device. It provides functions for getting an Eigen::Block expression from an Eigen or (Dual)ViewMap object, whose extents conform to the data member, e.g., a contiguous range of rows/columns on host, or a single row/column on device. It is the base class of ParallelRange, as well as the functor parameter type in parallel_for_chunks and ParallelRange::for_each_chunk (see (Chunk) buffers).

Expand EigenRange synopsis
template<Target _target>
class EigenRange {
public:
	static constexpr Target target {_target};
	static constexpr bool isDevice {target == Target::device};
	static constexpr bool isHost   {target == Target::host};
	using MemberType = std::conditional_t<isHost, IndexRange<Index>, int>;

protected:
	MemberType m_rng;

public:
	KOKKOS_FUNCTION EigenRange() = default;

	KOKKOS_FUNCTION EigenRange( MemberType );

	KOKKOS_FUNCTION const MemberType& get() const;
	KOKKOS_FUNCTION       MemberType& get();

	KOKKOS_FUNCTION IndexRange<Index> asIndexRange() const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> colRange( EigenObj&& obj ) const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> rowRange( EigenObj&& obj ) const;

	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> range( EigenObj&& obj ) const;

	/* effectively the same as range(...) */
	template<typename EigenObj>
	KOKKOS_FUNCTION Eigen::Block<...> operator() ( EigenObj&& obj ) const;
};

Installation and Usage

Build script

The multi-backend nature of Kokkos and, by extension, Kokkidio, makes the process of configuring and building rather involved. To help with this, the build script build.sh is included. On first run, it will create a file in env/nodes, whose name is your machine’s node name, i.e., the output of uname -n, plus the ending .sh. On subsequent runs, this node file is read by build.sh. In the node file, the device architecture and programming model ("backend") must be specified by the user:

Expand node file example

See https://kokkos.org/kokkos-core-wiki/keywords.html#architectures for possible values of Kokkos_ARCH.

###########
# Required:
###########

# uncomment one of these
backend_default=cuda
# backend_default=hip
# backend_default=sycl
# backend_default=ompt # for using OpenMP target offloading

Kokkos_ARCH=Kokkos_ARCH_...


###########
# Optional:
###########

# You can set paths to Kokkos and Eigen. The script can download them for you,
# and by setting [Eigen|Kokkos]_SRC, you can control the destination.
Eigen_SRC=./lib/eigen
Kokkos_SRC=./lib/kokkos

# Similarly, [Eigen|Kokkos]_BUILD controls where they will be built
Eigen_BUILD=$Eigen_SRC/build
Kokkos_BUILD=$Kokkos_SRC/build

# The install path is set with [Eigen|Kokkos]_INST. If an existing installation
# is found there, then the download, build, and install steps are skipped.
# This is tested by the script by looking for the respective CMake package file.
Eigen_INST="$Eigen_SRC/install"
Kokkos_INST="$Kokkos_SRC/install"

# any other code you want to run before building comes here ...

You need a compiler that matches the chosen backend. See Kokkos: compiler versions for details.

The node file approach was chosen to accomodate HPC systems, where code is often run on a different machine to where it was configured/compiled, thus making autodetection unreliable. To allow for the same node file to be used across multiple machines, e.g., login/compute nodes, you can define conditions/patterns in env/node_patterns.sh, e.g.

if [[ $node_name == *"loginNode"* ]]; then
	node_name="myNodeFile" # will make build.sh read env/nodes/myNodeFile.sh
fi

If any additional code needs to be run before configuring building, e.g. loading modules or setting a specific compiler, that can be done in the node file as well.

After setting up the node file, run ./build.sh -h to see the available options. To download, compile, and install all components, you may use ./build.sh -cdi all.

Manual installation

This section assumes that you already have installed Kokkos and Eigen. Kokkidio needs to be able to find their respective CMake packages. If they cannot be found automatically, you can provide the environment or CMake variables Kokkos_ROOT, and Eigen_ROOT or Eigen3_ROOT. See CMake’s find_package doc for details on these paths.

Kokkidio follows the general way of configuring and installing a CMake project, so you could use build commands such as this:

cmake \
	-S . -B ./build --prefix ./install \
	-DCMAKE_CXX_EXTENSIONS=Off \
	-D Eigen_ROOT=/path/to/eigen \
	-D Kokkos_ROOT=/path/to/kokkos

cmake --install ./build

To build the examples (for tests, change examples to tests), you can then use

cmake \
	-S ./src/examples -B /build/examples \
	-DCMAKE_BUILD_TYPE=Release \
	-D Kokkidio_ROOT=./install

CMake integration

Kokkidio creates a CMake configuration file, so that it can be found with

find_package(Kokkidio)

If you didn’t install Kokkidio in a standard directory, then you need to provide the CMake or environment variable Kokkidio_ROOT. It should point to the installation directory, i.e. the one containing the directories include and lib.

Due to some trickery that Kokkidio has to apply to source files, you need to use the function kokkidio_configure_target on your CMake targets, e.g.

cmake_minimum_required(VERSION 3.21 FATAL_ERROR)

project(MyProject)

find_package(Kokkidio REQUIRED)

add_executable(myTarget "")

target_sources(myTarget PRIVATE
	myFile1.cpp
	myFile2.cpp
)

# *after* adding source files to your target, configure it with Kokkidio
kokkidio_configure_target(myTarget)

For reasons detailed in CPU-Vectorisation with CUDA backend, we recommend setting up separate CPU and device code files. These can #include the same unified code file, as detailed in that section. Then, you can declare CPU files in CMake using Kokkidio's function set_is_cpu, e.g.:

set_is_cpu(myFile1.cpp)

Notes

Building Kokkos for Kokkidio

To achieve CPU parallelism with Kokkidio, Kokkos should always be built with the OpenMP backend enabled, i.e. KOKKOS_ENABLE_OPENMP=ON.

CPU-Vectorisation with CUDA backend

By default, Eigen disables vectorisation, when __CUDACC__ is defined, i.e., when nvcc is used as the compiler for non-.cu-files. The single-source approach of Kokkos aims to eliminate the need for backend-specific files, thus combining Eigen and Kokkos lets this issue emerge.

The fix Kokkidio uses, is to separate CPU and non-CPU translation units, and then defining EIGEN_NO_CUDA for the CPU unit. For convenience, the CMake function set_is_cpu is provided for this purpose. Here’s how to do this in practice:

  • Put your source code into some file, and don’t specify it as a source file in CMake. We recommend the file ending .in for this. Let’s call it source.in

  • Create two additional files, one each for host and device compilation, respectively. Let’s call these source_host.cpp and source_device.cpp. Add these files to your CMake target, and add the line

    set_is_cpu(source_host.cpp)
  • #include the source file (source.in) in both of these files.

  • optionally, you can write your functions as templates of the Target parameter of all Kokkidio classes, and explicitly instantiate them in those files.

Here’s a full example:

source.in
#include <Kokkidio.hpp>

template<Target target>
float getSum( const ViewMap<ArrayXf, target>& vm ){
	float sum_global;
	auto func = KOKKOS_LAMBDA(ParallelRange<target> rng, float& sum_thread){
		sum_thread += rng(vm).sum();
	};
	parallel_reduce( vm.size(), func, redux::sum(sum_global) );
	return sum_global;
}
source_host.cpp
#include "source.in"
/* explicit instantiation for host */
template float getSum<Target::host>( const ViewMap<ArrayXf, Target::host>& );
source_device.cpp
#include "source.in"
/* explicit instantiation for device */
template float getSum<Target::device>( const ViewMap<ArrayXf, Target::device>& );
CMakeLists.txt
add_library(myTarget
	source_host.cpp
	source_device.cpp
)

set_is_cpu(source_host.cpp)

kokkidio_configure_target(myTarget)

CPU execution with SYCL

SYCL, at least in its oneAPI flavour, does not support parallel host execution on any non-Intel CPU. Therefore, Kokkidio defaults to redirecting parallel_for calls on Target::host to OpenMP. This behaviour is controlled through the preprocessor symbol KOKKIDIO_SYCL_DISABLE_ON_HOST.

The name Kokkidio is based on the assumptions that

  1. Kokkos refers to the Greek Κόκκος (engl.: grain, though possibly a play on kernel), and that

  2. Eigen refers to eigenvalues and eigenvectors.

The latter are ιδιοτιμή (idiotimí) and ιδιοδιάνυσμα (idiodiánysma) in Greek, from which the prefix ιδιο (idio) was taken (engl.: same, though it could also be from ίδιος = own, or self, which is the meaning of eigen in German). κοκκίδιο (kokkídio) could be seen as a portmanteau of Kokkos and idio, but is in fact the Greek word for granule, so not far off Kokkos itself.

The logo is a stretched/sheared map of a recolouration of the Kokkos logo, with the eigenvectors of that mapping drawn as arrows.

Licence

Kokkidio is maintained by the Chair of Water Resources Management and Modelling of Hydrosystems of the Technische Universität Berlin, or wahyd for short (Link). It is distributed under a GPLv3 (Licence text). Licence types for the libraries used in Kokkidio are listed in the LICENCE.README file.