Kokkidio is a header-only template library designed to provide interoperability between the linear algebra library Eigen and the performance portability framework Kokkos. Its aim is to allow the easy-to-read, succinct source code of Eigen to be compiled to fast machine code on all the platforms supported by Kokkos.
It consists of three interlocking elements:

- an iteration range class (`ParallelRange`), which allows threads to efficiently access their assigned elements,
- parallel dispatch functions (`parallel_for` and `parallel_reduce`) compatible with functors taking a `ParallelRange`, and
- the data structures `ViewMap` and `DualViewMap`, which combine an `Eigen::Map` and a `Kokkos::View`.
It allows you to write code like this:

ViewMap construction, copying data, and example computation (SAXPY) with Kokkidio (taken from example/axpy.cpp):

```cpp
using namespace Kokkidio;
float a {0.5};
int size {10};
using FloatArray = DualViewMap<Eigen::ArrayXf>;
/* You may have an existing Eigen object */
Eigen::ArrayXf x_existing {size};
/* No need to replace it. To make it accessible inside a Kokkos functor
* (and thus also on GPUs), you can simply wrap it in a (Dual)ViewMap: */
FloatArray x {x_existing};
/* Of course, you can also construct (Dual)ViewMaps from sizes */
FloatArray y {size}, z {size};
/* You can use Kokkos functions on (Dual)ViewMaps, because their members
 * "ViewMap::view()" and
 * "DualViewMap::view_<target>()"
 * return a Kokkos::View */
Kokkos::deep_copy( y.view_host(), 123 );
/* Likewise, you can use Eigen functions on (Dual)ViewMaps, as their members
 * "ViewMap::map()" and
 * "DualViewMap::map_<target>()"
 * return an Eigen::Map.
* Outside of a parallel dispatch, only the host side is accessible. */
x.map_host().setRandom();
y.map_host().setRandom();
/* Copying data between host and compute target is simple: */
x.copyToTarget(); // if the compute target is the host, this does nothing
y.copyToTarget();
/* This is how a parallel computation on the target is performed: */
parallel_for( size, KOKKOS_LAMBDA(ParallelRange<> rng){
    rng(z) = a * rng(x) + rng(y);
});
/* After the computation, you may copy the results back to the host */
z.copyToHost();
```
While Eigen offers an expressive syntax and extensive functionality for linear algebra operations, its abstractions complicate the process of adapting it to parallel execution on GPUs. Kokkidio was conceived as a way to streamline this process by providing a lightweight, header-only library that bridges the gap between Eigen and the portability framework Kokkos.
Eigen is a linear algebra library which allows writing such operations in a readable, succinct manner while compiling to highly optimised machine code. Instead of implementing matrix, vector, or coefficient-wise operations using loops, the 'Eigen way' is to operate on whole data structures using their overloaded operators or member functions. Here’s an example:
SAXPY in pure C++:

```cpp
std::size_t size {10};
float a {0.5};
/* note: parentheses, not braces -- braces would create single-element vectors */
std::vector<float> x(size), y(size), z(size);
/* fill vectors in some way ... */
/* then loop over them and perform the computation element-wise */
for (std::size_t i=0; i<size; ++i){
    z[i] = a * x[i] + y[i];
}
```
SAXPY in Eigen:

```cpp
Eigen::Index size {10};
float a {0.5};
Eigen::VectorXf x {size}, y {size}, z {size};
/* fill vectors in some way, e.g. using member func setRandom() ... */
/* The computation is expressed with the whole object: */
z = a * x + y;
```
In addition to the benefits of letting a well-tested library handle common operations, Eigen also performs extensive optimisations on the underlying loops. This notably includes explicit vectorisation for all common CPU vector extensions, such as the SSE and AVX families on x86, ARM NEON, and others.
Kokkos is a portability and parallelism framework which allows users to write code that compiles and runs on a wide range of hardware, in particular both CPUs and GPUs. It provides data structures (`Kokkos::View`) as well as functions for parallel dispatch, and may be viewed as a common abstraction over various programming models and backends, e.g. OpenMP (including target offloading), OpenACC, CUDA, HIP, and SYCL. Here’s an example:
SAXPY with Kokkos:

```cpp
float a {0.5};
std::size_t dim1 {10};
/* for more details, see
 * https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#constructing-a-view
 */
using View = Kokkos::View<float*, Kokkos::DefaultExecutionSpace>;
/* managed Views take a label plus their extents */
View x {"x", dim1}, y {"y", dim1}, z {"z", dim1};
/* fill arrays in some way,
 * e.g. using deep_copy or within a parallel dispatch ... */
/* and now do the computation in parallel */
Kokkos::parallel_for( dim1, KOKKOS_LAMBDA(std::size_t i){
    z(i) = a * x(i) + y(i);
});
```
While modern compilers already allow targeting multiple CPU architectures, since their ISAs (x86(-64), ARM, …) function as a reliable standard, GPU architectures and programming models are much less intercompatible. If you wrote your code using CUDA, but want to run it on an Intel Ponte Vecchio GPU, you will have to rewrite large parts of it.
Using Kokkos for new projects allows you to avoid that type of effort. For existing projects, it allows a step-by-step migration, because backend-specific code will still compile when using Kokkos on top of it.
A number of problems arise when trying to use Eigen’s functionality on GPUs (e.g. with Kokkos), which Kokkidio aims to address.

- Data structures: Since version 3.4, Eigen functions can be called from CUDA/HIP kernels, and its compatibility with SYCL is in a usable state since at least early 2024 and under active development. However, only fixed-size objects can be created on GPUs; dynamically-sized data structures are only available on and from the host. There is the `Eigen::Map` class, however, through which other data structures can be treated like Eigen objects, if they expose a data pointer and their memory layout and stride(s) are known (see the `Eigen::Map` sketch after this list).

- Loop abstraction and parallelism 1: In the Kokkos example above, you can see the functor passed to Kokkos' `parallel_for` (the `KOKKOS_LAMBDA` in that case) containing the instructions for an individual loop iteration. By contrast, the loop is fully abstracted in the Eigen example. There is no user-facing loop iteration variable in an Eigen operation that could be used to parallelise operations.

  Of course, that loop still exists, deep within the Eigen library. While it could potentially be parallelised there, the interface of Eigen is not set up to facilitate anything like this. Therefore, sweeping changes to the whole library would be necessary, with all the compatibility and testing issues that brings.

- Parallelism 2 (chunk sizes, vectorisation, and readability): When parallelising an Eigen operation as a library user, the ideal strategy for assigning work to threads differs considerably between CPUs and GPUs. On a GPU, the large number of relatively weak cores means that many small units of work perform better. By contrast, on a CPU, fewer, larger, and contiguous chunks per thread are preferable, to allow for vectorised operations.

Here’s an example to illustrate this difference. We’re not using SAXPY, but instead multiply matrix columns, so that individual threads still use Eigen functionality.
Multiply matrix columns (taken from colmult_eigen.cpp):

```cpp
int nRows {4}, nCols {1000};
Eigen::MatrixXd a {nRows, nCols}, b;
b.resizeLike(a);
/* fill matrices in some way ... */
double result; // let's sum up the results to not need another array
/* One could do a nested loop and manually implement the dot product.
 * We skip that here, because for that you wouldn't use Eigen */

/******************************************************************************/
/* OPTION 1, better on GPUs */
/******************************************************************************/
/* Distribute individual column-multiplications,
 * as one might do on a GPU, if nCols >> nRows */
result = 0;
for (int i=0; i<nCols; ++i){
    result += a.col(i).transpose() * b.col(i);
    /* this is equivalent: */
    // result += a.col(i).dot( b.col(i) );
}

/******************************************************************************/
/* OPTION 2, better on CPUs */
/******************************************************************************/
/* Distribute blocks of the matrices to threads and let Eigen
 * handle the loop over columns, as may be preferable on a CPU.
 * This can be a lot faster, as it allows Eigen to vectorise the operation. */
result = 0;
int nCores {4}; // just for illustration
int nColsPerCore {nCols / nCores}; // not handling remainders
for (int i=0; i<nCores; ++i){
    int firstCol {i * nColsPerCore};
    result += (
        a.middleCols(firstCol, nColsPerCore).transpose()
        * b.middleCols(firstCol, nColsPerCore)
    ).trace(); // trace = sum of the diagonal
}
```
Not only do the loop bodies differ quite significantly, but Eigen’s advantage of short and readable source code is also somewhat diminished when parallelising code in this fashion.
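As referenced in the first item above, here is a minimal `Eigen::Map` sketch (plain Eigen, no Kokkidio; the buffer and its sizes are made up) showing how an existing buffer can be treated like an Eigen object:

```cpp
std::vector<float> buf(12);
/* view the data as a column-major 3x4 matrix; no copy is made */
Eigen::Map<Eigen::MatrixXf> m { buf.data(), 3, 4 };
m.setZero(); // writes through to buf
```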
Kokkidio solves the first issue, Eigen’s data structures not being compatible with GPU programming models, by wrapping a `Kokkos::View` in a class called `ViewMap`. It provides access to the contained `Kokkos::View`, and additionally uses Eigen’s `Map` feature to allow the contained data to be treated like an Eigen object, even on GPUs. More detail is provided in the section Data structures.

To address the issues regarding parallelism, vectorisation, and readability, Kokkidio provides an iteration range class called `ParallelRange`, along with parallel dispatch functions compatible with it. These are detailed in the next section.
The class template `ParallelRange` fulfils two functions: firstly, given some total number or range of items to be processed, it contains the index or index range assigned to the executing thread; and secondly, when applied to an Eigen or (Dual)ViewMap object, it returns the data at that index or index range as an `Eigen::Block` expression.

Its template parameter `target` can take either of the two values of the `Target` enumeration, `host` (CPU) or `device` (e.g., GPU):

- When `target == device`, `ParallelRange` stores a single index. Applying it to an Eigen or (Dual)ViewMap object returns either a single element, if the object is one-dimensional, or a column expression, if the object is two-dimensional. By default, Eigen objects are column-major, which is the reason behind this choice.
- When `target == host`, `ParallelRange` stores a starting index and a number of elements. Applying it to an Eigen or (Dual)ViewMap object then returns a contiguous block of elements, using `Eigen::segment` on 1D objects, and `Eigen::middleCols` on 2D objects.

(Ranges of) rows instead of columns are also available, but require the explicit use of a member function (`ParallelRange::rowRange`) rather than `ParallelRange::operator()`.
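As a brief sketch of this behaviour (sizes are made up, and the same functor source works for both targets):

```cpp
using namespace Kokkidio;
int nRows {4}, nCols {1000};
DualViewMap<Eigen::MatrixXd> m {nRows, nCols};
parallel_for( nCols, KOKKOS_LAMBDA(ParallelRange<> rng){
    /* on device, rng(m) is a single column of m;
     * on host, it is a contiguous block of columns (Eigen::middleCols) */
    rng(m) *= 2;
});
/* for rows, dispatch over the row count and use the member function: */
parallel_for( nRows, KOKKOS_LAMBDA(ParallelRange<> rng){
    rng.rowRange(m) *= 2;
});
```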
Synopsis of `ParallelRange`:

```cpp
template<Target _target = DefaultTarget>
class ParallelRange : public EigenRange<_target> {
public:
    static constexpr Target target {_target};
    using Base = EigenRange<target>;
    static constexpr bool
        isDevice {target == Target::device},
        isHost   {target == Target::host};
    using MemberType    = std::conditional_t<isHost, IndexRange<Index>, int>;
    using ChunkType     = EigenRange<target>;
    using ChunkInfoType = ChunkInfo<target>;
private:
    MemberType    m_rng;
    ChunkInfoType m_chunks;
public:
    KOKKOS_FUNCTION ParallelRange() = default;
    /* ParallelRange can be instantiated with:
     * - an integer,
     * - a Kokkidio::IndexRange, or
     * - a Kokkos::RangePolicy.
     */
    template<typename Policy>
    KOKKOS_FUNCTION ParallelRange( const Policy& );
    /* inherited from EigenRange: */
    KOKKOS_FUNCTION const MemberType& get() const;
    KOKKOS_FUNCTION MemberType& get();
    KOKKOS_FUNCTION IndexRange<Index> asIndexRange() const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> colRange( EigenObj&& obj ) const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> rowRange( EigenObj&& obj ) const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> range( EigenObj&& obj ) const;
    /* effectively the same as range(...) */
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> operator() ( EigenObj&& obj ) const;
    /* specific to ParallelRange */
    template<typename Func>
    KOKKOS_FUNCTION void for_each( Func&& func ) const;
    template<typename Func>
    KOKKOS_FUNCTION void for_each_chunk( Func&& func ) const;
    KOKKOS_FUNCTION ChunkType make_chunk(Index i) const;
    KOKKOS_FUNCTION const ChunkInfo<target>& chunkInfo() const;
    KOKKOS_FUNCTION inline constexpr Index chunkSize() const;
    KOKKOS_FUNCTION inline constexpr Index nChunks () const;
    KOKKOS_FUNCTION void setChunks(Index chunkSizeMax = chunk::defaultSize);
};

/* if you wish to call another function taking Eigen objects,
 * and wish to apply a range to each of the arguments, you can write
 * apply_range(someFunc, range, someFunc_arg1, someFunc_arg2, ...); */
template<typename Func, Target t, typename ... Ts>
void apply_range(Func&&, const ParallelRange<t>&, Ts&& ... args);
```
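A hedged sketch of `apply_range`, based on the comment in the synopsis above; it reuses `size` and the (Dual)ViewMaps `x`, `y`, `z` from the SAXPY example, and the functor name is made up:

```cpp
float a {0.5};
/* a functor taking Eigen objects */
auto axpy = KOKKOS_LAMBDA(const auto& xi, const auto& yi, auto&& zi){
    zi = a * xi + yi;
};
parallel_for( size, KOKKOS_LAMBDA(ParallelRange<> rng){
    /* equivalent to: axpy( rng(x), rng(y), rng(z) ); */
    apply_range( axpy, rng, x, y, z );
});
```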
Kokkidio provides drop-in replacements for Kokkos' parallel dispatch functions:

- `parallel_for`, for general tasks, and
- `parallel_reduce`, for reductions.

The main difference to their Kokkos equivalents is that they allow passing a functor which takes a `ParallelRange` as its (first) argument, e.g.:

```cpp
parallel_for(someSizeOrPolicy, KOKKOS_LAMBDA(ParallelRange<target> rng){
    /* do something with rng ... */
});
```
On `device` (e.g., GPU), this chains to `Kokkos::parallel_[for|reduce]`, and constructs a `ParallelRange<device>` from a single element index. On `host` (CPU), it calls a Kokkidio-specific function which emulates OpenMP logic for distributing work items evenly to threads. The index range of work items consists of a start index and a number of items, from which a `ParallelRange<host>` is created. This, when applied to an Eigen or (Dual)ViewMap object, returns a contiguous `Eigen::Block` of data corresponding to the index range.

If a functor is provided that does not take a `ParallelRange` as its first parameter, Kokkidio's parallel dispatch functions simply forward to their Kokkos equivalents (except for CPU execution with SYCL). The first parameter (`someSizeOrPolicy` in the example above) can be an integer number, a `Kokkos::RangePolicy`, or a `Kokkidio::IndexRange`.

A variant of these is `parallel_[for|reduce]_chunks`; see section (Chunk) buffers for details.
`parallel_for` (shortened from the SAXPY example in the overview):

```cpp
using namespace Kokkidio;
float a {0.5};
int size {1000};
/* create and set the input (x, y) and output (z) arrays */
using FloatArray = DualViewMap<Eigen::ArrayXf>;
FloatArray x {size}, y {size}, z {size};
x.map_host().setRandom();
y.map_host().setRandom();
x.copyToTarget();
y.copyToTarget();
/* perform parallel computation (1D) */
parallel_for( size, KOKKOS_LAMBDA(ParallelRange<> rng){
    rng(z) = a * rng(x) + rng(y);
});
/* Copy the results back to host */
z.copyToHost();
```
`parallel_reduce` (taken from colmult.cpp, see also Multiply matrix columns):

```cpp
using namespace Kokkidio;
int nRows {4}, nCols {1000};
/* create and set the input matrices */
using MatrixView = DualViewMap<Eigen::MatrixXd>;
MatrixView a {nRows, nCols}, b;
b.resizeLike(a);
a.map_host().setRandom();
b.map_host().setRandom();
a.copyToTarget();
b.copyToTarget();
double result = 0;
/* perform parallel computation and reduction (2D -> column range) */
parallel_reduce(
    a.cols(),
    KOKKOS_LAMBDA(ParallelRange<> rng, double& sum){
        sum += ( rng(a).transpose() * rng(b) ).trace(); // trace = sum of the diagonal
        /* equivalent: sum of coefficient-wise products */
        // sum += ( rng(a).array() * rng(b).array() ).sum();
    },
    redux::sum(result)
);
```
When intermediate results are needed in a computational loop, their implementation in (scalar) C++ is straightforward, as one would simply define a stack variable:
```cpp
std::vector<double> arr(size);
for (double& a : arr){
    /* let's say we need the result of some expensive function multiple times */
    double helperVar = std::pow(a, 0.3);
    a = helperVar + 1 / helperVar;
}
```
However, Eigen offers no obvious way to define intermediate results as stack variables, as the loop is abstracted. Instead, we could use an array buffer:
```cpp
Eigen::ArrayXd arr {size}, buf {size};
buf = arr.pow(0.3);
arr = buf + 1 / buf;
```
This may not be ideal, as a whole array is allocated even though only a couple of values are needed at a time, and additional memory accesses are introduced.
Kokkidio offers a `ChunkBuffer` structure, which alleviates this issue within a parallel dispatch.

The `ChunkBuffer` requires an Eigen type (`ColType`) as its first template parameter. Here, users provide the fixed-size type closest to what they’d need in a single loop iteration, e.g., for the example above, an `Eigen::Array<double, 1, 1>` (a `double` array with a single element). This allows users to define whether they require `Eigen::Array` or `Eigen::Matrix` behaviour. The second template parameter is the `Target` enumeration.

On `host`, a `ChunkBuffer` contains one array per thread (using `omp_get_max_threads`), with the same number of rows as `ColType`, and the chunk size of the parallel loop as the number of columns. A reasonable default is chosen for the chunk size, but it can also optionally be specified via a `Kokkos::RangePolicy`. This improves the likelihood of the buffer data remaining in cache.

On `device`, a `ChunkBuffer` contains no data. Instead, it only wraps `ColType`. When accessing this buffer inside a parallel dispatch, the type gets instantiated, and because fixed-size types are allocated on the stack, this is equivalent to defining stack variables in a C-style loop.
Usage:

```cpp
/* "target" is a Target enumeration value, e.g. a surrounding template parameter */
ViewMap<Eigen::ArrayXd> arr {size};
/* number of intermediate values = number of rows */
constexpr int numVals {1};
using Array1d = Eigen::Array<double, numVals, 1>;
/* create buffer */
auto chunkBuf = makeBuffer<Array1d, target>(size);
parallel_for_chunks<target>(size, KOKKOS_LAMBDA(EigenRange<target> chunk){
    /* access buffer inside parallel dispatch */
    auto buf = getBuffer(chunkBuf, chunk);
    /* use the EigenRange "chunk" like a ParallelRange */
    buf = chunk(arr).pow(0.3);
    chunk(arr) = buf + 1 / buf;
});
```
The dispatch function `parallel_for_chunks` is a shortcut for the following:

```cpp
parallel_for<target>(size, KOKKOS_LAMBDA(ParallelRange<target> rng){
    rng.for_each_chunk( [&](auto chunk){
        /* same body */
    });
});
```

See `EigenRange` for the data type of the `chunk` parameter.
At the core of the `ViewMap` class (see file) are the two member functions `map()` and `view()`, which return an `Eigen::Map` and a `Kokkos::View`, respectively, and thus allow it to be used in either library’s functions.

`ViewMap` takes two template parameters:

- `EigenType`: the Eigen class to be used as the map type, e.g. `Eigen::MatrixXd` or `Eigen::Array3i`. The return type of `map()` behaves the same way as this type. Only dense types are currently supported.
- A `Target` enumeration value, which can be either `host` or `device`. This parameter is optional. Its default value matches `Kokkos::DefaultExecutionSpace`.

`ViewMap` can be instantiated either using an existing Eigen object, or using the same size parameters as you would for the Eigen type. Here’s what happens when you create a `ViewMap`:
- With an existing Eigen object:
  - Instantiation on `Target::host`: no allocation is performed. An unmanaged `Kokkos::View` is created, using the existing object’s data pointer and sizes.
  - Instantiation on `Target::device`: the Eigen object’s sizes are used to create a matching managed `Kokkos::View` on the device.
- With size parameters: a managed `Kokkos::View` is created using these sizes on `Target`. The same size parameters are allowed as for the respective Eigen type. This means creating vector types (1D) requires only a single size parameter, and fixed-size types can be created without any.
In all of the above cases, the data pointers of `view()` and `map()` contain the same address. Furthermore, when instantiating a `ViewMap` with a non-const, owning Eigen object (i.e. not itself an `Eigen::Map`), a non-owning pointer to the object is stored, to allow resizing both the `Kokkos::View` and the Eigen object via `ViewMap::resize()`.
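A minimal sketch of that resizing behaviour (sizes are made up):

```cpp
Eigen::ArrayXXd arr {10, 20};
auto vm = viewMap(arr); // arr is owning and non-const, so a pointer to it is stored
vm.resize(10, 40);      // resizes both the Kokkos::View and arr itself
/* arr.cols() == 40 now */
```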
`ViewMap` examples (taken from examples/ViewMap.cpp):

```cpp
using namespace Kokkidio;
int nRows {10}, nCols {20};
/* existing Eigen object */
Eigen::ArrayXXd eigenArray {nRows, nCols};
/* Create ViewMap using a constructor or factory function.
 * Deduces Eigen type, and uses default target */
ViewMap mv1 {eigenArray};
auto mv2 = viewMap(eigenArray);
/* Create ViewMap using factory function for specific target,
 * while deducing Eigen type */
auto mv3 = viewMap<Target::host>(eigenArray);
/* Create ViewMap using size parameters.
 * ArrayXXd is dynamically sized in both dimensions,
 * so two parameters are required */
ViewMap<Eigen::ArrayXXd> mv4 {nRows, nCols};
/* ArrayXd is a column vector, so only rows are required */
ViewMap<Eigen::ArrayXd> mv5 {nRows};
/* Array3d is a fixed size type, so no parameters are required */
ViewMap<Eigen::Array3d> mv6;
/* set values on host, using Eigen's assignment operator on ViewMap::map() */
mv1.map() = 1;
/* set values on target, using Kokkos::deep_copy with ViewMap::view() */
Kokkos::deep_copy(mv2.view(), 2);
/* set values on target with parallel dispatch: */
/* with Kokkidio::ParallelRange */
parallel_for( mv3.cols(), KOKKOS_LAMBDA(ParallelRange<> rng){
    rng(mv3) = 3;
});
/* or just an integer, using the standard Kokkos-style */
parallel_for( mv4.size(), KOKKOS_LAMBDA(int i){
    mv4.data()[i] = 4;
});
```
Synopsis of `ViewMap`:

```cpp
template<typename _EigenType, Target targetArg = DefaultTarget>
class ViewMap {
public:
    static constexpr Target target { ExecutionTarget<targetArg> };
    using EigenType_host = _EigenType;
    /* EigenType_host and EigenType_target may differ in const-ness */
    using EigenType_target = std::conditional_t<target == Target::host,
        EigenType_host,
        std::remove_const_t<EigenType_host>
    >;
    using ThisType = ViewMap<EigenType_target, target>;
    using Scalar   = typename EigenType_target::Scalar;
    using MapType  = Eigen::Map<EigenType_host>;
    /* only types with a contiguous memory layout are currently supported */
    static_assert( is_contiguous<EigenType_target>() );
    /* Translations of "target" into Kokkos spaces */
    using MemorySpace    = Kokkidio::MemorySpace   <target>;
    using ExecutionSpace = Kokkidio::ExecutionSpace<target>;
    /* The Kokkos::View data type is either fully dynamic or fully fixed-size,
     * i.e. Scalar** or Scalar[nRows][nCols],
     * and always uses LayoutLeft */
    using ViewType   = Kokkos::View<..., Kokkos::LayoutLeft, MemorySpace>;
    using HostMirror = typename ViewType::HostMirror;
public:
    /* constructors */
    ViewMap(); // default, allocation only for fixed size types
    ViewMap(Index size); // 1D types
    ViewMap(Index rows, Index cols); // 2D types
    ViewMap( _EigenType& hostObj ); // existing Eigen objects
    /* "resize" and constructors can only be called from host */
    void resize(Index rows, Index cols);
    /* get some info about type and status */
    KOKKOS_FUNCTION constexpr bool isManaged() const;
    KOKKOS_FUNCTION bool isAlloc() const;
    /* data pointer */
    KOKKOS_FUNCTION Scalar* data();
    KOKKOS_FUNCTION const Scalar* data() const;
    /* get Eigen::Map */
    KOKKOS_FUNCTION MapType map() const;
    /* and Kokkos::View */
    KOKKOS_FUNCTION ViewType view() const;
    /* sizes */
    KOKKOS_FUNCTION Index rows() const;
    KOKKOS_FUNCTION Index cols() const;
    KOKKOS_FUNCTION Index size() const;
};

/* detection */
template<typename T>
inline constexpr bool is_ViewMap_v = ...;

/* factory functions */
/* specify target, deduce EigenType */
template<Target target = DefaultTarget, typename EigenType>
ViewMap<EigenType, target> viewMap( EigenType& eigenObj );
/* specify EigenType, optionally specify target, fixed size */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap();
/* specify EigenType, optionally specify target, 1D */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap(Index vectorSize);
/* specify EigenType, optionally specify target, 2D */
template<typename EigenType, Target target = DefaultTarget>
ViewMap<EigenType, target> viewMap(Index rows, Index cols);
```
`DualViewMap` (see file) is designed to facilitate easy data exchange between host and the compute `Target`. To this end, it provides the member functions `copyToTarget()` and `copyToHost()`. It takes the same template parameters as `ViewMap`, i.e. an Eigen type and a `Target` value.

While a `ViewMap` only exists on either `host` or `device`, a `DualViewMap` always consists of two `ViewMap`s, of which one is located on `host`, and the other on the specified `Target`. If the `Target` is also `host`, then the two views are identical, and `copyTo…()` operations are correspondingly skipped.

To access the `ViewMap`s, it provides the member functions `get_host()` and `get_target()`, as well as shortcuts to their `map()`/`view()` member functions in the form of `map_host()`/`map_target()` and `view_host()`/`view_target()`. Similar to `ViewMap`, it also allows resizing its data via `resize()`, and does so on both `host` and the specified `Target`.
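A short sketch of these accessors (the size is made up; see the examples below for more):

```cpp
using namespace Kokkidio;
DualViewMap<Eigen::ArrayXd> d {100};
auto vm_host   = d.get_host();    // ViewMap on the host
auto vm_target = d.get_target();  // ViewMap on the compute target
d.map_host().setRandom();         // Eigen::Map of the host data
auto v = d.view_target();         // Kokkos::View of the target data
d.copyToTarget();                 // no-op if the target is the host
```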
`DualViewMap` examples (taken from examples/DualViewMap.cpp):

```cpp
using namespace Kokkidio;
int nRows {10}, nCols {20};
/* existing Eigen object */
Eigen::ArrayXXd eigenArray {nRows, nCols};
/* By default, when initialising with an Eigen object,
 * the object's data is copied to the target.
 * This behaviour can be changed with an optional parameter: DontCopyToTarget */
DualViewMap d1 {eigenArray};
auto d2 = dualViewMap(eigenArray, DontCopyToTarget);
/* Otherwise, a DualViewMap can be created in exactly the same ways as a
 * ViewMap, so please refer to ViewMap.cpp for more examples. */
/* with DualViewMap, you can set your values on host,
 * then copy them to the target: */
d2.map_host() = 123;
d2.copyToTarget();
auto print = [&](std::string_view descriptor){
    std::cout
        << "d2, values on host, " << descriptor << ":\n"
        << d2.map_host() << '\n';
};
print("before");
/* Now you can do some computations on the target,
 * then copy the values back */
parallel_for(d2.cols(), KOKKOS_LAMBDA(ParallelRange<> rng){
    rng(d2) += 1;
});
d2.copyToHost();
print("after");
```
Synopsis of `DualViewMap`:

```cpp
template<typename _EigenType, Target targetArg = DefaultTarget>
class DualViewMap {
public:
    static constexpr Target target { ExecutionTarget<targetArg> };
    using EigenType_host   = _EigenType;
    using ThisType         = DualViewMap<EigenType_host, target>;
    using ViewMap_host     = ViewMap<EigenType_host, Target::host>;
    using ViewMap_target   = ViewMap<EigenType_host, target>;
    using EigenType_target = typename ViewMap_target::EigenType_target;
    using Scalar           = typename ViewMap_target::Scalar;
    using ViewType_host    = typename ViewMap_host  ::ViewType;
    using ViewType_target  = typename ViewMap_target::ViewType;
    using ExecutionSpace_target = typename ViewMap_target::ExecutionSpace;
    using MapType_host     = typename ViewMap_host  ::MapType;
    using MapType_target   = typename ViewMap_target::MapType;
public:
    /* constructors */
    DualViewMap(); // default, allocation only for fixed size types
    DualViewMap(Index size); // 1D types
    DualViewMap(Index rows, Index cols); // 2D types
    DualViewMap(
        EigenType_host& hostObj,
        DualViewCopyOnInit copyToTarget = CopyToTarget
    ); // existing Eigen objects
    /* "assign", "resize" and constructors can only be called from host */
    void assign( EigenType_host& hostObj );
    void resize(Index rows, Index cols);
    /* get some info about type and status */
    KOKKOS_FUNCTION bool isAlloc_host() const;
    KOKKOS_FUNCTION bool isAlloc_target() const;
    /* get ViewMaps */
    KOKKOS_FUNCTION ViewMap_host   get_host  () const;
    KOKKOS_FUNCTION ViewMap_target get_target() const;
    template<Target _target>
    KOKKOS_FUNCTION auto get() const
        -> std::conditional_t<_target == target, ViewMap_target, ViewMap_host>;
    /* get Kokkos::Views */
    KOKKOS_FUNCTION ViewType_host   view_host  () const;
    KOKKOS_FUNCTION ViewType_target view_target() const;
    template<Target _target>
    KOKKOS_FUNCTION auto view() const
        -> std::conditional_t<_target == target, ViewType_target, ViewType_host>;
    /* shortcut to view_target */
    KOKKOS_FUNCTION ViewType_target view() const;
    /* get Eigen::Maps */
    KOKKOS_FUNCTION MapType_host   map_host  () const;
    KOKKOS_FUNCTION MapType_target map_target() const;
    template<Target _target>
    KOKKOS_FUNCTION auto map() const
        -> std::conditional_t<_target == target, MapType_target, MapType_host>;
    /* shortcut to map_target */
    KOKKOS_FUNCTION MapType_target map() const;
    /* sizes */
    KOKKOS_FUNCTION Index rows() const;
    KOKKOS_FUNCTION Index cols() const;
    KOKKOS_FUNCTION Index size() const;
    /* copy */
    void copyToTarget(bool async = false);
    void copyToHost  (bool async = false);
};

/* detection */
template<typename T>
inline constexpr bool is_DualViewMap_v = ...;

/* factory functions */
/* specify target, deduce EigenType */
template<Target target = DefaultTarget, typename EigenType>
DualViewMap<EigenType, target> dualViewMap(
    EigenType& eigenObj,
    DualViewCopyOnInit copyToTarget = CopyToTarget
);
/* specify EigenType, optionally specify target, fixed size */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap();
/* specify EigenType, optionally specify target, 1D */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap(Index vectorSize);
/* specify EigenType, optionally specify target, 2D */
template<typename EigenType, Target target = DefaultTarget>
DualViewMap<EigenType, target> dualViewMap(Index rows, Index cols);
```
`IndexRange` is a small but expressive helper class intended to unify the different ways in which ranges of discrete items are described in Eigen and Kokkos: Eigen’s block operations use the first index and the number of items, while Kokkos (e.g. `RangePolicy`) describes ranges using `begin` and `end` indices.

Use it to define the work range (first argument) in a parallel dispatch, if your work range does not start at zero.
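A brief sketch, based on the synopsis below: the same range of ten items starting at index 5, constructed both ways:

```cpp
using namespace Kokkidio;
IndexRange<int> r1 {5, 10};               // start + size; LimitIsSize is the default
IndexRange<int> r2 {5, 15, LimitIsEnd{}}; // start + end
/* either can serve as the work range of a parallel dispatch: */
parallel_for( r1, KOKKOS_LAMBDA(ParallelRange<> rng){ /* ... */ });
```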
Synopsis of `IndexRange`:

```cpp
/* empty structs used in IndexRange constructors */
struct LimitIsEnd {};
struct LimitIsSize {};

template<typename _Integer>
class IndexRange {
public:
    using Integer = _Integer;
    /* synonyms to match Kokkos */
    using index_type = Integer;
    using value_type = Integer;
    /* the actual data member */
    struct Aggregate {
        Integer start, size;
    } values;
public:
    /* getters */
    KOKKOS_FUNCTION Integer start() const;
    KOKKOS_FUNCTION Integer size () const;
    KOKKOS_FUNCTION Integer begin() const; // synonym for "start"
    KOKKOS_FUNCTION Integer count() const; // synonym for "size"
    KOKKOS_FUNCTION Integer end  () const;
    /* setters */
    KOKKOS_FUNCTION void start(Integer);
    KOKKOS_FUNCTION void size (Integer);
    KOKKOS_FUNCTION void begin(Integer); // synonym for "start"
    KOKKOS_FUNCTION void count(Integer); // synonym for "size"
    KOKKOS_FUNCTION void end  (Integer);
    template<typename LimitType = LimitIsSize>
    KOKKOS_FUNCTION void set(Integer start, Integer limit);
    /* constructors */
    KOKKOS_FUNCTION IndexRange() = default;
    KOKKOS_FUNCTION IndexRange(Integer size); // sets "start" to zero
    /* can be used with
     * 1) size as second arg -> third arg can be LimitIsSize{} or omitted
     * 2) end  as second arg -> third arg must be LimitIsEnd{} */
    template<typename LimitType = LimitIsSize>
    KOKKOS_FUNCTION IndexRange(Integer start, Integer limit, LimitType = {});
    /* conversion from other integral types */
    template<typename OtherInt, SFINAE...>
    KOKKOS_FUNCTION IndexRange( const IndexRange<OtherInt>& other );
};
```
`EigenRange` contains a data member whose type is an `IndexRange` on `host`, and a single `Index` on `device`. It provides functions for getting an `Eigen::Block` expression from an Eigen or (Dual)ViewMap object, whose extents conform to the data member, e.g., a contiguous range of rows/columns on `host`, or a single row/column on `device`. It is the base class of `ParallelRange`, as well as the functor parameter type in `parallel_for_chunks` and `ParallelRange::for_each_chunk` (see (Chunk) buffers).
Synopsis of `EigenRange`:

```cpp
template<Target _target>
class EigenRange {
public:
    static constexpr Target target {_target};
    static constexpr bool isDevice {target == Target::device};
    static constexpr bool isHost   {target == Target::host};
    using MemberType = std::conditional_t<isHost, IndexRange<Index>, int>;
protected:
    MemberType m_rng;
public:
    KOKKOS_FUNCTION EigenRange() = default;
    KOKKOS_FUNCTION EigenRange( MemberType );
    KOKKOS_FUNCTION const MemberType& get() const;
    KOKKOS_FUNCTION MemberType& get();
    KOKKOS_FUNCTION IndexRange<Index> asIndexRange() const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> colRange( EigenObj&& obj ) const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> rowRange( EigenObj&& obj ) const;
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> range( EigenObj&& obj ) const;
    /* effectively the same as range(...) */
    template<typename EigenObj>
    KOKKOS_FUNCTION Eigen::Block<...> operator() ( EigenObj&& obj ) const;
};
```
The multi-backend nature of Kokkos and, by extension, Kokkidio, makes the process of configuring and building rather involved. To help with this, the build script `build.sh` is included. On first run, it will create a file in `env/nodes`, whose name is your machine’s node name, i.e., the output of `uname -n`, plus the ending `.sh`. On subsequent runs, this node file is read by `build.sh`. In the node file, the device architecture and programming model ("backend") must be specified by the user:
Node file example (see https://kokkos.org/kokkos-core-wiki/keywords.html#architectures for possible values of `Kokkos_ARCH`):

```bash
###########
# Required:
###########
# uncomment one of these
backend_default=cuda
# backend_default=hip
# backend_default=sycl
# backend_default=ompt # for using OpenMP target offloading
Kokkos_ARCH=Kokkos_ARCH_...

###########
# Optional:
###########
# You can set paths to Kokkos and Eigen. The script can download them for you,
# and by setting [Eigen|Kokkos]_SRC, you can control the destination.
Eigen_SRC=./lib/eigen
Kokkos_SRC=./lib/kokkos
# Similarly, [Eigen|Kokkos]_BUILD controls where they will be built
Eigen_BUILD=$Eigen_SRC/build
Kokkos_BUILD=$Kokkos_SRC/build
# The install path is set with [Eigen|Kokkos]_INST. If an existing installation
# is found there, then the download, build, and install steps are skipped.
# This is tested by the script by looking for the respective CMake package file.
Eigen_INST="$Eigen_SRC/install"
Kokkos_INST="$Kokkos_SRC/install"
# any other code you want to run before building comes here ...
```
You need a compiler that matches the chosen backend. See Kokkos: compiler versions for details.
The node file approach was chosen to accommodate HPC systems, where code is often run on a different machine than the one where it was configured/compiled, making autodetection unreliable. To allow the same node file to be used across multiple machines, e.g., login/compute nodes, you can define conditions/patterns in env/node_patterns.sh, e.g.

```bash
if [[ $node_name == *"loginNode"* ]]; then
    node_name="myNodeFile" # will make build.sh read env/nodes/myNodeFile.sh
fi
```

If any additional code needs to be run before configuring and building, e.g. loading modules or setting a specific compiler, that can be done in the node file as well.
After setting up the node file, run `./build.sh -h` to see the available options. To download, compile, and install all components, you may use `./build.sh -cdi all`.
This section assumes that you have already installed Kokkos and Eigen. Kokkidio needs to be able to find their respective CMake packages. If they cannot be found automatically, you can provide the environment or CMake variables `Kokkos_ROOT`, and `Eigen_ROOT` or `Eigen3_ROOT`. See CMake’s `find_package` documentation for details on these paths.
Kokkidio follows the general way of configuring and installing a CMake project, so you could use build commands such as this:

```bash
cmake \
    -S . -B ./build --install-prefix "$PWD/install" \
    -DCMAKE_CXX_EXTENSIONS=Off \
    -D Eigen_ROOT=/path/to/eigen \
    -D Kokkos_ROOT=/path/to/kokkos
cmake --install ./build
```
To build the examples (for tests, change `examples` to `tests`), you can then use

```bash
cmake \
    -S ./src/examples -B ./build/examples \
    -DCMAKE_BUILD_TYPE=Release \
    -D Kokkidio_ROOT=./install
cmake --build ./build/examples
```
Kokkidio creates a CMake configuration file, so that it can be found with

```cmake
find_package(Kokkidio)
```

If you didn’t install Kokkidio in a standard directory, then you need to provide the CMake or environment variable `Kokkidio_ROOT`. It should point to the installation directory, i.e. the one containing the directories `include` and `lib`.
Due to some trickery that Kokkidio has to apply to source files, you need to use the function `kokkidio_configure_target` on your CMake targets, e.g.

```cmake
cmake_minimum_required(VERSION 3.21 FATAL_ERROR)
project(MyProject)

find_package(Kokkidio REQUIRED)

add_executable(myTarget "")
target_sources(myTarget PRIVATE
    myFile1.cpp
    myFile2.cpp
)
# *after* adding source files to your target, configure it with Kokkidio
kokkidio_configure_target(myTarget)
```
For reasons detailed in CPU-Vectorisation with CUDA backend, we recommend setting up separate CPU and device code files. These can `#include` the same unified code file, as detailed in that section. Then, you can declare CPU files in CMake using Kokkidio's function `set_is_cpu`, e.g.:

```cmake
set_is_cpu(myFile1.cpp)
```
To achieve CPU parallelism with Kokkidio, Kokkos should always be built with the OpenMP backend enabled, i.e. `Kokkos_ENABLE_OPENMP=ON`.
By default, Eigen disables vectorisation when `__CUDACC__` is defined, i.e., when `nvcc` is used as the compiler, including for non-`.cu` files. The single-source approach of Kokkos aims to eliminate the need for backend-specific files, so combining Eigen and Kokkos makes this issue surface. The fix Kokkidio uses is to separate CPU and non-CPU translation units, and to define `EIGEN_NO_CUDA` for the CPU units. For convenience, the CMake function `set_is_cpu` is provided for this purpose. Here’s how to do this in practice:
- Put your source code into some file, and don’t specify it as a source file in CMake. We recommend the file ending `.in` for this. Let’s call it `source.in`.
- Create two additional files, one each for host and device compilation. Let’s call these `source_host.cpp` and `source_device.cpp`. Add these files to your CMake target, and add the line `set_is_cpu(source_host.cpp)`.
- `#include` the source file (`source.in`) in both of these files.
- Optionally, you can write your functions as templates of the `Target` parameter of all Kokkidio classes, and explicitly instantiate them in those files.
Here’s a full example:

source.in:

```cpp
#include <Kokkidio.hpp>

using namespace Kokkidio;
using Eigen::ArrayXf;

template<Target target>
float getSum( const ViewMap<ArrayXf, target>& vm ){
    float sum_global {};
    auto func = KOKKOS_LAMBDA(ParallelRange<target> rng, float& sum_thread){
        sum_thread += rng(vm).sum();
    };
    parallel_reduce( vm.size(), func, redux::sum(sum_global) );
    return sum_global;
}
```

source_host.cpp:

```cpp
#include "source.in"

/* explicit instantiation for host */
template float getSum<Target::host>( const ViewMap<ArrayXf, Target::host>& );
```

source_device.cpp:

```cpp
#include "source.in"

/* explicit instantiation for device */
template float getSum<Target::device>( const ViewMap<ArrayXf, Target::device>& );
```

CMakeLists.txt:

```cmake
add_library(myTarget
    source_host.cpp
    source_device.cpp
)
set_is_cpu(source_host.cpp)
kokkidio_configure_target(myTarget)
```
SYCL, at least in its oneAPI flavour, does not support parallel host execution on any non-Intel CPU. Therefore, Kokkidio defaults to redirecting `parallel_for` calls on `Target::host` to OpenMP. This behaviour is controlled through the preprocessor symbol `KOKKIDIO_SYCL_DISABLE_ON_HOST`.
The name Kokkidio is based on the assumptions that Kokkos is named after the Greek word κόκκος (grain), and that Eigen alludes to eigenvalues and eigenvectors. The latter are ιδιοτιμή (idiotimí) and ιδιοδιάνυσμα (idiodiánysma) in Greek, from which the prefix ιδιο (idio) was taken (engl.: same, though it could also be from ίδιος = own, or self, which is the meaning of eigen in German). κοκκίδιο (kokkídio) could be seen as a portmanteau of Kokkos and idio, but is in fact the Greek word for granule, so not far off Kokkos itself.
The logo is a stretched/sheared map of a recolouration of the Kokkos logo, with the eigenvectors of that mapping drawn as arrows.
Kokkidio is maintained by the Chair of Water Resources Management and Modelling of Hydrosystems of the Technische Universität Berlin, or wahyd for short (Link). It is distributed under the GPLv3 (Licence text). Licence types for the libraries used in Kokkidio are listed in the LICENCE.README file.